Arm Cortex-M3 and Cortex-M4 processors provide instructions that are useful for signal processing, such as multiply-accumulate (MAC). The Cortex-M4 in particular is designed for DSP applications: it adds SIMD (Single Instruction, Multiple Data) instructions and an extended set of MAC instructions. In addition, Cortex-M4F devices have an FPU (floating-point unit) for handling floating-point calculations.
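To make the SIMD and MAC idea concrete, here is a minimal portable C sketch of a dual 16-bit multiply-accumulate, the kind of arithmetic that a Cortex-M4 SIMD MAC instruction such as SMLALD performs in one instruction. The helper names (`pack_q15`, `dual_mac16`, `dot_q15_pairs`) are hypothetical and exist only for illustration; the code models the arithmetic on any C compiler and does not by itself generate DSP instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit samples into one 32-bit word (low half first). */
static uint32_t pack_q15(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Model of a dual 16-bit MAC: acc + a_lo*b_lo + a_hi*b_hi.
   Hypothetical helper; a 64-bit accumulator avoids overflow in this sketch.
   The uint16_t -> int16_t conversions rely on the usual two's-complement
   behaviour of mainstream compilers. */
static int64_t dual_mac16(uint32_t a, uint32_t b, int64_t acc)
{
    int16_t a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(uint16_t)(a >> 16);
    int16_t b_lo = (int16_t)(uint16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(uint16_t)(b >> 16);

    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Dot product of two 16-bit vectors, consuming two samples per step. */
static int64_t dot_q15_pairs(const int16_t *x, const int16_t *y, size_t n)
{
    assert(x != NULL && y != NULL);
    assert(n % 2 == 0); /* this sketch only handles an even sample count */

    int64_t acc = 0;
    for (size_t i = 0; i < n; i += 2) {
        acc = dual_mac16(pack_q15(x[i], x[i + 1]),
                         pack_q15(y[i], y[i + 1]), acc);
    }
    return acc;
}
```

Each loop iteration handles two samples, which is where the throughput gain of a SIMD MAC comes from when the processor executes that step as a single instruction.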

There are several ways to use these instructions, such as assembler routines or intrinsic functions, but one of the most practical approaches is to use the Arm Cortex Microcontroller Software Interface Standard (CMSIS) DSP library. The CMSIS-DSP library is designed for Cortex-M processors and provides optimized functions for digital signal processing, such as matrix functions, statistics functions, and advanced math functions.

A prebuilt CMSIS-DSP library and its source code are provided with IAR Embedded Workbench for Arm. In this article, we will look at how to use the CMSIS-DSP library together with IAR Embedded Workbench for Arm and how this can improve performance.

In IAR Embedded Workbench for Arm, you enable the use of the CMSIS-DSP library by first choosing a Cortex-M device, for example the Arm Cortex-M4F device STM32F407ZG.

Second, select the CMSIS-DSP library option on the *General Options>Library Configuration* page. This adds the CMSIS include path for the C preprocessor and links the prebuilt CMSIS-DSP library.

These settings are all you need to be able to use CMSIS-DSP from IAR Embedded Workbench for Arm.

Let’s see how to call a CMSIS-DSP function and how it performs. Here we use the square root (sqrt) function and compare it with the standard math function:

```c
#include <arm_math.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    float32_t f_input_cmsis_dsp = 2;
    float32_t f_result_cmsis_dsp;
    float f_input = 2;
    float f_result;

    /* Using CMSIS-DSP library */
    arm_sqrt_f32(f_input_cmsis_dsp, &f_result_cmsis_dsp);
    printf("f1: %f\n", f_result_cmsis_dsp);

    /* Standard math function */
    f_result = sqrt(f_input);
    printf("f2: %f\n", f_result);

    return 0;
}
```

The results are identical and correct:

`f1: 1.414214`

`f2: 1.414214`

Next, let’s take a look at the performance.

The CYCLECOUNTER register in the IAR Embedded Workbench debugger is useful for checking how many cycles the running code has consumed. The CCSTEP register is handy for checking the number of cycles spent in the most recent C/C++ source or assembler step.
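CCSTEP is read in the debugger. If cycle counts are instead captured in code from a free-running 32-bit counter (for example the DWT cycle counter found on many Cortex-M3/-M4 devices; that source is an assumption, not something this article uses), the bookkeeping could look like the sketch below. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two snapshots of a free-running 32-bit counter.
   Unsigned subtraction is modulo 2^32, so one wraparound between the
   snapshots is handled correctly. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a given core clock in Hz. */
static double cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0u);
    return (double)cycles * 1e6 / (double)core_hz;
}
```

For example, `elapsed_cycles(0xFFFFFFF0u, 0x00000024u)` yields 52 even though the counter wrapped between the two readings.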

Set breakpoints and note the CCSTEP value for the sqrt functions:

In this case, the CMSIS-DSP sqrt function is more than 10 times faster than the standard math function (52 cycles versus 752, roughly 14 times):

| Function | Cycles |
| --- | --- |
| arm_sqrt_f32 | 52 |
| sqrt | 752 |

From this simple example, we can see that CMSIS-DSP is very easy to use and that it improves the performance significantly.

Now, let’s look at a more practical example of the CMSIS-DSP library. The Fast Fourier Transform (FFT) is one of the most widely used operations in digital signal processing; it extracts the frequency components from waveform data. IAR Embedded Workbench for Arm includes several CMSIS-DSP demo projects. In the following example, we use an STM32 example project, opened via *ST>STM32F4xx> IAR-STM32F407ZG-SK>DSP Lib demo project*.

This workspace includes 11 demo projects.

Select arm_fft_bin_example.

This project includes arm_fft_bin_data.c, which contains an array describing a 10 kHz signal disturbed with white noise.

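The input to the FFT algorithm must be complex numbers, so the array interleaves real and imaginary parts: the odd-numbered elements (first, third, and so on) hold the actual sample data, and the even-numbered elements (second, fourth, and so on) hold the imaginary parts, which should be set to 0.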
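The interleaved layout described above can be sketched as follows. The helper name `pack_real_as_complex` is hypothetical; the demo project ships its data already in this layout, so this is only an illustration of how real samples would be prepared.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n real samples into an interleaved complex buffer of 2*n floats:
   out[2*i] is the real part, out[2*i + 1] is the imaginary part (zero).
   Note that C arrays are zero-based, so the "first" element is index 0. */
static void pack_real_as_complex(const float *samples, size_t n, float *out)
{
    assert(samples != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[2 * i]     = samples[i];
        out[2 * i + 1] = 0.0f;
    }
}
```

The output buffer is twice as long as the number of samples, which is why a buffer of 2N floats feeds an N-point complex FFT.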

Input signal disturbed with white noise.

For real-valued input, the FFT magnitude result is always symmetric. The output from the FFT demo contains one specific frequency component as well as white noise.

Going back to the main source code, note that it uses four CMSIS-DSP functions:

```c
arm_status status;
arm_cfft_radix4_instance_f32 S;
float32_t maxValue;

status = ARM_MATH_SUCCESS;

/* Initialize the CFFT/CIFFT module */
status = arm_cfft_radix4_init_f32(&S, fftSize, ifftFlag, doBitReverse);

/* Process the data through the CFFT/CIFFT module */
arm_cfft_radix4_f32(&S, testInput_f32_10khz);

/* Process the data through the Complex Magnitude Module for
   calculating the magnitude at each bin */
arm_cmplx_mag_f32(testInput_f32_10khz, testOutput, fftSize);

/* Calculates maxValue and returns corresponding BIN value */
arm_max_f32(testOutput, fftSize, &maxValue, &testIndex);
```

As the comments indicate, the first function initializes the FFT module, the second performs the actual FFT calculation, the third calculates the magnitude of each bin from the complex FFT result, and the fourth finds the maximum value and its index in the output array.

The results match the spreadsheet chart shown earlier exactly.

Now, let’s see the performance of each function with CCSTEP.

| Function | Cycles |
| --- | --- |
| arm_cfft_radix4_init_f32 | 54 |
| arm_cfft_radix4_f32 | 100256 |
| arm_cmplx_mag_f32 | 26913 |
| arm_max_f32 | 8744 |

The total is 135,967 cycles. If the CPU runs at 100 MHz, this takes about 1,360 µs. At an audio sampling rate of 44 kHz, acquiring 2048 samples takes about 46,500 µs. Compared with that, the DSP processing is very fast: it needs only about 3% of the time it takes to capture the data.
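Working the figures through explicitly, using only the cycle counts, clock rate, and sampling rate quoted above (exact division gives a slightly longer capture window than rounding the sample period down to 22 µs would):

```latex
\begin{align*}
t_{\text{proc}}   &= \frac{135\,967\ \text{cycles}}{100 \times 10^{6}\ \text{cycles/s}}
                   \approx 1359.7\ \mu\text{s} \\
t_{\text{window}} &= \frac{2048\ \text{samples}}{44\,000\ \text{samples/s}}
                   \approx 46\,545\ \mu\text{s} \\
\frac{t_{\text{proc}}}{t_{\text{window}}} &\approx \frac{1359.7}{46\,545} \approx 0.029
\end{align*}
```

In other words, processing one block occupies roughly 3% of the time spent acquiring it, leaving the rest of the CPU budget free for other work.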

Let’s change the core to Cortex-M3 and see how the performance changes.

| Function | Cycles |
| --- | --- |
| arm_cfft_radix4_init_f32 | 54 |
| arm_cfft_radix4_f32 | 1852707 |
| arm_cmplx_mag_f32 | 377358 |
| arm_max_f32 | 23844 |

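The total is 2,253,963 cycles. If the CPU runs at 100 MHz, this takes about 22,540 µs, roughly 16.6 times longer than on the Cortex-M4F. The Cortex-M3 has no FPU, so the floating-point arithmetic in these functions is carried out in software. This shows how well the Cortex-M4 is optimized for DSP applications.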

Cortex-M processors provide high-performance instructions, and the Cortex-M4 in particular supports instructions for DSP applications. To get the most out of them, consider using IAR Embedded Workbench for Arm together with the CMSIS-DSP library. If the library does not contain the function you need, you can also refer to the source code under *\arm\CMSIS\DSP_Lib\Source* in IAR Embedded Workbench for Arm and create your own library.

*This article is written by Kyota Yokoo, Field Applications Engineer at IAR Systems in Tokyo, Japan.*