Using IAR Embedded Workbench for ARM and the CMSIS-DSP library

Improve performance of digital signal processing with IAR Embedded Workbench for ARM

ARM Cortex-M3 and Cortex-M4 processors provide instructions that are useful for signal processing. The Cortex-M4 in particular is designed for DSP applications and supports SIMD (Single Instruction, Multiple Data) and MAC (Multiply and Accumulate) instructions. In addition, Cortex-M4F devices have an FPU (floating-point unit) for handling floating-point calculations.

There are several ways to use these instructions, for example through assembler routines or intrinsic functions, but one of the most practical approaches is to use the DSP library of the ARM Cortex Microcontroller Software Interface Standard (CMSIS). The CMSIS-DSP library is designed for Cortex-M processors and provides optimized functions for digital signal processing, such as matrix functions, statistics functions, and advanced math functions.
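
To make the term MAC concrete, here is a minimal portable sketch of a multiply-accumulate loop, a dot product. The helper name dot_product_ref is hypothetical and used only for illustration; it is not part of CMSIS-DSP. The library offers arm_dot_prod_f32 for the same computation, in a form tuned for Cortex-M cores.

```c
#include <stddef.h>

/* Portable reference of a multiply-accumulate (MAC) loop: a dot product.
   Hypothetical helper for illustration only, not CMSIS-DSP code. */
float dot_product_ref(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];   /* one multiply and one accumulate per sample */
    }
    return acc;
}
```

Each loop iteration is exactly the multiply-then-add pattern that a MAC instruction can execute in one step, which is why such loops benefit from the DSP-oriented instructions.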

A prebuilt CMSIS-DSP library and its source code are provided with IAR Embedded Workbench for ARM. In this article, we will take a look at how to use the CMSIS-DSP library together with IAR Embedded Workbench for ARM and how this can improve performance.

Configuring the CMSIS-DSP library

In IAR Embedded Workbench for ARM, you enable the use of the CMSIS-DSP library by first choosing a Cortex-M device, for example the Cortex-M4F device STM32F407ZG.

[Figure: Configuring the CMSIS-DSP library]

Second, select the CMSIS-DSP library option on the General Options > Library Configuration page. This sets the include path for the C preprocessor and imports the prebuilt CMSIS library.

[Figure: Importing the prebuilt CMSIS library]

These settings are all you need to be able to use CMSIS-DSP from IAR Embedded Workbench for ARM.

Simple test for CMSIS-DSP library

Let’s see how to call a CMSIS-DSP function and how it performs. Here we will use the square root function and compare it with the standard math function:

#include <arm_math.h>
#include <math.h>
#include <stdio.h>
int main()
{
  float32_t f_input_cmsis_dsp = 2;
  float32_t f_result_cmsis_dsp;
 
  float f_input = 2;
  float f_result;
 
  /* Using CMSIS-DSP library */
  arm_sqrt_f32(f_input_cmsis_dsp,&f_result_cmsis_dsp);
  printf("f1: %f\n",f_result_cmsis_dsp);
 
  /* Standard math function */
  f_result = sqrt(f_input);
  printf("f2: %f\n",f_result);
 
  return 0;
}

The results are identical and correct.

f1: 1.414214
f2: 1.414214

Next, let’s take a look at the performance.

The CYCLECOUNTER register in IAR Embedded Workbench is useful for checking how many cycles the running code has consumed. The CCSTEP register is handy for checking the number of cycles used by the last performed C/C++ source or assembler step.

[Figure: CPU registers]

Set breakpoints and note the CCSTEP value for the sqrt functions:

[Figure: CCSTEP value]

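In this case, the CMSIS-DSP square root function is roughly 14 times faster than the standard math function. Note that the standard sqrt function takes and returns a double, so part of the gap probably reflects double-precision arithmetic rather than the library alone.

arm_sqrt_f32:   52 cycles
sqrt:          752 cycles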

From this simple example, we can see that CMSIS-DSP is very easy to use and that it improves the performance significantly.

Practical example of FFT

Now, let’s take a look at a more practical example of the CMSIS-DSP library. The Fast Fourier Transform (FFT) is one of the most widely used operations in digital signal processing; it extracts the frequency components of waveform data. IAR Embedded Workbench for ARM includes some CMSIS-DSP demo projects. In the following example, we use an STM32 example project, opened via ST > STM32F4xx > IAR-STM32F407ZG-SK > DSP Lib demo project.

[Figure: STM32 example in IAR Embedded Workbench]

This workspace includes 11 demo projects.

[Figure: Demo projects]

Select arm_fft_bin_example.

[Figure: arm_fft_bin_example]

This project includes arm_fft_bin_data.c, which contains an array describing a 10 kHz signal disturbed by white noise.

[Figure: arm_fft_bin_data.c]

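Because the input to the FFT algorithm must be complex numbers, the array interleaves real and imaginary parts: the 1st, 3rd, 5th, ... elements (array indices 0, 2, 4, ...) hold the actual samples, and the 2nd, 4th, 6th, ... elements (array indices 1, 3, 5, ...) hold the imaginary parts, which should be set to 0.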
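
The interleaved layout can be sketched as follows. This is a minimal illustration, assuming the samples start out as a plain array of real values; the helper name pack_real_as_complex is hypothetical and not part of CMSIS-DSP or the demo project.

```c
#include <stddef.h>

/* Pack n real samples into an interleaved complex buffer of 2*n floats:
   out[2*i] is the real part, out[2*i + 1] is the imaginary part, which
   is zero for real-valued input. The caller provides an output buffer
   with room for 2*n floats. */
void pack_real_as_complex(const float *real_in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = real_in[i];
        out[2 * i + 1] = 0.0f;
    }
}
```

The output buffer is twice as long as the number of samples, which is why the array length and the FFT size differ by a factor of two.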

[Figure: Input signal disturbed with white noise.]

[Figure: FFT result data]

For real-valued input, the FFT magnitude data are always symmetric. The output from the FFT demo contains one dominant frequency component on top of the white noise.

Let’s go back to the main source code and note that it uses four CMSIS-DSP functions.

  arm_status status; 
  arm_cfft_radix4_instance_f32 S;
  float32_t maxValue;     
  status = ARM_MATH_SUCCESS;
    
  /* Initialize the CFFT/CIFFT module */  
  status = arm_cfft_radix4_init_f32(&S, fftSize,ifftFlag, doBitReverse);

  /* Process the data through the CFFT/CIFFT module */
  arm_cfft_radix4_f32(&S, testInput_f32_10khz);
    
  /* Process the data through the Complex Magnitude Module for  
  calculating the magnitude at each bin */
  arm_cmplx_mag_f32(testInput_f32_10khz, testOutput, fftSize);  
    
  /* Calculates maxValue and returns corresponding BIN value */
  arm_max_f32(testOutput, fftSize, &maxValue, &testIndex);

As the comments indicate, the first function initializes the FFT module, the second performs the actual FFT calculation, the third calculates the magnitude of each bin of the complex FFT result, and the fourth finds the maximum value and its index in the output array.

The results exactly match the spreadsheet chart shown earlier.

[Figure: Result]

Now, let’s look at the performance of each function using CCSTEP.

Function                      Cycles
arm_cfft_radix4_init_f32          54
arm_cfft_radix4_f32           100256
arm_cmplx_mag_f32              26913
arm_max_f32                     8744

The total is 135,967 cycles. If the CPU runs at 100 MHz, this takes about 1,360 us. At an audio sampling rate of 44 kHz, capturing 2048 samples takes about 46,545 us (2048 / 44,000 Hz). Compared with that capture time, the DSP processing is very fast.
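
The conversion from cycle counts to time is simple arithmetic, sketched below. The function names cycles_to_us and capture_time_us are hypothetical helpers, and the sketch assumes the core clock is constant with no wait states beyond those already reflected in the measured cycle count.

```c
#include <stdint.h>

/* Execution time in microseconds for a given cycle count and core clock. */
double cycles_to_us(uint64_t cycles, double cpu_hz)
{
    return (double)cycles * 1.0e6 / cpu_hz;
}

/* Time in microseconds needed to capture a block of samples. */
double capture_time_us(uint32_t samples, double sample_rate_hz)
{
    return (double)samples * 1.0e6 / sample_rate_hz;
}
```

For example, cycles_to_us(135967, 100.0e6) gives roughly 1359.67. Comparing that value with capture_time_us for the block size in use shows how much of each capture period is spent on processing.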

Let’s change the core to Cortex-M3 and see how the performance changes.

Function                      Cycles
arm_cfft_radix4_init_f32          54
arm_cfft_radix4_f32          1852707
arm_cmplx_mag_f32             377358
arm_max_f32                    23844

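The total is 2,253,963 cycles. If the CPU runs at 100 MHz, the total time will be about 22,540 us, roughly 16 times longer than on the Cortex-M4F. This shows how well the Cortex-M4 is optimized for DSP applications; for these float32 functions, much of the gain likely comes from the hardware FPU, which the Cortex-M3 lacks.

[Figure: Cortex-M4]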
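
The per-function differences can be derived from the two sets of CCSTEP values quoted in this article. The sketch below only does that arithmetic; the array and function names are hypothetical, and the data are the measurements listed above, in call order.

```c
#include <stddef.h>

/* CCSTEP values quoted in this article, in call order:
   init, cfft_radix4, cmplx_mag, max. */
static const unsigned long m4_cycles[4] = {54UL, 100256UL, 26913UL, 8744UL};
static const unsigned long m3_cycles[4] = {54UL, 1852707UL, 377358UL, 23844UL};

/* Sum of n cycle counts. */
unsigned long total_cycles(const unsigned long *c, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += c[i];
    }
    return sum;
}

/* How many times more cycles the slower measurement took. */
double cycle_ratio(unsigned long slow, unsigned long fast)
{
    return (double)slow / (double)fast;
}
```

With these numbers, the FFT itself shows the largest ratio (about 18.5), the magnitude calculation about 14, and the maximum search about 2.7, while the initialization takes the same 54 cycles on both cores.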

Conclusion

Cortex-M processors provide high-performance instructions, and the Cortex-M4 in particular supports instructions for DSP applications. To get the most out of this performance, consider using IAR Embedded Workbench for ARM together with the CMSIS-DSP library. If you cannot find the function you need in the library, you can also refer to the source code under \arm\CMSIS\DSP_Lib\Source in IAR Embedded Workbench for ARM and create your own library.

This article is written by Kyota Yokoo, Field Applications Engineer at IAR Systems in Tokyo, Japan.

© IAR Systems 1995-2016 - All rights reserved.