Automatic NEON vectorization

Using IAR Embedded Workbench for ARM compiler vectorization to enhance DSP performance for ARM Cortex-A NEON units

IAR Systems introduced automatic vectorization compiler support for NEON technology in version 7.10 of IAR Embedded Workbench for ARM. This article focuses on how to take advantage of automatic vectorization in your next ARM Cortex-A design that includes integrated NEON technology.


In the never-ending quest for increased performance and lower power, semiconductor chip manufacturers continue to push the boundaries of Moore’s Law by migrating to smaller silicon geometries. However, because of limited battery life on mobile devices, power consumption has increasingly been recognized as the key factor limiting the maximum processor operating speed at a given semiconductor manufacturing process. This is due to the non-linear relationship between power and frequency at the faster manufacturing process corners, where higher operating frequencies result in an exponential increase in power. To overcome this issue, semiconductor IP suppliers and chip manufacturers have instead sought to implement various micro-architectural enhancements that significantly increase the performance of a processor core at lower operating speeds, with minimal impact on both power consumption and die size. One such enhancement is the trend toward more data parallelism and vector processing through the use of SIMD computation units.

SIMD stands for Single Instruction, Multiple Data. The key concept behind SIMD is that many sequential computations can be combined and executed in parallel using specific machine instructions that operate on multiple data paths (registers/memory) simultaneously. Parallel processing using SIMD is not a new concept; it was used as early as the mid-1970s on the Cray-1 supercomputer. Traditional microprocessor architectures, by contrast, include only one arithmetic logic unit that sequentially executes one instruction on one pair of operands at a time. These processors are referred to as SISD (Single Instruction, Single Data) architectures. SIMD instructions have the potential to provide significant performance improvements in math-intensive digital signal processing and multimedia applications.

Overview of ARM Cortex-A NEON

The ARMv7 architecture introduced the optional Advanced SIMD (NEON) extension for the ARMv7-A and ARMv7-R profiles. NEON is a wide 64/128-bit SIMD data processing architecture that defines groups of instructions operating on multiple data elements in parallel with a single instruction, resulting in accelerated performance for digital signal processing applications (Figure 1).

These instructions operate on vector data elements stored in 64-bit doubleword (D) vector registers and 128-bit quadword (Q) vector registers. The vector elements in a register are all of the same data type, which can be signed or unsigned 8-bit, 16-bit, 32-bit, or 64-bit integer data. NEON also supports 32-bit single-precision floating point.

Figure 1: ARM Cortex-A’s NEON Unit

The NEON register bank provides a 256-byte (32 x 64-bit) register file, distinct from the core registers. There are two explicitly aliased views of the same register bank (Figure 2): the NEON unit sees it as either 32 x 64-bit doubleword registers (D0-D31) or 16 x 128-bit quadword registers (Q0-Q15). The vector instructions determine the appropriate register usage, so the software does not need to explicitly switch views or move data between registers.

Figure 2: NEON Register Bank

Vector operations can be performed on 2, 4, 8, or 16 elements in parallel, depending on the data type and register size. For example, operations on 32-bit integers process 4 data elements in parallel, operations on 16-bit integers process 8 data elements in parallel, and operations on 8-bit data process 16 data elements in parallel. Some example NEON vector assembly language instructions are shown in Figure 3.

Figure 3: Example Vector Multiply Instructions with 32-bit Fixed and Floating Point Data

Compiler vectorization in IAR Embedded Workbench for ARM

Algorithm developers can take advantage of NEON instructions in a variety of ways, either by writing hand-coded assembly modules or by using the C compiler intrinsics provided by ARM. However, intrinsics are often difficult for algorithm experts to use, and they result in source code that is not portable between different processor architectures and C compilers. To make it easier to take advantage of NEON, vectorization support was introduced in version 7.10 of IAR Embedded Workbench for ARM. Vectorization lets the compiler exploit NEON vector instructions without you needing to worry about low-level vector assembly language or the portability of your code. Vectorization is also often referred to as automatic vectorization or auto-vectorization by different compiler vendors.
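For comparison, here is what a hand-coded intrinsics version of the simple array addition used later in this article could look like. This is only a sketch, assuming ARM’s arm_neon.h intrinsics and an illustrative function name; note how the source becomes tied to NEON-capable targets and to compilers that support these intrinsics:

#include <arm_neon.h>
#include <stdint.h>

/* Add two 256-element integer arrays, four elements per step. */
void add_arrays_intrinsics(int32_t *a, int32_t *b, int32_t *c)
{
   int i;
   for (i = 0; i < 256; i += 4)
   {
      int32x4_t va = vld1q_s32(&a[i]);      /* load a[i..i+3] */
      int32x4_t vb = vld1q_s32(&b[i]);      /* load b[i..i+3] */
      vst1q_s32(&c[i], vaddq_s32(va, vb));  /* add and store c[i..i+3] */
   }
}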

What is vectorization?

Vectorization is a speed optimization in which computations are performed on several values simultaneously, by converting a scalar implementation, which operates on a single pair of operands at a time, into a vector implementation, which performs a single operation on multiple pairs of operands at once. A vector is a set of scalar data items, all of the same type, stored in memory. Vector processing occurs when arithmetic and logical operations are applied to vectors. This conversion from scalar code to vector code by the compiler is called vectorization. During this conversion, the compiler first analyzes the source code to determine whether a loop can be mapped to a vector algorithm. It then transforms the scalar loop into a sequence of vector operations that perform arithmetic computations on a block of elements.

Consider the following function, which adds two arrays a[ ] and b[ ] together into a third array c[ ]:

int a[256], b[256], c[256];

void add_arrays(void)
{
   int i;
   for (i = 0; i < 256; ++i)
   {
      c[i] = a[i] + b[i];
   }
}

The for loop shown above simply performs a scalar addition operation using three 256-element arrays. By default, with no vectorization (or any other compiler optimization) enabled, the compiler generates assembly instructions along the following lines to add the array elements together: element a[i] is loaded into register R4 and added to element b[i], which was loaded into register R5; the result for c[i] is then stored from register R4.
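A representative sketch of such a scalar loop is shown below; the register usage follows the description above, but the actual compiler output may differ in register allocation and addressing modes:

; Representative scalar loop (illustrative only).
; Assume R1 -> a, R2 -> b, R0 -> c, R3 = 256 (loop counter).
loop:
    LDR   R4, [R1], #4    ; load a[i] into R4, advance pointer
    LDR   R5, [R2], #4    ; load b[i] into R5, advance pointer
    ADD   R4, R4, R5      ; R4 = a[i] + b[i]
    STR   R4, [R0], #4    ; store result to c[i], advance pointer
    SUBS  R3, R3, #1      ; decrement loop counter
    BNE   loop            ; repeat for all 256 elements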

The above function executes without NEON instructions in 1541 cycles:

1541 cycles total = (256 loop iterations x 6 instructions/iteration) + 5 loop setup cycles

Now let’s see what happens when we enable vectorization in the code example above and how much of a speed improvement can be achieved. After enabling vectorization and recompiling, the generated assembly code looks like this:
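A representative sketch of the vectorized loop is shown below; again, the actual compiler output may differ in detail:

; Representative vectorized loop (illustrative only).
; Assume R1 -> a, R2 -> b, R0 -> c, R3 = 64 (loop counter).
loop:
    VLD1.32  {D0, D1}, [R1]!   ; load a[i..i+3] into Q0 (D0:D1)
    VLD1.32  {D2, D3}, [R2]!   ; load b[i..i+3] into Q1 (D2:D3)
    VADD.I32 Q0, Q0, Q1        ; four 32-bit additions in parallel
    VST1.32  {D0, D1}, [R0]!   ; store results to c[i..i+3]
    SUBS     R3, R3, #1        ; decrement loop counter
    BNE      loop              ; 64 iterations in total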

Our regular arithmetic instructions using Rx registers have been replaced by vector instructions operating on the NEON unit’s Dx and Qx registers. In this case, the vector instructions load, add, and store four array elements simultaneously. The first loop iteration adds the first four element pairs, held in Q0 and Q1, back into the Q0 result register:

c0 = a0 + b0
c1 = a1 + b1
c2 = a2 + b2
c3 = a3 + b3

The second loop iteration adds the next four element pairs, again held in Q0 and Q1, back into the Q0 result register:

c4 = a4 + b4
c5 = a5 + b5
c6 = a6 + b6
c7 = a7 + b7

And so on. The dataflow of the VADD.I32 instruction for the first loop iteration is shown in Figure 4.

Figure 4. Vector Addition Instruction of Four 32-bit Integer Data Elements

Note that in this case the loop count was reduced from 256 iterations (without vectorization) to 64 (with vectorization). This time, the 256 array elements are summed together in only 388 cycles.

388 cycles total = (64 loop iterations x 6 instructions/iteration) + 4 loop setup cycles

So we have roughly a 4x performance improvement (1541/388) with our simple addition algorithm operating on 32-bit integer data. But what if we only need 16 bits of precision in our calculations? Let’s find out by changing our arrays of ints to arrays of shorts.

short a[256], b[256], c[256];

void add_arrays(void)
{
   int i;
   for (i = 0; i < 256; ++i)
   {
      c[i] = a[i] + b[i];
   }
}

By this simple source code change, halving our data size, we double the number of vector elements processed by each vector instruction. The compiled assembly instructions confirm this optimization:
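A representative sketch of the 16-bit vectorized loop follows (illustrative only; the actual output may differ):

; Assume R1 -> a, R2 -> b, R0 -> c, R3 = 32 (loop counter).
loop:
    VLD1.16  {D0, D1}, [R1]!   ; load a[i..i+7] into Q0 (eight shorts)
    VLD1.16  {D2, D3}, [R2]!   ; load b[i..i+7] into Q1
    VADD.I16 Q0, Q0, Q1        ; eight 16-bit additions in parallel
    VST1.16  {D0, D1}, [R0]!   ; store results to c[i..i+7]
    SUBS     R3, R3, #1        ; decrement loop counter
    BNE      loop              ; 32 iterations in total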

In this case, the vector instructions load, add, and store eight array elements simultaneously. The first loop iteration adds the following element pairs into the Q0 result register:

c0 = a0 + b0
c1 = a1 + b1
c2 = a2 + b2
c3 = a3 + b3
c4 = a4 + b4
c5 = a5 + b5
c6 = a6 + b6
c7 = a7 + b7

The second loop iteration adds the next eight element pairs into the Q0 result register:

c8 = a8 + b8
c9 = a9 + b9
c10 = a10 + b10
c11 = a11 + b11
c12 = a12 + b12
c13 = a13 + b13
c14 = a14 + b14
c15 = a15 + b15

And so on. The dataflow of the VADD.I16 instruction for the first loop iteration is shown in Figure 5 below.

Since each vector instruction now transfers eight 16-bit elements and performs eight additions at a time, the loop count is reduced even further, from 256 iterations (without vectorization) down to 32 (with vectorization). The 256-element array summation executes in only 196 cycles.

196 cycles total = (32 loop iterations x 6 instructions/iteration) + 4 loop setup cycles

These speed improvements by a factor of 4x (32-bit integer math) and 8x (16-bit integer math) were achieved with no complicated source code changes to take advantage of the ARM NEON extensions. With the vectorizing compiler in IAR Embedded Workbench, you can maintain 100% portable code without resorting to compiler intrinsics or #pragmas to utilize NEON vector instructions. The compiler simply handles the dirty work of optimizing your code with NEON vector instructions.

Enabling vectorization in IAR Embedded Workbench

Adding auto-vectorization support for your Cortex-A NEON design is easy to do from the IDE project options in IAR Embedded Workbench for ARM, version 7.10. Let’s see how this is done.

Enabling vectorization from the IDE Project Options

First, the particular Cortex-A core must support the NEON extension. This depends on the chip manufacturer’s integration of the NEON/vector floating-point unit for a given Cortex-A device, since it is an optional block on some cores, such as the Cortex-A5 and Cortex-A9, and always integrated on others, such as the Cortex-A8. To let the compiler know that vectorization is available, we first tell it that VFPv3 and NEON are supported on the device. This option is set under Project > Options > General Options > FPU > VFPv3 + NEON, as seen in Figure 6. For certain Cortex-A cores, this FPU setting is the default.

Figure 6. Enabling Vector Floating Point Support Under Embedded Workbench Project Options

The second step is to go to the C/C++ Compiler category in the IDE project options and enable vectorization on the Optimizations tab: Project > Options > C/C++ Compiler > Optimizations > Enable transformations > Vectorize (checkbox selected), as shown in Figure 7. This selection enables the generation of NEON vector assembly language instructions from C source for loops. The equivalent command line option is --vectorize.

Important note: Loops will only be vectorized if the processor has NEON capability and the project options are configured for a High (Speed) optimization level (or if the -Ohs command line option is used).

Figure 7. Enabling NEON Vectorization Under Embedded Workbench Project Options
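For command-line builds, a minimal invocation could look like the following sketch, using the -Ohs and --vectorize options mentioned above. The --cpu value is an example for a NEON-capable core; the exact option value for selecting the VFPv3 + NEON unit depends on the compiler version, so consult the IAR C/C++ Development Guide:

iccarm add_arrays.c --cpu=Cortex-A9 -Ohs --vectorize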

Enabling vectorization at the source file level

The C/C++ compiler in IAR Embedded Workbench for ARM also allows finer-grained control of vectorization within a C source file through pragma directives, where vectorization can be enabled or disabled immediately before a specific function or a specific loop, as follows:


Immediately before a function

  • #pragma optimize = vectorize
  • #pragma optimize = no_vectorize

Immediately before a specific loop

  • #pragma vectorize
  • #pragma vectorize = never

For example, the following pragma directive will enable NEON vector instructions for the loop:

#pragma vectorize
for (i = 0; i < 1024; ++i)
   a[i] = b[i] * c[i];
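Similarly, the function-level pragma from the list above can be applied to a whole function. The sketch below uses an illustrative function; the high-speed optimization requirement noted earlier still applies:

#pragma optimize = vectorize
void scale_array(int *dst, const int *src, int k)
{
   int i;
   for (i = 0; i < 1024; ++i)   /* loops in this function are now
                                   candidates for vectorization */
      dst[i] = src[i] * k;
}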


The NEON extensions to ARM Cortex-A make ARM processors strong candidates for digital signal processing applications that were once the domain of dedicated DSPs. The benefits of NEON for low-power multimedia applications are clear. The power consumption of the NEON unit is comparable to that of the integer unit on the ARM Cortex core, so it offers more performance with less die area and total power than adding more ARM cores to the chip. With the ability to process data in parallel, we can also reduce processor bandwidth utilization and allocate more resources to other tasks. The overall system therefore has better response time and better battery lifetime, and even offers the potential to reduce PCB space and overall cost when migrating to a NEON-powered device.

The ability to use a vectorizing compiler to accelerate DSP development can make an ARM NEON-based product even more appealing for your next design. Vectorization in IAR Embedded Workbench for ARM is very easy to use. It lets you maintain code portability, with no need to resort to compiler intrinsics or hand-coded assembly to use the NEON unit. Vectorization offers a less complex design, giving you shorter development time and faster time to market. Let the compiler do the work for you!



This article was written by John Tomarakos, Field Applications Engineer, IAR Systems.

© IAR Systems 1995-2016 - All rights reserved.