Introduction to Neon

Introduction

Neon is the SIMD (Single Instruction, Multiple Data) extension to the Arm architecture, also known as Advanced SIMD, developed by Arm (formerly ARM Holdings). It is designed to accelerate multimedia and compute-intensive tasks on mobile devices and other embedded systems.

Neon instructions operate on 64-bit or 128-bit vector registers, each holding several elements of the same type (for example, sixteen 8-bit integers or four 32-bit floats in a 128-bit register). A single instruction applies the same operation to every element at once. This can significantly improve the performance of multimedia tasks such as image and video processing, as well as computational tasks such as cryptography and physics simulations.
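To make the idea of "multiple data elements" concrete, here is a minimal sketch that computes how many elements (lanes) of a given size fit in one 128-bit vector. The names NEON_VECTOR_BYTES and neon_lanes are hypothetical helpers invented for this illustration; they are not part of any Neon header.

```c
#include <stddef.h>
#include <stdint.h>

/* A 128-bit Neon vector is 16 bytes wide (hypothetical constant name). */
enum { NEON_VECTOR_BYTES = 16 };

/* Number of elements (lanes) of the given size that fit in one
   128-bit vector. Hypothetical helper, for illustration only. */
static size_t neon_lanes(size_t elem_size) {
  return NEON_VECTOR_BYTES / elem_size;
}
```

For example, neon_lanes(sizeof(float)) is 4, so one 128-bit instruction can process four floats, while neon_lanes(sizeof(uint8_t)) is 16.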

Intrinsic functions

Neon instructions can be embedded into C source code using intrinsic functions. Intrinsic functions are declared in the arm_neon.h header and look like ordinary C functions, but the compiler maps each one to one or a few Neon instructions. This allows you to write C code that takes advantage of Neon without writing assembly by hand, while the compiler still handles register allocation and instruction scheduling.

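Here are some frequently used families of Neon intrinsic functions. Each name takes a suffix for the element type, and a q marks the 128-bit form (for example, vaddq_f32 adds two vectors of four floats, while vadd_f32 adds two 64-bit vectors of two floats):

vadd: Adds two vectors element by element
vmul: Multiplies two vectors element by element
vmin: Takes the element-wise minimum of two vectors
vmax: Takes the element-wise maximum of two vectors
vhadd: Adds two integer vectors and halves each sum, giving the element-wise average (vrhadd rounds the result instead of truncating it)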
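As a rough illustration of how the minimum and maximum intrinsics work on pairs of vectors, the sketch below clamps an array of floats to a range. It assumes a GCC or Clang style compiler that defines the __ARM_NEON macro when Neon is available; the function name clamp_f32 is made up for this example. On other targets it falls back to a plain scalar loop.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Clamp every element of x to the range [lo, hi], in place.
   clamp_f32 is a hypothetical name used only for this sketch. */
void clamp_f32(float *x, size_t n, float lo, float hi) {
  size_t i = 0;
#if defined(__ARM_NEON)
  float32x4_t vlo = vdupq_n_f32(lo);   /* broadcast lo to all 4 lanes */
  float32x4_t vhi = vdupq_n_f32(hi);   /* broadcast hi to all 4 lanes */
  for (; i + 4 <= n; i += 4) {
    float32x4_t v = vld1q_f32(x + i);  /* load 4 floats */
    v = vmaxq_f32(v, vlo);             /* lane-wise max(v, lo) */
    v = vminq_f32(v, vhi);             /* lane-wise min(v, hi) */
    vst1q_f32(x + i, v);               /* store 4 floats */
  }
#endif
  /* Scalar tail (or the whole array when Neon is unavailable). */
  for (; i < n; i++) {
    if (x[i] < lo) x[i] = lo;
    if (x[i] > hi) x[i] = hi;
  }
}
```

Note that vminq_f32 and vmaxq_f32 each take two vectors and compare them lane by lane; they do not reduce a single vector to one value.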

Here is an example of how to use a Neon intrinsic function in C code:

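#include <arm_neon.h>

void add_vectors(const float *a, const float *b, float *c, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    float32x4_t va = vld1q_f32(a + i);    /* load 4 floats from a */
    float32x4_t vb = vld1q_f32(b + i);    /* load 4 floats from b */
    vst1q_f32(c + i, vaddq_f32(va, vb));  /* add all 4 lanes and store */
  }
  for (; i < n; i++) {
    c[i] = a[i] + b[i];                   /* scalar tail */
  }
}

This function adds two arrays of floats and stores the result in a third array. The vaddq_f32 intrinsic adds four floats at a time, while vld1q_f32 and vst1q_f32 load and store the 128-bit vectors. The second loop handles the elements left over when n is not a multiple of four.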

Assembly instructions

Neon instructions can also be written directly as inline assembly. Assembly gives you direct control over which instructions are issued and which registers they use, which intrinsic functions leave to the compiler. However, assembly is more difficult to write and maintain, and it is not as portable as intrinsic functions: the syntax differs between 32-bit and 64-bit Arm and between compilers.

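Here is an example of how to use a Neon assembly instruction in C code:

#include <arm_neon.h>

/* 32-bit Arm (AArch32) with GCC or Clang only. */
void add_vectors_asm(const float *a, const float *b, float *c, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    float32x4_t va = vld1q_f32(a + i);
    float32x4_t vb = vld1q_f32(b + i);
    float32x4_t vc;
    /* "w" puts each operand in a Neon register; %q prints its Q-register name */
    asm volatile("vadd.f32 %q0, %q1, %q2" : "=w"(vc) : "w"(va), "w"(vb));
    vst1q_f32(c + i, vc);
  }
  for (; i < n; i++) c[i] = a[i] + b[i];  /* scalar tail */
}

This function adds two arrays of floats and stores the result in a third array. The vadd.f32 assembly instruction adds four floats at a time; the operands must live in Neon registers (the "w" constraint), not in general-purpose registers. The vadd.f32 mnemonic exists only on 32-bit Arm; on 64-bit Arm (AArch64) the equivalent instruction is written fadd with .4s register suffixes.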

Neon instructions can be embedded into C source code by using either intrinsic functions or inline assembly. Intrinsic functions are easier to use and more portable, while inline assembly gives direct control over instruction selection and register use at the cost of being tied to one architecture and compiler.
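The portability cost of assembly can be sketched as follows. This is only an illustration, assuming GCC or Clang and their __ARM_NEON and __aarch64__ predefined macros; the function name add4_f32 is hypothetical. The same four-float addition needs a different instruction on 64-bit Arm, on 32-bit Arm, and a plain C fallback everywhere else.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds four floats: c[0..3] = a[0..3] + b[0..3].
   add4_f32 is a hypothetical name used only for this sketch. */
void add4_f32(const float *a, const float *b, float *c) {
#if defined(__ARM_NEON) && defined(__aarch64__)
  /* 64-bit Arm: the instruction is fadd on .4s (four single) registers. */
  float32x4_t va = vld1q_f32(a), vb = vld1q_f32(b), vc;
  __asm__("fadd %0.4s, %1.4s, %2.4s" : "=w"(vc) : "w"(va), "w"(vb));
  vst1q_f32(c, vc);
#elif defined(__ARM_NEON)
  /* 32-bit Arm: the instruction is vadd.f32 on Q registers. */
  float32x4_t va = vld1q_f32(a), vb = vld1q_f32(b), vc;
  __asm__("vadd.f32 %q0, %q1, %q2" : "=w"(vc) : "w"(va), "w"(vb));
  vst1q_f32(c, vc);
#else
  /* No Neon: plain C fallback. */
  for (size_t i = 0; i < 4; i++) c[i] = a[i] + b[i];
#endif
}
```

With intrinsics, both Arm branches collapse into a single call to vaddq_f32, which the compiler translates to the right instruction for the target.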

Comparison

Neon code is often significantly faster than equivalent scalar code. Scalar instructions operate on a single data element at a time, while Neon instructions operate on multiple data elements at the same time. This can lead to significant performance improvements, especially for multimedia and computational tasks that apply the same operation to large amounts of data. The actual gain depends on the element size, on memory bandwidth, and on how much of the scalar code the compiler already vectorizes automatically.
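One way to check the difference on your own hardware is to time a scalar loop against a Neon loop that does the same work. The sketch below is only an illustration under a few assumptions: the function names sum_scalar, sum_neon and time_sum are invented here, the __ARM_NEON macro is used to detect Neon, and on targets without Neon the "Neon" version simply calls the scalar one. An optimizing compiler may also auto-vectorize the scalar loop, which narrows the gap.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar baseline: one element per iteration. */
uint32_t sum_scalar(const uint32_t *x, size_t n) {
  uint32_t s = 0;
  for (size_t i = 0; i < n; i++) s += x[i];
  return s;
}

/* Neon version: four elements per iteration where Neon is available. */
uint32_t sum_neon(const uint32_t *x, size_t n) {
#if defined(__ARM_NEON)
  uint32x4_t acc = vdupq_n_u32(0);       /* four running partial sums */
  size_t i = 0;
  for (; i + 4 <= n; i += 4) acc = vaddq_u32(acc, vld1q_u32(x + i));
  /* Combine the four lanes into one value. */
  uint32_t s = vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1) +
               vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
  for (; i < n; i++) s += x[i];          /* scalar tail */
  return s;
#else
  return sum_scalar(x, n);               /* no Neon on this target */
#endif
}

/* CPU time in seconds taken by `reps` calls of f on the same input. */
double time_sum(uint32_t (*f)(const uint32_t *, size_t),
                const uint32_t *x, size_t n, int reps) {
  volatile uint32_t sink = 0;            /* keeps the calls from being removed */
  clock_t t0 = clock();
  for (int r = 0; r < reps; r++) sink += f(x, n);
  (void)sink;
  return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Dividing time_sum(sum_scalar, ...) by time_sum(sum_neon, ...) for a large array and many repetitions gives a measured speedup for that particular loop, processor, and set of compiler flags.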

The following table gives rough speedups of Neon code relative to a scalar baseline (1x); actual gains depend on the processor, the compiler, and the workload:

Task	Scalar code (baseline)	Neon code
Image convolution	1x	10x
Video encoding	1x	2x
Physics simulation	1x	3x

As you can see, Neon can significantly improve the performance of a wide variety of tasks. If you are developing code for a mobile device or other embedded system, consider using Neon in the performance-critical parts of your code.