ARM NEON Usage Note

December 07, 2018

简介

SIMD, 即Single Instruction Multiple Data（单指令多数据）的并行操作。CPU 在处理向量数据时有它的局限性。CPU的优势在于处理复杂多变的指令，而对于那种大数据量的重复性操作，ARM 为了增加处理效率，增加了这种并行处理模块, 即NEON(Advanced SIMD)。主要是在ARMv7 架构后的处理器使用。

NEON 的主要 components：

NEON register file
NEON integer execute pipeline
NEON single-precision floating-point execute pipeline
NEON load/store and permute pipeline

NEON 指令和 floating-point 指令使用的是相同的 register file。不同于ARM core的register file。此 register file 可以以 32-bit, 64-bit, 128-bit 方式访问。 The contents of the NEON registers are vectors of elements of the same data type. A vector is divided into lanes and each lane contains a data value called an element. 通常NEON 并行操作数量n, 就等于vectors 拆分的lanes（通道）数。例如：

64-bit NEON vectors can contain:

—Eight 8-bit elements.
—Four 16-bit elements.
—Two 32-bit elements.
—One 64-bit element.

128-bit NEON vectors can contain:

—Sixteen 8-bit elements.
—Eight 16-bit elements.
—Four 32-bit elements.
—Two 64-bit elements.

NEON 单元总共的 register file 资源可以视作:

16个 128-bit的 Q(quadword) 寄存器，Q0-Q15。
32个 64-bit 的 D(Doubleword)寄存器，D0~D31。

NEON 指令通常有几种使用方式：

使用预先用NEON 优化好的库, 主要是一些Mechine Learning 和 computer Vision的库。如: Ne10，libyuv, skia 等。
AutoVectorization （自动向量化编译器）
NEON intrinsics
NEON assembly

这里主要说一下 intrinsics 方式。

intrinsics

相较于 assambly模式， intrinsic 更易于使用和记忆

vector datatypeatatype

<type><size>x<number\_of\_lanes>\_t

For example:

int16x4_t is a vector describes a vector of four 16-bit short int value
float32x4_t describes a vector of four 32-bit float values

也可以组成多个vector的数组

<type><size>x<number\_of\_lanes>x<length\_of\_array>\_t

struct int16x4x2_t{
int16x4_t val[2];
};


*   **Prototype of NEON intrinsics**


An additional q flag is provided to specify that the intrinsic operates on 128-bit vectors. For example:

*   vmul\_s16, multiplies two vectors of signed 16-bit values. This compiles to VMUL.I16 d2, d0, d1.
*   vaddl\_u8, is a long add of two 64-bit vectors containing unsigned 8-bit values, resulting in a 128-bit vector of unsigned 16-bit values. This compiles to VADDL.U8 q1, d0, d1.

*   **使用 NEON intrinsics**

Intrinsics that use the ‘q’ suffix usually operate on Q registers. Intrinsics without the ‘q’ suffix usually operate on D registers but some of these intrinsics might use Q registers.  
The examples below show different variants of the same intrinsic.

uint8x8_t vadd_u8(uint8x8_t a, uint8x8_t b);


The intrinsic vadd\_u8 does not have the ‘q’ suffix. In this case, the input and output vectors are 64-bit vectors, which use D registers.

uint8x16_t vaddq_u8(uint8x16_t a, uint8x16_t b);


The intrinsic vaddq\_u8 has the ‘q’ suffix, so the input and output vectors are 128-bit vectors, which use Q registers.

uint16x8_t vaddl_u8(uint8x8_t a, uint8x8_t b);


The intrinsic vaddl\_u8 does not have the ‘q’ suffix. In this case, the input vectors are 64-bit and output vector is 128-bit.

#include <arm_neon.h> uint32x4_t double_elements(uint32x4_t input){ return(vaddq_u32(input, input)); }


* * *

参考资料：  
[ARM NEON](https://developer.arm.com/technologies/neon)  
[A\_neon\_programmers\_guide\_en.pdf](https%3A%2F%2Fstatic.docs.arm.com%2Fden0018%2Fa%2FDEN0018A_neon_programmers_guide_en.pdf)