The AVX Instruction Set

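This is a brief introduction to some basic AVX instructions. SIMD can be used from C/C++ by calling the intrinsic functions declared in <immintrin.h>.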

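History

Parallel programming on x86 is supported by a series of SIMD extensions:

  • MMX, 1996 (announced; first shipped in 1997)
  • SSE, 1999
  • AVX, 2008 (announced; first shipped in 2011)
  • AVX2, 2011 (announced; first shipped in 2013)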

Data types

Data Type   Description
__m128      128-bit vector containing 4 floats
__m128d     128-bit vector containing 2 doubles
__m128i     128-bit vector containing integers
__m256      256-bit vector containing 8 floats
__m256d     256-bit vector containing 4 doubles
__m256i     256-bit vector containing integers

  • integers can be chars, shorts, ints, or longs

Function naming conventions

_mm<bit_width>_<name>_<data_type>

  • <bit_width>: the size of the returned vector; empty for 128-bit, 256 for 256-bit
  • <name>: the operation performed by the intrinsic
  • <data_type>: the data type of the function's primary arguments

Suffix                               Description
ps                                   packed single-precision
pd                                   packed double-precision
epi8/epi16/epi32/epi64               signed integers
epu8/epu16/epu32/epu64               unsigned integers
si128/si256                          unspecified 128-bit/256-bit vector
m128/m128i/m128d, m256/m256i/m256d   input vector types

Example: _mm256_srlv_epi64 operates on 64-bit signed integers (epi64) and returns a 256-bit vector (_mm256).
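To make the naming rule concrete, here is a minimal sketch that calls _mm256_srlv_epi64, which shifts each 64-bit element right by its own count. The function name shift_right_each and the AVX2_FN macro are made up for this illustration. The macro relies on a GCC/Clang attribute so that the file also builds without -mavx2; that is an assumption about the compiler, not part of the intrinsics API.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC or Clang. Enables AVX2 for the marked function only. */
#define AVX2_FN __attribute__((target("avx2")))

/* Hypothetical helper: out[i] = in[i] >> counts[i] (logical shift, zeros shifted in). */
AVX2_FN void shift_right_each(const int64_t in[4], const int64_t counts[4],
                              int64_t out[4])
{
    /* _mm256 + loadu + si256: 256-bit result, unaligned load, untyped integer vector */
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);

    /* _mm256 + srlv + epi64: 256-bit result, variable right shift, 64-bit integers */
    __m256i r = _mm256_srlv_epi64(v, c);

    _mm256_storeu_si256((__m256i *)out, r);
}
```

Calling it with in = {16, 32, 64, 128} and counts = {1, 2, 3, 4} leaves 8 in every element of out.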

Complete example

#include <immintrin.h>
#include <stdio.h>

int main() {

    /* Initialize the two argument vectors */
    __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
    __m256 odds = _mm256_set_ps(1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0);

    /* Compute the difference between the two vectors */
    __m256 result = _mm256_sub_ps(evens, odds);

    /* Display the elements of the result vector (every element is 1.0) */
    float f[8];
    _mm256_storeu_ps(f, result); // copy the 8 elements into a plain array
    printf("%f %f %f %f %f %f %f %f\n",
      f[0], f[1], f[2], f[3], f[4], f[5], f[6], f[7]);

    return 0;
}

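Compile with the -mavx and -mavx2 flags. The example above uses only AVX intrinsics, so -mavx alone is enough for it.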

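Common instructions

Initialization

  • _mm256_setzero_ps: all elements set to zero
  • _mm256_set1_ps: one value copied to all elements
  • _mm256_set_ps: predefined values, listed from the highest element down to the lowest
  • _mm256_setr_ps: predefined values in reversed order, listed from the lowest element up to the highest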
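The difference between the initializers is easiest to see by storing each vector to an array. The following is a minimal sketch; the demo_* function names and the AVX_FN macro are hypothetical, and the macro assumes GCC or Clang (it lets the file build even without -mavx).

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for the marked function only. */
#define AVX_FN __attribute__((target("avx")))

/* All eight elements are 0.0. */
AVX_FN void demo_setzero(float out[8])
{
    _mm256_storeu_ps(out, _mm256_setzero_ps());
}

/* All eight elements are x. */
AVX_FN void demo_set1(float x, float out[8])
{
    _mm256_storeu_ps(out, _mm256_set1_ps(x));
}

/* The first argument of _mm256_set_ps lands in the highest element,
   so out[0] receives the last argument (0.0) and out[7] the first (7.0). */
AVX_FN void demo_set(float out[8])
{
    __m256 v = _mm256_set_ps(7.0f, 6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f, 0.0f);
    _mm256_storeu_ps(out, v);
}

/* _mm256_setr_ps takes the arguments in memory order: out[i] is the i-th argument. */
AVX_FN void demo_setr(float out[8])
{
    __m256 v = _mm256_setr_ps(0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f);
    _mm256_storeu_ps(out, v);
}
```

Both demo_set and demo_setr produce out[i] == i, even though their argument lists are written in opposite order.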

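Memory access

  • _mm256_load_ps: loads 8 floats from an address that must be aligned to 32 bytes
  • _mm256_maskload_ps(address, integer vector): an element is read when the highest bit of its mask element is 1 and set to zero when it is 0
  • aligned_alloc(32, 64 * sizeof(float)): allocates memory aligned to a 32-byte boundary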
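The three items above can be combined in a short sketch. It assumes a C11 library that provides aligned_alloc; the function names and the AVX_FN macro (a GCC/Clang attribute that lets the file build without -mavx) are hypothetical.

```c
#include <immintrin.h>
#include <stdlib.h>

/* Assumption: GCC or Clang. Enables AVX for the marked function only. */
#define AVX_FN __attribute__((target("avx")))

/* Allocates 64 floats on a 32-byte boundary, fills them with 0..63, and
   loads elements 8..15 with an aligned load. Returns 0 on success. */
AVX_FN int demo_aligned_load(float out[8])
{
    float *buf = aligned_alloc(32, 64 * sizeof(float));
    if (buf == NULL)
        return -1;
    for (int i = 0; i < 64; i++)
        buf[i] = (float)i;

    /* buf + 8 is 32 bytes past an aligned address, so it is still aligned. */
    __m256 v = _mm256_load_ps(buf + 8);
    _mm256_storeu_ps(out, v);

    free(buf);
    return 0;
}

/* Reads only the first three floats of src; the other five elements become 0. */
AVX_FN void demo_maskload(const float *src, float out[8])
{
    /* -1 has its highest bit set (read the element), 0 does not (zero it). */
    __m256i mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
    __m256 v = _mm256_maskload_ps(src, mask);
    _mm256_storeu_ps(out, v);
}
```

A masked load is handy for the tail of an array whose length is not a multiple of 8, because the masked-off elements are not read.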

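Arithmetic and logic

  • _mm256_add/sub/mul_ps: element-wise addition, subtraction, and multiplication
  • _mm256_and_ps, _mm256_cmp_ps: bitwise AND, and comparison with a predicate such as _CMP_EQ_OQ (the 256-bit float comparison has no separate cmpeq form)
  • _mm256_hadd/hsub_ps: horizontal addition and subtraction of adjacent pairs of elements
  • _mm256_add/mul/mullo_epi32: integer addition and multiplication (AVX2); mullo keeps the low 32 bits of each product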
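The horizontal and comparison operations are less obvious than plain element-wise arithmetic, so here is a minimal sketch of both. The equality test uses _mm256_cmp_ps with the _CMP_EQ_OQ predicate. The function names and the AVX_FN macro (a GCC/Clang attribute that lets the file build without -mavx) are hypothetical.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for the marked function only. */
#define AVX_FN __attribute__((target("avx")))

/* Horizontal add. Within each 128-bit half the result is
   { a0+a1, a2+a3, b0+b1, b2+b3 }, using that half's elements of a and b. */
AVX_FN void hadd8(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_hadd_ps(va, vb));
}

/* Keeps a[i] where a[i] == b[i] and writes 0 elsewhere. */
AVX_FN void keep_equal8(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);

    /* Each element of the mask is all ones where equal, all zeros otherwise. */
    __m256 mask = _mm256_cmp_ps(va, vb, _CMP_EQ_OQ);

    /* ANDing with the mask passes a[i] through or clears it to 0.0. */
    _mm256_storeu_ps(out, _mm256_and_ps(mask, va));
}
```

With a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {10, 20, 30, 40, 50, 60, 70, 80}, hadd8 writes {3, 7, 30, 70, 11, 15, 110, 150}: the sums do not cross the boundary between the two 128-bit halves.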

Fused multiply-add (FMA)

Compile with -mfma.

  • _mm_fmadd_ps: res = a * b + c for every element
  • _mm_fmadd_ss: res[0] = a[0] * b[0] + c[0]; the upper three elements are copied from a
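A minimal sketch of the packed and scalar forms follows. The function names and the FMA_FN macro are hypothetical; the macro assumes GCC or Clang and lets the file build without -mfma, although the CPU must still support FMA at run time.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables FMA for the marked function only. */
#define FMA_FN __attribute__((target("fma")))

/* Packed form: out[i] = a[i] * b[i] + c[i] for all four elements. */
FMA_FN void fma_packed(const float a[4], const float b[4], const float c[4],
                       float out[4])
{
    __m128 r = _mm_fmadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), _mm_loadu_ps(c));
    _mm_storeu_ps(out, r);
}

/* Scalar form: only out[0] is computed; out[1..3] are copied from a. */
FMA_FN void fma_scalar(const float a[4], const float b[4], const float c[4],
                       float out[4])
{
    __m128 r = _mm_fmadd_ss(_mm_loadu_ps(a), _mm_loadu_ps(b), _mm_loadu_ps(c));
    _mm_storeu_ps(out, r);
}
```

With a = {1, 2, 3, 4}, b = {5, 6, 7, 8}, and c = {0.5, 0.5, 0.5, 0.5}, the packed form gives {5.5, 12.5, 21.5, 32.5} and the scalar form gives {5.5, 2, 3, 4}.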

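Permutation and shuffling

  • _mm256_permute_ps: rearranges the elements within each 128-bit half according to an 8-bit control value
  • _mm256_shuffle_ps: within each 128-bit half, the two low result elements are picked from the first vector and the two high elements from the second, according to an 8-bit control value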
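The control value holds four 2-bit indices, one per result element, and the _MM_SHUFFLE macro builds it from the highest element down to the lowest. The sketch below shows one control value for each intrinsic; the function names and the AVX_FN macro (a GCC/Clang attribute that lets the file build without -mavx) are hypothetical.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for the marked function only. */
#define AVX_FN __attribute__((target("avx")))

/* Reverses the four elements inside each 128-bit half.
   _MM_SHUFFLE(0, 1, 2, 3): result element 3 takes index 0, ..., element 0 takes index 3. */
AVX_FN void reverse_in_halves(const float a[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(a);
    _mm256_storeu_ps(out, _mm256_permute_ps(v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* In each half, takes elements 0 and 1 from a and elements 2 and 3 from b. */
AVX_FN void low_a_high_b(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_shuffle_ps(va, vb, _MM_SHUFFLE(3, 2, 1, 0)));
}
```

With a = {0, 1, 2, 3, 4, 5, 6, 7} and b = {10, 11, 12, 13, 14, 15, 16, 17}, reverse_in_halves writes {3, 2, 1, 0, 7, 6, 5, 4} and low_a_high_b writes {0, 1, 12, 13, 4, 5, 16, 17}. The control value must be a compile-time constant.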

References