Challenge of Data Layout in High-Level Synthesis

This blog post briefly discusses a recent “bug” I found in High-Level Synthesis (HLS), which highlights the challenge of writing efficient code with HLS.

Recently I have been working on a project that maps neural networks with tens of layers onto FPGAs. I encountered an issue – the HLS tool always generates two different instances for two calls to the same function, without giving any warning, even though I had already specified a resource limit. To figure out the reason, I wrote a simple example consisting of two kernels to see whether HLS behaves as expected.

This example cascades two GEMM kernels as follows. We pipeline the middle j loop, so the innermost k loop is automatically unrolled. As a result, arr_A and arr_B should also be partitioned so that the 32 multiply-accumulate operations can access them in parallel. This kernel has no obvious problems and can easily be verified to achieve II=1.

void gemm(int arr_A[32][32],
          int arr_B[32][32],
          int arr_C[32][32]) {
  #pragma HLS array_partition variable=arr_A complete dim=2
  #pragma HLS array_partition variable=arr_B complete dim=1
  l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
      #pragma HLS pipeline II=1
      l_k: for (int k = 0; k < 32; k++) {
        arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
      }
    }
  }
}

void top(
  int A[32][32],
  int B[32][32],
  int D[32][32],
  int E[32][32]
) {
  #pragma HLS ALLOCATION instances=gemm limit=1 function
  int C[32][32];
  gemm(A, B, C);
  gemm(C, D, E);
}

However, something weird happens – HLS generates two different instances for this GEMM kernel, and one of them is significantly slower with II=2. The results are listed below (synthesized by Vivado HLS 2019 [1]).

| Kernel | Latency (cycles) | Iteration Latency | II | DSP | FF | LUT |
| ------ | ---------------- | ----------------- | -- | --- | -- | --- |
| gemm   | 2049 | 4 | 2 | 96 | 2443 | 2530 |
| gemm_1 | 1030 | 6 | 1 | 96 | 3619 | 1968 |

So what makes the difference between these two function calls? There is actually one implicit rule that is not written in our code – array C is also partitioned. In the second call, C is passed as the first argument, so it is completely partitioned along the 2nd dimension. In other words, all three arrays in the first call are partitioned, while only C and D are partitioned in the second call. To test whether this unnecessary array partitioning harms performance, I tried the following kernel with explicit partitioning on arr_C, and it reproduces exactly the II=2 result.

void gemm(int arr_A[32][32],
          int arr_B[32][32],
          int arr_C[32][32]) {
  #pragma HLS array_partition variable=arr_A complete dim=2
  #pragma HLS array_partition variable=arr_B complete dim=1
  #pragma HLS array_partition variable=arr_C complete dim=2
  l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
      #pragma HLS pipeline II=1
      l_k: for (int k = 0; k < 32; k++) {
        arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
      }
    }
  }
}

From the synthesis log, we can see that HLS conservatively assumes a loop-carried dependency on the accumulation statement (arr_C[i][j] += ...), so it cannot schedule the read-modify-write of arr_C within a single cycle, and the II degrades to 2.

INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'gemm' 
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'l_i_l_j'.
WARNING: [SCHED 204-68] The II Violation in module 'gemm' (Loop: l_i_l_j): Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)
   between 'store' operation ('arr_C_0_addr_write_ln25', limit.cpp:25) of variable 'add_ln25_31', limit.cpp:25 on array 'arr_C_0' and 'load' operation ('arr_C_0_load', limit.cpp:25) on array 'arr_C_0'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 2, Depth = 4.
WARNING: [SCHED 204-21] Estimated clock period (22.826ns) exceeds the target (target clock period: 10ns, clock uncertainty: 1.25ns, effective delay budget: 8.75ns).
WARNING: [SCHED 204-21] The critical path in module 'gemm' consists of the following:
	'load' operation ('arr_C_0_load', limit.cpp:25) on array 'arr_C_0' [124]  (3.25 ns)
	'mux' operation ('tmp_1', limit.cpp:25) [187]  (3.21 ns)
	'add' operation ('add_ln25_46', limit.cpp:25) [358]  (0 ns)
	'add' operation ('add_ln25_48', limit.cpp:25) [360]  (4.37 ns)
	'add' operation ('add_ln25_52', limit.cpp:25) [364]  (0 ns)
	'add' operation ('add_ln25_61', limit.cpp:25) [373]  (4.37 ns)
	'add' operation ('add_ln25_31', limit.cpp:25) [374]  (4.37 ns)
	'store' operation ('arr_C_30_addr_write_ln25', limit.cpp:25) of variable 'add_ln25_31', limit.cpp:25 on array 'arr_C_30' [377]  (3.25 ns)
INFO: [SCHED 204-11] Finished scheduling.

Now we know why one of the instances has an II of 2. The next step is to make the two function calls exactly the same so they can be merged. We can (1) use the dependence pragma to eliminate the false dependency on arr_C, and (2) explicitly partition the third argument of the function (which also ensures array E is partitioned). We get the following kernel code.

void gemm(int arr_A[32][32],
          int arr_B[32][32],
          int arr_C[32][32]) {
  #pragma HLS array_partition variable=arr_A complete dim=2
  #pragma HLS array_partition variable=arr_B complete dim=1
  #pragma HLS array_partition variable=arr_C complete dim=2
  l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
      #pragma HLS pipeline II=1
      #pragma HLS dependence variable=arr_C inter false
      l_k: for (int k = 0; k < 32; k++) {
        arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
      }
    }
  }
}

Finally, only one instance is created, and it achieves II=1.

+ Latency: 
    * Summary: 
    +---------+---------+-----------+-----------+------+------+---------+
    |  Latency (cycles) |   Latency (absolute)  |   Interval  | Pipeline|
    |   min   |   max   |    min    |    max    |  min |  max |   Type  |
    +---------+---------+-----------+-----------+------+------+---------+
    |     2063|     2063| 20.630 us | 20.630 us |  2063|  2063|   none  |
    +---------+---------+-----------+-----------+------+------+---------+

    + Detail: 
        * Instance: 
        +-----------------+-------+---------+---------+-----------+-----------+------+------+---------+
        |                 |       |  Latency (cycles) |   Latency (absolute)  |   Interval  | Pipeline|
        |     Instance    | Module|   min   |   max   |    min    |    max    |  min |  max |   Type  |
        +-----------------+-------+---------+---------+-----------+-----------+------+------+---------+
        |grp_gemm_fu_402  |gemm   |     1030|     1030| 10.300 us | 10.300 us |  1030|  1030|   none  |
        +-----------------+-------+---------+---------+-----------+-----------+------+------+---------+

Going back to the original code, the root cause of this issue is the misalignment of partition types between the input and output arrays. If we view array partitioning as a form of data organization, then this issue becomes the data layout problem in the HLS setting – the best data layout for the first kernel may not be the best one for the second kernel. Intuitively, we have two solutions here:

  1. Align the layouts of the inputs and outputs, and make sure the arguments have the same partitioning across all function calls. Even though some kernels may then not reach their individual best performance, this is acceptable as long as the overall performance is much better. This is the method we used in the GEMM example above.
  2. Add a layout transformation function between kernels. This is a common practice in deep learning frameworks [2]. Although the transformation brings extra overhead, similar to the first point, the conversion is acceptable if the overall performance improves; a minimal sketch is shown after this list.
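
As a rough illustration of the second option, the sketch below copies an array that the upstream kernel produces with partitioning along the 2nd dimension into a buffer partitioned along the 1st dimension for the downstream kernel. The function name relayout, the 32x32 size, and the pragma choices are illustrative assumptions rather than code from my project; when pipelined, the extra pass costs roughly one element per cycle.

void relayout(int in_arr[32][32], int out_arr[32][32]) {
  // in_arr uses the upstream kernel's layout (banked along dim 2),
  // out_arr uses the downstream kernel's layout (banked along dim 1).
  #pragma HLS array_partition variable=in_arr complete dim=2
  #pragma HLS array_partition variable=out_arr complete dim=1
  l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
      #pragma HLS pipeline II=1
      // Plain element-wise copy: the logical contents are unchanged,
      // only the physical banking differs between the two buffers.
      out_arr[i][j] = in_arr[i][j];
    }
  }
}

Whether this extra pass (and the duplicated buffer) pays off depends on how much II the downstream kernel gains from its preferred layout.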

This layout problem does not only exist in deep learning applications. Once we consider a design with multiple kernels, the data layout between the kernels matters a lot.

A similar issue appears in the FFT kernel. The following code shows a typical implementation of the Cooley-Tukey FFT with SIZE=1024, adapted from the PP4FPGAS book [3]. This code can easily achieve II=1 for each stage [4], but it is not the most efficient version, and we can further increase the parallelism. Based on the FFT algorithm, we know every element of X and OUT is accessed exactly once per stage, so we can safely unroll dft_loop by a factor (say 8 or 16), which in turn requires cyclically partitioning the input and output arrays. The problem is that the access pattern differs completely from stage to stage, so no matter how we choose the partition factor, we cannot guarantee that all stages achieve II=1. I do not have a solution yet, and I leave it as an interesting problem for future work.

#include "fft.h"

// concrete implementations omitted
// please refer to my other blog post (in Chinese) for the full code:
// https://chhzh123.github.io/blogs/2021-06-10-ccc2021/
void bit_reverse(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
void fft_stage_first(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
void fft_stages(DTYPE X_R[SIZE], DTYPE X_I[SIZE], int stage, DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
void fft_stage_last(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);

void fft(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE])
{
#pragma HLS dataflow

  //Call fft
  DTYPE Stage1_R[SIZE], Stage1_I[SIZE];
  DTYPE Stage2_R[SIZE], Stage2_I[SIZE];
  DTYPE Stage3_R[SIZE], Stage3_I[SIZE];
  DTYPE Stage4_R[SIZE], Stage4_I[SIZE];
  DTYPE Stage5_R[SIZE], Stage5_I[SIZE];
  DTYPE Stage6_R[SIZE], Stage6_I[SIZE];
  DTYPE Stage7_R[SIZE], Stage7_I[SIZE];
  DTYPE Stage8_R[SIZE], Stage8_I[SIZE];
  DTYPE Stage9_R[SIZE], Stage9_I[SIZE];
  DTYPE Stage10_R[SIZE], Stage10_I[SIZE];

  bit_reverse(X_R, X_I, Stage1_R, Stage1_I);
  fft_stage_first(Stage1_R, Stage1_I, Stage2_R, Stage2_I);
  fft_stages(Stage2_R, Stage2_I, 2, Stage3_R, Stage3_I);
  fft_stages(Stage3_R, Stage3_I, 3, Stage4_R, Stage4_I);
  fft_stages(Stage4_R, Stage4_I, 4, Stage5_R, Stage5_I);
  fft_stages(Stage5_R, Stage5_I, 5, Stage6_R, Stage6_I);
  fft_stages(Stage6_R, Stage6_I, 6, Stage7_R, Stage7_I);
  fft_stages(Stage7_R, Stage7_I, 7, Stage8_R, Stage8_I);
  fft_stages(Stage8_R, Stage8_I, 8, Stage9_R, Stage9_I);
  fft_stages(Stage9_R, Stage9_I, 9, Stage10_R, Stage10_I);
  fft_stage_last(Stage10_R, Stage10_I, OUT_R, OUT_I);
}

void fft_stages(DTYPE X_R[SIZE], DTYPE X_I[SIZE], int stage, DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]) {
  int DFTpts = 1 << stage;    // DFT = 2^stage = points in sub DFT
  int numBF = DFTpts / 2;     // Butterfly WIDTHS in sub-DFT
  int step = SIZE >> stage;   // twiddle-factor stride; equals the shift by (10 - stage) below
  // Perform butterflies for j-th stage
  butterfly_loop:
  for (int j = 0; j < numBF; j++) {
    DTYPE c = W_real[j<<(10-stage)];
    DTYPE s = W_imag[j<<(10-stage)];
    dft_loop:
    for (int i = j; i < SIZE; i += DFTpts) {
    // * we can insert unrolling here
    #pragma HLS pipeline
      int i_lower = i + numBF; // index of lower point in butterfly
      DTYPE temp_R = X_R[i_lower] * c - X_I[i_lower] * s;
      DTYPE temp_I = X_I[i_lower] * c + X_R[i_lower] * s;
      OUT_R[i_lower] = X_R[i] - temp_R;
      OUT_I[i_lower] = X_I[i] - temp_I;
      OUT_R[i] = X_R[i] + temp_R;
      OUT_I[i] = X_I[i] + temp_I;
    }
  }
}
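
To make the parallelization attempt described above concrete, here is a sketch that unrolls dft_loop by a factor of 8 and cyclically partitions the four arrays by the same factor, reusing the types and twiddle tables from fft.h. The function name fft_stages_par, the factor of 8, and the pragma placement are my own illustrative assumptions; as discussed, the stride DFTpts changes from stage to stage, so a single cyclic factor cannot keep every stage at II=1.

void fft_stages_par(DTYPE X_R[SIZE], DTYPE X_I[SIZE], int stage,
                    DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]) {
  // Cyclic partitioning exposes 8 banks per array so that, ideally,
  // 8 butterflies could be issued per cycle.
  #pragma HLS array_partition variable=X_R cyclic factor=8
  #pragma HLS array_partition variable=X_I cyclic factor=8
  #pragma HLS array_partition variable=OUT_R cyclic factor=8
  #pragma HLS array_partition variable=OUT_I cyclic factor=8
  int DFTpts = 1 << stage;    // points in sub-DFT
  int numBF = DFTpts / 2;     // butterfly width in sub-DFT
  butterfly_loop:
  for (int j = 0; j < numBF; j++) {
    DTYPE c = W_real[j << (10 - stage)];
    DTYPE s = W_imag[j << (10 - stage)];
    dft_loop:
    for (int i = j; i < SIZE; i += DFTpts) {
      // Partial unroll + pipeline: whether II=1 is met depends on how the
      // stage's access stride maps onto the 8 cyclic banks.
      #pragma HLS pipeline II=1
      #pragma HLS unroll factor=8
      int i_lower = i + numBF;
      DTYPE temp_R = X_R[i_lower] * c - X_I[i_lower] * s;
      DTYPE temp_I = X_I[i_lower] * c + X_R[i_lower] * s;
      OUT_R[i_lower] = X_R[i] - temp_R;
      OUT_I[i_lower] = X_I[i] - temp_I;
      OUT_R[i] = X_R[i] + temp_R;
      OUT_I[i] = X_I[i] + temp_I;
    }
  }
}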

From these two examples, we can see that writing efficient HLS code is not trivial and requires the engineer to have a good understanding of the HLS tool. We also see a trend toward increasingly complex applications with many kernels, which places more challenges on HLS programming. Previous work on HLS optimization mostly focuses on operator-level optimization, but we now need to think more about inter-kernel interactions and the datapath across different kernels, which is an essential part of graph-level optimization and gives us more opportunities to optimize the accelerator design.


  1. Different tool versions may lead to different results, as summarized below. Vitis HLS does not seem stable enough yet, so I avoid using it for now.

     | II (1st call / 2nd call) | w/o ALLOCATION pragma | w/ ALLOCATION pragma |
     | ------------------------ | --------------------- | -------------------- |
     | Vivado HLS 2019          | 2/1                   | 2/1                  |
     | Vitis HLS 2022           | 1/1                   | 2/1                  |

  2. Some kernels are faster using the NCHW layout, while others are faster using NHWC, so layout conversions are inserted between the CUDA kernel calls. 

  3. https://github.com/sazczmh/PP4FPGAS_Study_Notes_S1C05_HLS_FFT/blob/master/fft.cpp 

  4. The bit-reverse function cannot achieve II=1 due to its complex memory access pattern, which cannot be easily modeled by current HLS tools that only support complete/block/cyclic partitioning.