This blog post briefly discusses a recent “bug” I found in High-Level Synthesis (HLS), which highlights how challenging it is to write efficient code using HLS.

Recently I have been working on a project that maps neural networks (NNs) with tens of layers onto FPGAs. I encountered an issue: the HLS tool always generates two different instances for the same function call without giving any warning, even though I have already specified a resource limit. To figure out the reason, I wrote a simple example consisting of two kernels to see whether HLS behaves as expected.
This example cascades two GEMM kernels as follows. We pipeline the middle `j` loop, so the innermost `k` loop is fully unrolled. As a result, `arr_A` and `arr_B` should also be partitioned. This kernel does not have any problems, and it is easy to verify that it achieves II=1.
```cpp
void gemm(int arr_A[32][32],
          int arr_B[32][32],
          int arr_C[32][32]) {
#pragma HLS array_partition variable=arr_A complete dim=2
#pragma HLS array_partition variable=arr_B complete dim=1
l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
#pragma HLS pipeline II=1
        l_k: for (int k = 0; k < 32; k++) {
            arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
        }
    }
}
}
```
```cpp
void top(int A[32][32],
         int B[32][32],
         int D[32][32],
         int E[32][32]) {
#pragma HLS ALLOCATION instances=gemm limit=1 function
    int C[32][32];
    gemm(A, B, C);
    gemm(C, D, E);
}
```
However, something weird happens: HLS generates two different instances for this GEMM kernel, and one of them is significantly slower with II=2. The results are listed below (synthesized by Vivado HLS 2019[^1]).
Kernel | Latency (cycles) | Iteration Latency | II | DSP | FF | LUT |
---|---|---|---|---|---|---|
gemm | 2049 | 4 | 2 | 96 | 2443 | 2530 |
gemm_1 | 1030 | 6 | 1 | 96 | 3619 | 1968 |
So what makes the difference between these two function calls? Actually, there is one implicit rule that is not written in our code: array `C` is also partitioned. In the second call, `C` is passed as the first argument of the function, so it is completely partitioned along the 2nd dimension. In other words, all three arrays in the first call are partitioned, while only `C` and `D` in the second call are; `E` is not. To test whether an unnecessary array partition harms performance, I tried the following kernel with an explicit partition on `arr_C`, and it reproduces exactly the II=2 result.
```cpp
void gemm(int arr_A[32][32],
          int arr_B[32][32],
          int arr_C[32][32]) {
#pragma HLS array_partition variable=arr_A complete dim=2
#pragma HLS array_partition variable=arr_B complete dim=1
#pragma HLS array_partition variable=arr_C complete dim=2
l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
#pragma HLS pipeline II=1
        l_k: for (int k = 0; k < 32; k++) {
            arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
        }
    }
}
}
```
From the synthesis log, we can see that HLS conservatively assumes a loop-carried dependency on the accumulation statement (`arr_C[i][j] += ...`), so it cannot schedule all the operations into one cycle.
```
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'gemm'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'l_i_l_j'.
WARNING: [SCHED 204-68] The II Violation in module 'gemm' (Loop: l_i_l_j): Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)
between 'store' operation ('arr_C_0_addr_write_ln25', limit.cpp:25) of variable 'add_ln25_31', limit.cpp:25 on array 'arr_C_0' and 'load' operation ('arr_C_0_load', limit.cpp:25) on array 'arr_C_0'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 2, Depth = 4.
WARNING: [SCHED 204-21] Estimated clock period (22.826ns) exceeds the target (target clock period: 10ns, clock uncertainty: 1.25ns, effective delay budget: 8.75ns).
WARNING: [SCHED 204-21] The critical path in module 'gemm' consists of the following:
    'load' operation ('arr_C_0_load', limit.cpp:25) on array 'arr_C_0' [124] (3.25 ns)
    'mux' operation ('tmp_1', limit.cpp:25) [187] (3.21 ns)
    'add' operation ('add_ln25_46', limit.cpp:25) [358] (0 ns)
    'add' operation ('add_ln25_48', limit.cpp:25) [360] (4.37 ns)
    'add' operation ('add_ln25_52', limit.cpp:25) [364] (0 ns)
    'add' operation ('add_ln25_61', limit.cpp:25) [373] (4.37 ns)
    'add' operation ('add_ln25_31', limit.cpp:25) [374] (4.37 ns)
    'store' operation ('arr_C_30_addr_write_ln25', limit.cpp:25) of variable 'add_ln25_31', limit.cpp:25 on array 'arr_C_30' [377] (3.25 ns)
INFO: [SCHED 204-11] Finished scheduling.
```
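The dependence is in fact false: since the `k` loop is fully unrolled inside the pipelined `j` loop, each pipelined iteration reads and writes a different `arr_C[i][j]` address. One way to make this explicit without any pragma is to accumulate into a local register and write the result back once. The following sketch (an alternative rewrite, not the fix used in this post; `gemm_acc` is our own name) shows the idea:

```cpp
// Sketch: same GEMM, but the reduction happens in a scalar register,
// so there is no load-after-store on arr_C within the pipelined loop.
void gemm_acc(int arr_A[32][32],
              int arr_B[32][32],
              int arr_C[32][32]) {
#pragma HLS array_partition variable=arr_A complete dim=2
#pragma HLS array_partition variable=arr_B complete dim=1
l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
#pragma HLS pipeline II=1
        int acc = arr_C[i][j];  // read arr_C[i][j] once
        l_k: for (int k = 0; k < 32; k++) {
            acc += arr_A[i][k] * arr_B[k][j];
        }
        arr_C[i][j] = acc;      // write arr_C[i][j] once
    }
}
}
```

Whether this rewrite also merges cleanly with the second call still depends on `arr_C` getting the same partitioning at both call sites, so the layout-alignment step below remains necessary.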
Now we know why one instance has an II of 2. The next step is to make the two function calls exactly the same so that they can be merged. We can (1) use the `dependence` pragma to eliminate the false dependency on `arr_C`, and (2) explicitly partition the third argument of the function (making sure array `E` is also partitioned). We get the following kernel code.
```cpp
void gemm(int arr_A[32][32],
          int arr_B[32][32],
          int arr_C[32][32]) {
#pragma HLS array_partition variable=arr_A complete dim=2
#pragma HLS array_partition variable=arr_B complete dim=1
#pragma HLS array_partition variable=arr_C complete dim=2
l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
#pragma HLS pipeline II=1
#pragma HLS dependence variable=arr_C inter false
        l_k: for (int k = 0; k < 32; k++) {
            arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
        }
    }
}
}
```
Finally, only one instance is created, and it achieves II=1.
```
+ Latency:
    * Summary:
    +---------+---------+-----------+-----------+------+------+---------+
    |     Latency (cycles)    |   Latency (absolute)  |    Interval   | Pipeline|
    |   min   |   max   |    min    |    max    |  min |  max |   Type  |
    +---------+---------+-----------+-----------+------+------+---------+
    |     2063|     2063| 20.630 us | 20.630 us |  2063|  2063|   none  |
    +---------+---------+-----------+-----------+------+------+---------+
+ Detail:
    * Instance:
    +-----------------+-------+---------+---------+-----------+-----------+------+------+---------+
    |                 |       |     Latency (cycles)    |   Latency (absolute)  |    Interval   | Pipeline|
    |     Instance    | Module|   min   |   max   |    min    |    max    |  min |  max |   Type  |
    +-----------------+-------+---------+---------+-----------+-----------+------+------+---------+
    |grp_gemm_fu_402  |gemm   |     1030|     1030| 10.300 us | 10.300 us |  1030|  1030|   none  |
    +-----------------+-------+---------+---------+-----------+-----------+------+------+---------+
```
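Since pragmas only affect scheduling, not semantics, the merged kernel can be sanity-checked in plain C simulation against a naive reference. A host compiler ignores the unknown `#pragma HLS` lines, so the same source compiles as ordinary C++. This sketch is ours (the reference function `top_ref` and the zero-initialization of `C`, which the original `top` omits, are our additions for host-side checking):

```cpp
// Final kernel from the post; HLS pragmas are no-ops for a host compiler.
void gemm(int arr_A[32][32], int arr_B[32][32], int arr_C[32][32]) {
#pragma HLS array_partition variable=arr_A complete dim=2
#pragma HLS array_partition variable=arr_B complete dim=1
#pragma HLS array_partition variable=arr_C complete dim=2
l_i: for (int i = 0; i < 32; i++) {
    l_j: for (int j = 0; j < 32; j++) {
#pragma HLS pipeline II=1
#pragma HLS dependence variable=arr_C inter false
        l_k: for (int k = 0; k < 32; k++)
            arr_C[i][j] += arr_A[i][k] * arr_B[k][j];
    }
}
}

void top(int A[32][32], int B[32][32], int D[32][32], int E[32][32]) {
#pragma HLS ALLOCATION instances=gemm limit=1 function
    int C[32][32] = {};  // zero-init added here so the host check is well defined
    gemm(A, B, C);
    gemm(C, D, E);
}

// Naive reference: E += (A*B) * D, used only for checking.
void top_ref(int A[32][32], int B[32][32], int D[32][32], int E[32][32]) {
    int C[32][32] = {};
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            for (int k = 0; k < 32; k++)
                C[i][j] += A[i][k] * B[k][j];
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            for (int k = 0; k < 32; k++)
                E[i][j] += C[i][k] * D[k][j];
}
```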
Going back to the original code, we can see that the root cause of this issue is a misalignment of the partition types between the input and output arrays. If we view array partitioning as one kind of data organization, then this issue becomes the data layout problem in the HLS scenario: the best data layout for the first kernel may not be the best one for the second kernel. Intuitively, we have two solutions here: either let HLS specialize one instance per call site (duplicating hardware), or align the layouts across kernels so that a single instance can be shared, as we did above.

This layout problem does not only exist in deep learning applications[^2]. Once we consider a design with multiple kernels, the data layout of each kernel matters a lot.
A similar issue can be found in the FFT kernel. The following code shows a typical implementation of the Cooley-Tukey FFT with SIZE=1024, referenced from the PP4FPGA book[^3]. It is easy for this code to achieve II=1 in each stage[^4], but it is not the most efficient version, and we can further increase the parallelism. From the FFT algorithm, we know that all elements in `X` and `OUT` are accessed exactly once in each stage, so we can safely unroll `dft_loop` by some factor (say 8 or 16). To do so, we also need to cyclically partition the input and output arrays. The problem, however, is that the access pattern differs completely between FFT stages, so no matter what partition factor we choose, we cannot guarantee that every stage achieves II=1. I do not have a solution yet; it is left as an interesting problem for future work.
```cpp
#include "fft.h"
// The concrete implementation is omitted here.
// Please refer to another blog of mine (in Chinese) for the full code:
// https://chhzh123.github.io/blogs/2021-06-10-ccc2021/
void bit_reverse(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
void fft_stage_first(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
void fft_stages(DTYPE X_R[SIZE], DTYPE X_I[SIZE], int STAGES, DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);
void fft_stage_last(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]);

void fft(DTYPE X_R[SIZE], DTYPE X_I[SIZE], DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE])
{
#pragma HLS dataflow
    // Call the FFT stages
    DTYPE Stage1_R[SIZE], Stage1_I[SIZE];
    DTYPE Stage2_R[SIZE], Stage2_I[SIZE];
    DTYPE Stage3_R[SIZE], Stage3_I[SIZE];
    DTYPE Stage4_R[SIZE], Stage4_I[SIZE];
    DTYPE Stage5_R[SIZE], Stage5_I[SIZE];
    DTYPE Stage6_R[SIZE], Stage6_I[SIZE];
    DTYPE Stage7_R[SIZE], Stage7_I[SIZE];
    DTYPE Stage8_R[SIZE], Stage8_I[SIZE];
    DTYPE Stage9_R[SIZE], Stage9_I[SIZE];
    DTYPE Stage10_R[SIZE], Stage10_I[SIZE];
    bit_reverse(X_R, X_I, Stage1_R, Stage1_I);
    fft_stage_first(Stage1_R, Stage1_I, Stage2_R, Stage2_I);
    fft_stages(Stage2_R, Stage2_I, 2, Stage3_R, Stage3_I);
    fft_stages(Stage3_R, Stage3_I, 3, Stage4_R, Stage4_I);
    fft_stages(Stage4_R, Stage4_I, 4, Stage5_R, Stage5_I);
    fft_stages(Stage5_R, Stage5_I, 5, Stage6_R, Stage6_I);
    fft_stages(Stage6_R, Stage6_I, 6, Stage7_R, Stage7_I);
    fft_stages(Stage7_R, Stage7_I, 7, Stage8_R, Stage8_I);
    fft_stages(Stage8_R, Stage8_I, 8, Stage9_R, Stage9_I);
    fft_stages(Stage9_R, Stage9_I, 9, Stage10_R, Stage10_I);
    fft_stage_last(Stage10_R, Stage10_I, OUT_R, OUT_I);
}
```
```cpp
void fft_stages(DTYPE X_R[SIZE], DTYPE X_I[SIZE], int stage, DTYPE OUT_R[SIZE], DTYPE OUT_I[SIZE]) {
    int DFTpts = 1 << stage; // DFT = 2^stage = points in sub-DFT
    int numBF = DFTpts / 2;  // butterfly width in sub-DFT
    int step = SIZE >> 2;
    // Perform butterflies for the j-th stage
    butterfly_loop:
    for (int j = 0; j < numBF; j++) {
        DTYPE c = W_real[j << (10 - stage)];
        DTYPE s = W_imag[j << (10 - stage)];
        dft_loop:
        for (int i = j; i < SIZE; i += DFTpts) {
            // * we can insert unrolling here
#pragma HLS pipeline
            int i_lower = i + numBF; // index of the lower point in the butterfly
            DTYPE temp_R = X_R[i_lower] * c - X_I[i_lower] * s;
            DTYPE temp_I = X_I[i_lower] * c + X_R[i_lower] * s;
            OUT_R[i_lower] = X_R[i] - temp_R;
            OUT_I[i_lower] = X_I[i] - temp_I;
            OUT_R[i] = X_R[i] + temp_R;
            OUT_I[i] = X_I[i] + temp_I;
        }
    }
}
```
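To see why no single cyclic partition factor works for every stage, consider the banks touched by one group of unrolled `dft_loop` iterations. The following host-side sketch (the helper name `max_bank_pressure` and the "bank pressure" framing are ours) counts, for a cyclic partition with factor `P`, how many of the `2*U` accesses issued by `U` unrolled iterations land in the same bank:

```cpp
#include <algorithm>
#include <map>

// For one FFT stage, return the worst-case number of accesses that a group
// of U unrolled dft_loop iterations issues to a single bank of a cyclic
// partition with factor P. Each iteration reads indices i and i + numBF.
int max_bank_pressure(int stage, int U, int P) {
    int DFTpts = 1 << stage;   // stride between consecutive unrolled iterations
    int numBF = DFTpts / 2;    // butterfly offset within one iteration
    std::map<int, int> pressure;
    int j = 0;                 // the first butterfly group is representative
    for (int t = 0; t < U; t++) {
        int i = j + t * DFTpts;
        pressure[i % P]++;            // access to X[i]
        pressure[(i + numBF) % P]++;  // access to X[i + numBF]
    }
    int worst = 0;
    for (auto &kv : pressure) worst = std::max(worst, kv.second);
    return worst;
}
```

With `P = 8` and `U = 8`, the early stages spread their accesses across banks (pressure 2, which a dual-port BRAM can serve in one cycle), but as the stride `DFTpts` aligns with `P` the pressure grows, and once `numBF` is also a multiple of `P`, all 16 accesses pile onto a single bank, forcing II >= 8 in those stages. Increasing `P` only shifts this boundary to later stages; it never removes it, which is exactly the layout conflict described above.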
From these two examples, we can see that writing efficient HLS code is nontrivial and requires a good understanding of the HLS tool. Recently we have also seen a trend toward increasingly complex applications with many kernels, which poses more challenges for HLS programming. Previous work on HLS optimization mostly focuses on operator-level optimization, but now we need to think more about inter-kernel interaction, i.e., the datapath across different kernels. This is an essential part of graph-level optimization and gives us more opportunities to optimize the accelerator design.
As a side note, different tool versions give different results for this example:

II (1st kernel / 2nd kernel) | w/o alloc pragma | w/ alloc pragma |
---|---|---|
Vivado HLS 2019 | 2/1 | 2/1 |
Vitis HLS 2022 | 1/1 | 2/1 |
[^1]: Using different versions of Vivado may lead to different results. It seems that Vitis HLS is not stable enough, so we need to avoid using it at this time.

[^2]: Some kernels are faster using the NCHW layout, while others are faster using NHWC, so there will be some layout conversions between the CUDA kernel calls.

[^3]: https://github.com/sazczmh/PP4FPGAS_Study_Notes_S1C05_HLS_FFT/blob/master/fft.cpp

[^4]: The bit reverse function cannot achieve II=1 due to its complex memory access pattern, which cannot be easily modeled by current HLS tools that only support complete/block/cyclic partitioning.