This article walks through the configuration, getting-started workflow, and optimization methods of Vivado HLS, covering the common pragmas and library functions, and points to a large set of references for further reading.
C-HLS can be loosely understood as an extension of C/C++: it adds hardware-oriented compiler directives so that a high-level specification can be mapped to an RTL-level circuit description.
C/C++ constructs correspond to hardware constructs as follows.
C/C++ | Hardware |
---|---|
Function | Module |
Argument | Input/output port |
Operator | Functional unit |
Scalar | Wire or register |
Array | Memory |
Control flow | Control logic |
In general, the RTL/hardware module hierarchy matches the hierarchy of the original C/C++ code.
The matrix-multiplication example below (taken from the Zynq Book Tutorials, Exercise 3) requires the following four files.
matrix_mult.h
: Header file containing the basic macros, type definitions, and the prototype of the top-level function
#ifndef __MATRIXMUL_H__
#define __MATRIXMUL_H__
#include <cmath>
using namespace std;
// Compare TB vs HW C-model and/or RTL
#define HW_COSIM
#define IN_A_ROWS 5
#define IN_A_COLS 5
#define IN_B_ROWS 5
#define IN_B_COLS 5
typedef char mat_a;
typedef char mat_b;
typedef short mat_prod;
// Prototype of top level function for C-synthesis
void matrix_mult(
mat_a a[IN_A_ROWS][IN_A_COLS],
mat_b b[IN_B_ROWS][IN_B_COLS],
mat_prod prod[IN_A_ROWS][IN_B_COLS]);
#endif // __MATRIXMUL_H__ not defined
matrix_mult.cpp
: Implementation of the core function
#include "matrix_mult.h"
void matrix_mult(
mat_a a[IN_A_ROWS][IN_A_COLS],
mat_b b[IN_B_ROWS][IN_B_COLS],
mat_prod prod[IN_A_ROWS][IN_B_COLS])
{
// Iterate over the rows of the A matrix
Row:
for (int i = 0; i < IN_A_ROWS; i++)
{
// Iterate over the columns of the B matrix
Col:
for (int j = 0; j < IN_B_COLS; j++)
{
prod[i][j] = 0;
// Do the inner product of a row of A and col of B
Product:
for (int k = 0; k < IN_B_ROWS; k++)
{
prod[i][j] += a[i][k] * b[k][j];
}
}
}
}
matrix_mult_test.cpp
: Test bench, used for C simulation and hardware/software co-simulation
#include <iostream>
#include "matrix_mult.h"
using namespace std;
int main(int argc, char **argv)
{
mat_a in_mat_a[5][5] = {
{0, 0, 0, 0, 1},
{0, 0, 0, 1, 0},
{0, 0, 1, 0, 0},
{0, 1, 0, 0, 0},
{1, 0, 0, 0, 0}};
mat_b in_mat_b[5][5] = {
{1, 1, 1, 1, 1},
{0, 1, 1, 1, 1},
{0, 0, 1, 1, 1},
{0, 0, 0, 1, 1},
{0, 0, 0, 0, 1}};
mat_prod hw_result[5][5], sw_result[5][5];
int error_count = 0;
// Generate the expected result
// Iterate over the rows of the A matrix
for (int i = 0; i < IN_A_ROWS; i++)
{
for (int j = 0; j < IN_B_COLS; j++)
{
// Iterate over the columns of the B matrix
sw_result[i][j] = 0;
// Do the inner product of a row of A and col of B
for (int k = 0; k < IN_B_ROWS; k++)
{
sw_result[i][j] += in_mat_a[i][k] * in_mat_b[k][j];
}
}
}
#ifdef HW_COSIM
// Run the Vivado HLS matrix multiplier
matrix_mult(in_mat_a, in_mat_b, hw_result);
#endif
// Print product matrix
for (int i = 0; i < IN_A_ROWS; i++)
{
for (int j = 0; j < IN_B_COLS; j++)
{
#ifdef HW_COSIM
// Check result of HLS vs. expected
if (hw_result[i][j] != sw_result[i][j])
{
error_count++;
}
#else
cout << sw_result[i][j];
#endif
}
}
#ifdef HW_COSIM
if (error_count)
cout << "TEST FAIL: " << error_count << "Results do not match!" << endl;
else
cout << "Test passed!" << endl;
#endif
return error_count;
}
run_hls.tcl
: Tcl script that automates the build and run
# run.tcl
# open the HLS project mm.prj
set src_dir "."
open_project -reset matrix_mult_prj
# set the top-level function of the design
set_top matrix_mult
# add design and testbench files
add_files $src_dir/matrix_mult.h
add_files $src_dir/matrix_mult.cpp
add_files -tb $src_dir/matrix_mult_test.cpp
open_solution "solution"
# use Zynq device
set_part {xc7z020clg484-1}
# target clock period is 10 ns
create_clock -period 10 -name default
# do a c simulation
csim_design -clean
# synthesize the design
csynth_design
# do a co-simulation
#cosim_design
# close project and quit
close_project
exit
The script is invoked with vivado_hls -f run_hls.tcl. The resulting command-line output is shown below.
****** Vivado(TM) HLS - High-Level Synthesis from C, C++ and SystemC v2018.1 (64-bit)
**** SW Build 2188600 on Wed Apr 4 18:40:38 MDT 2018
**** IP Build 2185939 on Wed Apr 4 20:55:05 MDT 2018
** Copyright 1986-2018 Xilinx, Inc. All Rights Reserved.
#######
# set up projects
#######
# c simulation
INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
Compiling ../../../../matrix_mult_test.cpp in debug mode
Compiling ../../../../matrix_mult.cpp in debug mode
Generating csim.exe
Test passed!
INFO: [SIM 211-1] CSim done with 0 errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
# synthesis
INFO: [HLS 200-10] Analyzing design file './matrix_mult.cpp' ...
INFO: [HLS 200-10] Validating synthesis directives ...
INFO: [HLS 200-111] Finished Checking Pragmas Time (s): cpu = 00:00:01 ; elapsed = 00:00:13 . Memory (MB): peak = 101.551 ; gain = 44.590
INFO: [HLS 200-111] Finished Linking Time (s): cpu = 00:00:01 ; elapsed = 00:00:14 . Memory (MB): peak = 101.563 ; gain = 44.602
INFO: [HLS 200-10] Starting code transformations ...
INFO: [HLS 200-111] Finished Standard Transforms Time (s): cpu = 00:00:02 ; elapsed = 00:00:15 . Memory (MB): peak = 102.961 ; gain = 46.000
INFO: [HLS 200-10] Checking synthesizability ...
INFO: [HLS 200-111] Finished Checking Synthesizability Time (s): cpu = 00:00:02 ; elapsed = 00:00:15 . Memory (MB): peak = 103.191 ; gain = 46.230
INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:02 ; elapsed = 00:00:16 . Memory (MB): peak = 124.961 ; gain = 68.000
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:02 ; elapsed = 00:00:17 . Memory (MB): peak = 124.961 ; gain = 68.000
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'matrix_mult' ...
INFO: [HLS 200-10]
----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-11] Finished scheduling.
INFO: [HLS 200-111] Elapsed time: 17.173 seconds; current allocated memory: 75.025 MB.
INFO: [BIND 205-100] Starting micro-architecture generation ...
INFO: [BIND 205-101] Performing variable lifetime analysis.
INFO: [BIND 205-101] Exploring resource sharing.
INFO: [BIND 205-101] Binding ...
INFO: [BIND 205-100] Finished micro-architecture generation.
INFO: [HLS 200-111] Elapsed time: 0.281 seconds; current allocated memory: 75.202 MB.
INFO: [HLS 200-10]
----------------------------------------------------------------
INFO: [HLS 200-10] -- Generating RTL for module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [RTGEN 206-500] Setting interface mode on port 'matrix_mult/a' to 'ap_memory'.
INFO: [RTGEN 206-500] Setting interface mode on port 'matrix_mult/b' to 'ap_memory'.
INFO: [RTGEN 206-500] Setting interface mode on port 'matrix_mult/prod' to 'ap_memory'.
INFO: [RTGEN 206-500] Setting interface mode on function 'matrix_mult' to 'ap_ctrl_hs'.
INFO: [SYN 201-210] Renamed object name 'matrix_mult_mac_muladd_8s_8s_16ns_16_1_1' to 'matrix_mult_mac_mbkb' due to the length limit 20
INFO: [RTGEN 206-100] Generating core module 'matrix_mult_mac_mbkb': 1 instance(s).
INFO: [RTGEN 206-100] Finished creating RTL model for 'matrix_mult'.
INFO: [HLS 200-111] Elapsed time: 0.229 seconds; current allocated memory: 75.575 MB.
INFO: [HLS 200-111] Finished generating all RTL models Time (s): cpu = 00:00:03 ; elapsed = 00:00:19 . Memory (MB): peak = 124.961 ; gain = 68.000
INFO: [SYSC 207-301] Generating SystemC RTL for matrix_mult.
INFO: [VHDL 208-304] Generating VHDL RTL for matrix_mult.
INFO: [VLOG 209-307] Generating Verilog RTL for matrix_mult.
INFO: [HLS 200-112] Total elapsed time: 18.761 seconds; peak allocated memory: 75.575 MB.
INFO: [Common 17-206] Exiting vivado_hls at Thu Mar 12 16:38:28 2020...
This produces the report below (see the generated matrix_mult_prj\solution\syn\report\matrix_mult_csynth.rpt file).
================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
| Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk | 10.00| 8.70| 1.25|
+--------+-------+----------+------------+
+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+---------+
| Latency | Interval | Pipeline|
| min | max | min | max | Type |
+-----+-----+-----+-----+---------+
| 311| 311| 311| 311| none |
+-----+-----+-----+-----+---------+
+ Detail:
* Instance:
N/A
* Loop:
+--------------+-----+-----+----------+-----------+-----------+------+----------+
| | Latency | Iteration| Initiation Interval | Trip | |
| Loop Name | min | max | Latency | achieved | target | Count| Pipelined|
+--------------+-----+-----+----------+-----------+-----------+------+----------+
|- Row | 310| 310| 62| -| -| 5| no |
| + Col | 60| 60| 12| -| -| 5| no |
| ++ Product | 10| 10| 2| -| -| 5| no |
+--------------+-----+-----+----------+-----------+-----------+------+----------+
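Before optimizing, it is worth checking how these latency numbers compose (the per-level constants below are my reading of the report, attributing 2 cycles of loop entry/exit overhead per level and 1 cycle of function overhead):
\[ \text{Product}: 5 \times 2 = 10, \quad \text{Col}: 5 \times (10 + 2) = 60, \quad \text{Row}: 5 \times (60 + 2) = 310, \quad \text{Total}: 310 + 1 = 311. \]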
Pipelining lowers the initiation interval (II), increases parallelism, and thereby raises throughput. A PIPELINE pragma is inserted into the Col loop:
void matrix_mult(
mat_a a[IN_A_ROWS][IN_A_COLS],
mat_b b[IN_B_ROWS][IN_B_COLS],
mat_prod prod[IN_A_ROWS][IN_B_COLS])
{
// Iterate over the rows of the A matrix
Row:
for (int i = 0; i < IN_A_ROWS; i++)
{
// Iterate over the columns of the B matrix
Col:
for (int j = 0; j < IN_B_COLS; j++)
{
#pragma HLS PIPELINE II=1
prod[i][j] = 0;
// Do the inner product of a row of A and col of B
Product:
for (int k = 0; k < IN_B_ROWS; k++)
{
prod[i][j] += a[i][k] * b[k][j];
}
}
}
}
Re-running synthesis gives the following log.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'Row_Col'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('a_load_1', ./matrix_mult.cpp:22) on array 'a' due to limited memory ports. Please consider using a memory core with more ports or partitioning the array 'a'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 3, Depth = 5.
WARNING: [SCHED 204-21] Estimated clock period (10.779ns) exceeds the target (target clock period: 10ns, clock uncertainty: 1.25ns, effective delay budget: 8.75ns).
WARNING: [SCHED 204-21] The critical path consists of the following:
'mul' operation ('tmp_7_2', ./matrix_mult.cpp:22) (3.36 ns)
'add' operation ('tmp2', ./matrix_mult.cpp:22) (3.02 ns)
'add' operation ('tmp_8_4', ./matrix_mult.cpp:22) (2.08 ns)
'store' operation (./matrix_mult.cpp:22) of variable 'tmp_8_4', ./matrix_mult.cpp:22 on array 'prod' (2.32 ns)
INFO: [SCHED 204-11] Finished scheduling.
The performance report below shows that the Row and Col loops have been flattened into a single loop, and the latency drops sharply, an almost 4x improvement. (In fact, on larger data sets a single pipeline alone can already give a 10x+ improvement.)
================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
| Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk | 10.00| 10.78| 1.25|
+--------+-------+----------+------------+
+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+---------+
| Latency | Interval | Pipeline|
| min | max | min | max | Type |
+-----+-----+-----+-----+---------+
| 78| 78| 78| 78| none |
+-----+-----+-----+-----+---------+
+ Detail:
* Instance:
N/A
* Loop:
+-----------+-----+-----+----------+-----------+-----------+------+----------+
| | Latency | Iteration| Initiation Interval | Trip | |
| Loop Name | min | max | Latency | achieved | target | Count| Pipelined|
+-----------+-----+-----+----------+-----------+-----------+------+----------+
|- Row_Col | 76| 76| 5| 3| 1| 25| yes |
+-----------+-----+-----+----------+-----------+-----------+------+----------+
The total number of clock cycles needed to complete a loop is
\[N_{loop}=(J\times N_{body})+N_{control}\]
where J is the trip count, N_body is the latency of one iteration of the loop body, and N_control is the loop-control overhead.
Also note that the scheduling log above reports that although the target II is 1, the best achievable II is 3, because the memory ports are the limiting factor. To improve performance further, the arrays therefore have to be partitioned, to improve the I/O efficiency.
void matrix_mult(
mat_a a[IN_A_ROWS][IN_A_COLS],
mat_b b[IN_B_ROWS][IN_B_COLS],
mat_prod prod[IN_A_ROWS][IN_B_COLS])
{
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
// Iterate over the rows of the A matrix
Row:
for (int i = 0; i < IN_A_ROWS; i++)
{
// Iterate over the columns of the B matrix
Col:
for (int j = 0; j < IN_B_COLS; j++)
{
prod[i][j] = 0;
// Do the inner product of a row of A and col of B
Product:
for (int k = 0; k < IN_B_ROWS; k++)
{
prod[i][j] += a[i][k] * b[k][j];
}
}
}
}
The final report is shown below: the latency drops to 29 cycles, already more than a 10x improvement over the original naive matrix multiplication, and all it took was inserting three lines into the original C++ code.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'matrix_mult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'Row_Col'.
INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 4.
WARNING: [SCHED 204-21] Estimated clock period (11.477ns) exceeds the target (target clock period: 10ns, clock uncertainty: 1.25ns, effective delay budget: 8.75ns).
WARNING: [SCHED 204-21] The critical path consists of the following:
'mul' operation ('tmp_7_4', ./matrix_mult.cpp:25) (3.36 ns)
'add' operation ('tmp3', ./matrix_mult.cpp:25) (3.02 ns)
'add' operation ('tmp2', ./matrix_mult.cpp:25) (3.02 ns)
'add' operation ('tmp_8_4', ./matrix_mult.cpp:25) (2.08 ns)
INFO: [SCHED 204-11] Finished scheduling.
================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
| Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk | 10.00| 11.48| 1.25|
+--------+-------+----------+------------+
+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+---------+
| Latency | Interval | Pipeline|
| min | max | min | max | Type |
+-----+-----+-----+-----+---------+
| 29| 29| 29| 29| none |
+-----+-----+-----+-----+---------+
+ Detail:
* Instance:
N/A
* Loop:
+-----------+-----+-----+----------+-----------+-----------+------+----------+
| | Latency | Iteration| Initiation Interval | Trip | |
| Loop Name | min | max | Latency | achieved | target | Count| Pipelined|
+-----------+-----+-----+----------+-----------+-----------+------+----------+
|- Row_Col | 27| 27| 4| 1| 1| 25| yes |
+-----------+-----+-----+----------+-----------+-----------+------+----------+
The report also contains more detailed memory and resource usage figures (LUT, FF, BRAM, DSP), which are not reproduced here. Note, however, that the HLS resource estimates are fairly inaccurate and can differ substantially from the results after backend synthesis and implementation.
The most commonly used optimization pragmas are listed below.
#pragma HLS pipeline II=<int>
#pragma HLS unroll factor=<N>
#pragma HLS array_partition variable=<variable> <block, cyclic, complete> factor=<int> dim=<int>
#pragma HLS array_reshape variable=<variable> <block, cyclic, complete> factor=<int> dim=<int>
: array_partition splits the elements of an array into separate memories, while array_reshape recombines the split elements into wider words, which reduces the amount of memory consumed.
#pragma HLS dataflow
#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>
: used when a loop bound is not a compile-time constant, in which case the latency would otherwise be shown as ? in the report; see the UG (p. 331) for details.
The loop_pipeline function below illustrates where the pipeline pragma can be placed and what each placement implies.
dout_t loop_pipeline(din_t A[N]) {
int i, j;
static dout_t acc;
LOOP_I: for(i = 0; i < 20; i++){
LOOP_J: for(j = 0; j < 20; j++){
acc += A[i] * j;
}
}
return acc;
}
If no pipeline pragma is applied, all of the code executes sequentially.
If pipeline is applied to LOOP_J, there is only one copy of the loop body in hardware (a single multiplier).
If pipeline is applied to LOOP_I, LOOP_J is unrolled into 20 copies, so 20 multipliers and 20 array accesses have to be scheduled.
If pipeline is applied to the function itself, thousands of LUTs and registers are generated in total.
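As a minimal sketch of the third case above (reusing the placeholder types din_t/dout_t and array A from the snippet), moving the pragma into LOOP_I causes LOOP_J to be unrolled:
dout_t loop_pipeline_outer(din_t A[N]) {
    int i, j;
    static dout_t acc;
    LOOP_I: for(i = 0; i < 20; i++){
        #pragma HLS PIPELINE II=1
        // LOOP_J is unrolled automatically because its enclosing loop is pipelined,
        // so 20 multiplications (and the corresponding array accesses) must be scheduled.
        LOOP_J: for(j = 0; j < 20; j++){
            acc += A[i] * j;
        }
    }
    return acc;
}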
Code like the following looks as if it carries a dependence, but it can in fact be parallelized.
for (int i = 0; i < 4; ++i)
acc += a[i] * b[i];
Because the += accumulation can be deferred to the end, the unrolled form becomes
acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
The four products have no dependence on one another, so all of the operands can be read in parallel.
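A minimal sketch of expressing this in HLS (my own illustrative function, not from the original tutorial; the arrays are fully partitioned so that all four operand pairs really can be read in the same cycle):
int dot4(int a[4], int b[4]) {
    #pragma HLS ARRAY_PARTITION variable=a complete dim=1
    #pragma HLS ARRAY_PARTITION variable=b complete dim=1
    int acc = 0;
    DOT: for (int i = 0; i < 4; ++i) {
        #pragma HLS UNROLL
        // After unrolling, the four multiplications are independent and the
        // additions can be balanced into an adder tree by the tool.
        acc += a[i] * b[i];
    }
    return acc;
}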
Vivado HLS provides the hls::stream<> template class (include the header <hls_stream.h>). It behaves as a FIFO of unbounded length (no size has to be declared; the hardware implementation defaults to a depth of 1), data can be read out of the queue only once, and the top-level interface is implemented with ap_fifo.
When passed between C++ functions, a stream can only be passed by reference, e.g. &my_stream.
If hls::stream is used to pass data between tasks, consider implementing those tasks inside a DATAFLOW region.
In a non-dataflow region the tasks are executed one after another, which means the FIFO must be large enough to hold all intermediate results, otherwise the error below is raised. (Note that if the tasks are not written as explicit functions, Vivado HLS will still fail to recognize a dataflow region even after loop unrolling, so it is best to structure the tasks as function calls.)
ERROR: [XFORM 203-733] An internal stream xxxx.xxxx.V.user.V' with default size is
used in a non-dataflow region, which may result in deadlock. Please consider to
resize the stream using the directive 'set_directive_stream' or the 'HLS stream'
pragma.
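A minimal sketch of the recommended structure (function and variable names are illustrative): producer and consumer are separate functions, the stream is passed by reference, and both calls sit inside a DATAFLOW region.
#include <hls_stream.h>

void producer(hls::stream<int> &s, const int in[32]) {
    for (int i = 0; i < 32; i++)
        s.write(in[i]);        // push exactly 32 elements
}

void consumer(hls::stream<int> &s, int out[32]) {
    for (int i = 0; i < 32; i++)
        out[i] = s.read() + 1; // pop exactly 32 elements
}

void top(const int in[32], int out[32]) {
    #pragma HLS DATAFLOW
    hls::stream<int> s("inter_task_fifo");
    producer(s, in);
    consumer(s, out);
}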
By default, reads and writes on an hls::stream are blocking: a read blocks when the FIFO is empty, and a write blocks when it is full. The non-blocking variants instead return a boolean indicating whether the access succeeded.
// Usage of void write(const T & wdata)
hls::stream<int> my_stream;
int src_var = 42;
my_stream.write(src_var);
// my_stream << src_var;
// Usage of void read(T &rdata)
hls::stream<int> my_stream;
int dst_var;
my_stream.read(dst_var);
// int dst_var = my_stream.read();
// my_stream >> dst_var;
// non-blocking read/write: bool read_nb(T &rdata), bool write_nb(const T &wdata)
my_stream.read_nb(dst_var)
my_stream.write_nb(src_var)
// full/empty
my_stream.full()
my_stream.empty()
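A small illustrative pattern for the non-blocking variants (my own sketch): poll the FIFO state and check the returned boolean instead of stalling.
int tmp;
if (!my_stream.empty()) {
    bool got = my_stream.read_nb(tmp);   // true only if a value was actually popped
}
if (!my_stream.full()) {
    bool sent = my_stream.write_nb(tmp); // false if the FIFO filled up in the meantime
}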
The following pragma can be used to customize a stream, in particular its FIFO depth:
#pragma HLS stream variable=<variable> depth=<int> dim=<int>
Note that the number of writes to a FIFO must match the number of reads. With hls::stream, a mismatch already shows up during csim; with plain arrays there is no way to detect mismatched read/write counts, which later causes co-simulation to loop forever and the hardware to deadlock.
If the m_axi INTERFACE mode is used, the bit width of the port must be a multiple of 8; otherwise code such as the following triggers the error shown.
void test(ap_uint<1> A[10][10])
#pragma HLS INTERFACE m_axi port=A offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=A bundle=control
ERROR: [v++ 203-801] Interface parameter bitwidth 'A.V' (/home/hc2238/heterocl-demo/s1-project/kernel.cpp:15:1)
must be a multiple of 8 for AXI4 master port.
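One possible fix, as a sketch (assuming the single-bit elements can simply be widened; packing eight bits into each ap_uint<8> word would be the more economical alternative):
void test(ap_uint<8> A[10][10]) // widened from ap_uint<1>: AXI4 master ports must be byte-aligned
{
#pragma HLS INTERFACE m_axi port=A offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=A bundle=control
    // ...
}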
The HLS video library requires the header <hls_video.h>; its most useful components are LineBuffer and WindowBuffer.
// hls::LineBuffer<rows, columns, type> variable;
hls::LineBuffer<3,5,char> Buff_A;
Buff_A.shift_pixels_down(2);
Buff_A.insert_top_row(100,2);
Value = Buff_A.getval(1,3); // 9
Row | Column 0 | Column 1 | Column 2 | Column 3 | Column 4 |
---|---|---|---|---|---|
Row 0 | 1 | 2 | 3 | 4 | 5 |
Row 1 | 6 | 7 | 8 | 9 | 10 |
Row 2 | 11 | 12 | 13 | 14 | 15 |
After these operations the buffer becomes
Row | Column 0 | Column 1 | Column 2 | Column 3 | Column 4 |
---|---|---|---|---|---|
Row 0 | 1 | 2 | 100 | 4 | 5 |
Row 1 | 6 | 7 | 3 | 9 | 10 |
Row 2 | 11 | 12 | 8 | 14 | 15 |
Other APIs include
shift_pixels_up()
shift_pixels_down()
insert_bottom_row()
insert_top_row()
getval(row,column)
// hls::Window<row, column, type> variable;
hls::Window<3,3,char> Buff_B;
shift_pixels_up()
shift_pixels_down()
shift_pixels_left()
shift_pixels_right()
insert_pixel(value,row,column)
: directly overwrites a single element
insert_row()
insert_bottom_row()
insert_top_row()
insert_col()
insert_left_col()
insert_right_col()
getval(row, column)
Column 0 | Column 1 | Column 2 | Row |
---|---|---|---|
1 | 2 | 3 | Row 0 |
6 | 7 | 8 | Row 1 |
11 | 12 | 13 | Row 2 |
After Buff_B.shift_pixels_up() this becomes
Column 0 | Column 1 | Column 2 | Row |
---|---|---|---|
6 | 7 | 8 | Row 0 |
11 | 12 | 13 | Row 1 |
New | New | New | Row 2 |
This access pattern is clearly very efficient for implementing convolutions.
After char C[3] = {50, 50, 50}; Buff_B.insert_row(C,1); the window becomes
Column 0 | Column 1 | Column 2 | Row |
---|---|---|---|
1 | 2 | 3 | Row 0 |
50 | 50 | 50 | Row 1 |
11 | 12 | 13 | Row 2 |
Note that the WindowBuffer operations usually act on a whole block of storage at once, whereas the LineBuffer operations are mostly aimed at individual elements.
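To make the convolution use case concrete, here is a rough sketch of a 3x3 filter built from the two buffers (image size, data types, and border handling are simplified assumptions; only the member functions listed above are used):
#include <hls_video.h>
#define IMG_ROWS 64
#define IMG_COLS 64

void conv3x3(unsigned char in[IMG_ROWS][IMG_COLS],
             short out[IMG_ROWS][IMG_COLS],
             const short kernel[3][3])
{
    hls::LineBuffer<3, IMG_COLS, unsigned char> line_buf;
    hls::Window<3, 3, unsigned char> win;
    Row: for (int r = 0; r < IMG_ROWS; r++) {
        Col: for (int c = 0; c < IMG_COLS; c++) {
            #pragma HLS PIPELINE II=1
            // Update column c of the line buffer with the newest pixel.
            line_buf.shift_pixels_up(c);
            line_buf.insert_bottom_row(in[r][c], c);
            // Slide the 3x3 window left and fill its rightmost column from the line buffer.
            win.shift_pixels_left();
            for (int i = 0; i < 3; i++)
                win.insert_pixel(line_buf.getval(i, c), i, 2);
            // Compute the dot product of the window with the kernel.
            short acc = 0;
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    acc += win.getval(i, j) * kernel[i][j];
            // Only positions with a full window produce an output (no border handling here).
            if (r >= 2 && c >= 2)
                out[r - 1][c - 1] = acc;
        }
    }
}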
s_axilite
: passively accepts data; intended for scalar data. A lightweight version of AXI for memory-mapped, single-beat transactions. Its advantage is a simplified interface that occupies less logic area; it does not support bulk transfers, so only a single datum is moved per transaction.
m_axi
: actively initiates transfers. The highest-performance interface, suited to memory-mapped communication; it supports bursts of up to 256 data transfer cycles per address phase.
axis
: a FIFO (streaming) interface; not addressable.
Optional attributes of the HLS INTERFACE pragma:
depth=<int>
: defines the FIFO depth
bundle=<string>
: groups function arguments onto the same port; by default only a single port is used for data transfer
offset=<string>
: controls the address offset for m_axi and s_axilite. For s_axilite, <string> specifies the address in the register map. For m_axi, <string> can take the following values:
direct
: Generate a scalar input offset port
slave
: Generate an offset port and automatically map it to an AXI4-Lite slave interface
off
: Do not generate an offset port
Arbitrary-precision integers (Arbitrary Precision, AP) are C++ template classes under the hood; the implementation is described in HLS Arbitrary Precision Types.
#include "ap_int.h"
ap_int
: signed; ap_uint is unsigned, e.g. ap_uint<24> is a 24-bit unsigned integer.
#include "ap_fixed.h"
ap_fixed and ap_ufixed
: fixed-point types of the form ap_fixed<W,I,Q,O>, where
W
: total word length
I
: integer word length
Q
: quantization mode
O
: overflow mode
For example, ap_ufixed<11,8,AP_TRN,AP_WRAP> denotes an unsigned fixed-point number with 11 bits in total, 8 of them integer bits (leaving 3 fraction bits), where
AP_TRN
: quantization by truncation
AP_WRAP
: overflow handled by wrapping, i.e. the most significant bits are simply dropped, so values wrap around; the alternative is the saturation mode AP_SAT, in which anything above the maximum is clamped to the maximum and anything below the minimum is clamped to the minimum.
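A small illustrative snippet (my own example, not from the cited tutorials) showing what the quantization and overflow modes do in practice:
#include <ap_fixed.h>
#include <iostream>

int main() {
    // 11 bits total, 8 integer bits -> 3 fraction bits, resolution 0.125, max 255.875
    ap_ufixed<11, 8, AP_TRN, AP_WRAP> a = 3.19;  // truncated down to 3.125
    ap_ufixed<11, 8, AP_TRN, AP_WRAP> b = 300.0; // above the maximum, wraps to 44
    ap_ufixed<11, 8, AP_TRN, AP_SAT>  c = 300.0; // saturates at 255.875
    std::cout << a.to_double() << " " << b.to_double() << " " << c.to_double() << std::endl;
    return 0;
}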
The relevant class member operations are listed below (see the C++ Arbitrary Precision Types section of the UG).
length()
to_int(), to_double(), to_string()
ap_concat_ref ap_(u)int::concat (ap_(u)int low)
ap_concat_ref ap_(u)int::operator , (ap_(u)int high, ap_(u)int low)
ap_uint<10> Rslt;
ap_int<3> Val1 = -3;
ap_int<7> Val2 = 54;
Rslt = (Val2, Val1); // Yields: 0x1B5
Rslt = Val1.concat(Val2); // Yields: 0x2B6
(Val1, Val2) = 0xAB; // Yields: Val1 == 1, Val2 == 43
ap_bit_ref ap_(u)int::operator [] (int bit)
ap_range_ref ap_(u)int::range (unsigned Hi, unsigned Lo)
ap_range_ref ap_(u)int::operator () (unsigned Hi, unsigned Lo)
ap_uint<4> Rslt;
ap_uint<8> Val1 = 0x5f;
ap_uint<8> Val2 = 0xaa;
Rslt = Val1.range(3, 0); // Yields: 0xF
Val1(3,0) = Val2(3, 0); // Yields: 0x5A
Val1(3,0) = Val2(4, 1); // Yields: 0x55
Rslt = Val1.range(0, 3); // Yields: 0xA; bit-reversed!
bool ap_(u)int::and_reduce ()
set(), clear(), invert()
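A brief illustrative snippet (my own example) of the reduction and bit-manipulation members:
ap_uint<4> v = 0xB;              // binary 1011
bool all_ones = v.and_reduce();  // false: not every bit is 1
bool any_one  = v.or_reduce();   // true
v.set(2);                        // 1111
v.clear(0);                      // 1110
v.invert(3);                     // 0110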
hls::vector
: hls::vector<T, N> represents a SIMD vector of N elements of type T and is a feature newly added in Vitis HLS. The common operators are already overloaded, so parallel arithmetic can be written directly, with the same effect as an explicit unroll. A complete example can be found in the Tutorial.
#include <hls_vector.h>
#include <hls_stream.h>
typedef hls::vector<int, 8> t_int8Vec;
t_int8Vec intVectorA, intVectorB;
void processVecStream(hls::stream<t_int8Vec> &inVecStream1, hls::stream<t_int8Vec> &inVecStream2, hls::stream<t_int8Vec> &outVecStream)
{
for(int i=0;i<32;i++)
{
#pragma HLS pipeline II=1
t_int8Vec aVec = inVecStream1.read();
t_int8Vec bVec = inVecStream2.read();
//performs a vector operation on 8 integers in parallel
t_int8Vec cVec = aVec * bVec;
outVecStream.write(cVec);
}
}
HLS counts the clock cycles of a design as the largest number of registers on any path between an input and an output.
Operation chaining means that several dependent operations can be completed back-to-back within a single clock cycle.
For example, with a 3 ns multiplier and a 2 ns adder, a multiply followed by an add (5 ns in total) can be chained into one 10 ns clock cycle instead of being spread over two cycles.
When pipeline or unroll is applied, inner loops are unrolled by default.
Once a loop is unrolled, it no longer appears as a loop in the HLS report.
Arrays (e.g. those to be split with array_partition) need to be declared as global variables; declared as local variables they are very slow.
#pragma HLS resource variable=shift_reg core=RAM_1P changes the memory core used to implement a variable (here a single-port RAM).
To make a=b+c use a DSP, add #pragma HLS RESOURCE variable=a core=AddSub_DSP.
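A small sketch of the two RESOURCE directives above in context (the function and variable names are illustrative, and the exact pragma placement may need adjusting for your tool version):
static int shift_reg[128]; // global/static array, as recommended above

int accumulate(int idx, int b, int c) {
    // Implement the array with a single-port RAM.
    #pragma HLS RESOURCE variable=shift_reg core=RAM_1P
    int a;
    // Map this addition onto a DSP block instead of LUT fabric.
    #pragma HLS RESOURCE variable=a core=AddSub_DSP
    a = b + c;
    shift_reg[idx & 127] = a;
    return a;
}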
The corresponding Tcl commands for the full flow are as follows.
# Simulate the C++ design
csim_design
# Synthesize the design
csynth_design
# Co-simulate the design
cosim_design
# Implement the design
export_design -flow impl
Vitis is a newer, higher-level programming framework from Xilinx; it embeds Vivado HLS together with a backend runtime, and building and running from the command line is considerably more convenient than with the original Vivado flow.
After downloading the full installation package, double-click xsetup to launch the installer; note that this requires a graphical environment and Java support. In the installation options simply select Vitis; by default it installs under the /tools/Xilinx/Vitis/2020.1 folder. After installation, sourcing settings64.sh configures the environment variables automatically.
If the runtime (XRT) is needed, it can be downloaded from the linked page. It contains the scripts for backend compilation, so a bitstream can be generated directly and then run on the board through the OpenCL programming interface.
Programming differs from plain Vivado HLS in that host-device data transfers are involved, so #pragma HLS interface directives have to be added, otherwise the design will not pass synthesis.
Information about the Alveo accelerator cards can be found on the official website, as well as on Nimbix's FPGA cloud service.
Reference code repositories:
Vivado HLS does not provide a parallel compilation option, so synthesizing a large design can take a long time. However, if the hardware design is done well (pipelines placed sensibly, arrays partitioned correctly), even a large design can finish synthesis within ten minutes. In other words, if a design takes very long to synthesize, the optimization has almost certainly not been done well.
Some of the problems that come up during synthesis may be answered in this blog.
Before synthesizing a large design, always run csim and cosim first to verify that the results are correct, and only then generate the bitstream and move to the board.
The following problem may appear during vivado_hls co-simulation.
/home/jfrye/sw/Xilinx/Vivado/2018.2/include/mpfr.h:244:2: error: ‘__gmp_const’ does not name a type
__MPFR_DECLSPEC __gmp_const char * mpfr_get_version _MPFR_PROTO ((void));
^~~~~~~~~~~
/home/jfrye/sw/Xilinx/Vivado/2018.2/include/mpfr.h:245:2: error: ‘__gmp_const’ does not name a type
__MPFR_DECLSPEC __gmp_const char * mpfr_get_patches _MPFR_PROTO ((void));
This is caused by a conflict between the header files shipped with the Xilinx tools and the system headers; it can be fixed by modifying the files under include.
Edit /tools/Xilinx/Vivado/2020.1/include/mpfr.h and change the system include <gmp.h> to the local include "gmp.h".
/* Check if GMP is included, and try to include it (Works with local GMP) */
#ifndef __GMP_H__
#include "gmp.h"
#endif
If that still does not work, add the following lines at the top of every source file (main.cpp and test.cpp); see the linked solution.
#include <gmp.h> // this should actually be the absolute path "/tools/Xilinx/Vivado/2020.1/include/gmp.h"
#define __gmp_const const
Reasonable compile flags also need to be added in the .tcl file; see the linked answer.
In general, if the co-simulation results do not appear for a long time, or the progress percentage exceeds 100%, the design has a problem; the run can be terminated manually ahead of time, see the linked answer.
Using vivado_hls inside WSL is actually rather troublesome; only after a lot of experimentation did a reasonably good solution emerge. Since installing Vivado on Linux requires a graphical interface, one attempt was to install a desktop environment inside WSL and run xsetup, but apparently because of the heavy interpretation overhead of the installer's Java VM, the normal installation screen never appeared, so this approach was eventually abandoned.
The approach that finally worked is to install the Vivado suite on Windows and call it from WSL through two layers of wrappers.
First make a copy of Xilinx\Vivado\2020.1\bin\vivado_hls.bat (name the copy, say, my_vivado_hls.bat) and edit its contents: remove the %COMSPEC% command that pops up a new window and replace it with a direct call to vivado_hls. The complete bat file is shown below; command-line arguments are forwarded through %1 %2.
@echo off
set PATH=%~dp0;%PATH%;%~dp0..\tps\win64\msys64\usr\bin;%~dp0..\tps\win64\msys64\mingw64\bin
set AUTOESL_HOME=%~dp0..
set VIVADO_HLS_HOME=%~dp0..
echo ===============================
echo == Vivado HLS Command Prompt
echo == Available commands:
echo == vivado_hls,apcc,gcc,g++,make
echo ===============================
set RDI_OS_ARCH=32
if [%PROCESSOR_ARCHITECTURE%] == [x86] (
if defined PROCESSOR_ARCHITEW6432 (
set RDI_OS_ARCH=64
)
) else (
if defined PROCESSOR_ARCHITECTURE (
set RDI_OS_ARCH=64
)
)
if not "%RDI_OS_ARCH%" == "64" goto _NotX64
set COMSPEC=%WINDIR%\SysWOW64\cmd.exe
rem %COMSPEC%
vivado_hls %1 %2
goto EOF
:_NotX64
set COMSPEC=%WINDIR%\System32\cmd.exe
rem %COMSPEC% /c %0 %1 %2 %3 %4 %5 %6 %7 %8 %9
vivado_hls %1 %2
:EOF
This is the Windows-side wrapper: once it is in place, synthesis can be launched directly from PowerShell with .\my_vivado_hls -f run.tcl, without a new window popping up.
The WSL-side wrapper is a bash script that runs the command above through cmd.exe, as follows.
#!/bin/bash
cmd.exe /c <path_to_xilinx>/Xilinx/Vivado/2020.1/bin/my_vivado_hls $1 $2
Arguments are again forwarded through $1 $2. Save this script as vivado_hls (note that <path_to_xilinx> should be the Windows-side path, and the script needs execute permission via chmod +x), so that it masquerades as the real executable, and put it somewhere that Linux's PATH can find. This way even which vivado_hls works correctly, which an alias cannot achieve.
With that in place, you can happily run vivado_hls -f run.tcl inside WSL for synthesis!
This article was compiled mainly from the following materials; special thanks to the authors and instructors of these courses, books, and documents.
The Zynq Book, http://www.zynqbook.com/, Chinese edition ↩
Zhiru Zhang, Cornell ECE 5775: High-Level Digital Design Automation, Fall 2018 ↩
Thierry Moreau, UW CSE548 Lab 3: Custom Acceleration with FPGAs, Spring 2017 ↩
Xilinx SDAccel Development Environment Help for 2018.2 XDF ↩
Ryan Kastner, Janarbek Matai, and Stephen Neuendorffer (UCSD), Parallel Programming for FPGAs (Github Online Book, Chinese), 2018 ↩