这篇文章主要谈谈由Halide这门图像处理领域专用语言(Domain Specific Language, DSL)中衍生出来的故事。 关于DSL的内容则请见领域特定语言。
图像处理结合了蒙版计算和流应用的挑战 – 多蒙版多阶段,但同一模板只用一次。
现状是
解决方案:
最简单的两阶段流水
alloc blurx[2048][3072]
for each y in 0..2048:
for each x in 0..3072:
blurx[y][x] = In[y][x-1] + In[y][x] + In[y][x+1]
alloc out[2046][3072]
for each y in 1..2047:
for each x in 0..3072:
out[y][x]=blurx[y-1][x] + blurx[y][x] + blurx[y+1][x]
交织不存储中间值,局部性好但是由于共享的blurx
没有被重用,因此有冗余计算
alloc out[2046][3072]
for each y in 1..2047:
for each x in 0..3072:
alloc blurx[-1..1]
for each i in -1..1:
blurx[i]= In[y-1+i][x-1]+In[y-1+i][x]+In[y-1+i][x+1]
out[y][x] = blurx[0] + blurx[1] + blurx[2]
滑动窗口(sliding window),只计算每个blurx
一次并存储;引入条件依赖,牺牲并行性
alloc out[2046][3072]
alloc blurx[3][3072]
for each y in -1..2047:
for each x in 0..3072:
blurx[(y+1)%3][x]=In[y+1][x-1]+In[y+1][x]+In[y+1][x+1]
if y < 1: continue
out[y][x] = blurx[(y-1)%3][x]
+ blurx[ y % 3 ][x]
+ blurx[(y+1)%3][x]
上述的策略存在的问题:
一个更好的方式是在铺砖(tile)层面达到平衡
alloc out[2046][3072]
for each ty in 0..2048/32:
for each tx in 0..3072/32:
alloc blurx[-1..33][32]
for y in -1..33:
for x in 0..32:
blurx[y][x] = In[ty*32+y][tx*32+x-1]
+ In[ty*32+y][tx*32+x]
+ In[ty*32+y][tx*32+x+1]
for y in 0..32:
for x in 0..32:
out[ty*32+y][tx*32+x] = blurx[y-1][x]
+ blurx[y ][x]
+ blurx[y+1][x]
注意:Halide以C++作为宿主语言,后端用llvm(+自己SIMD优化)直接编译为汇编
Func blur_3x3(Func input) {
Func blur_x, blur_y;
Var x, y, xi, yi;
// The algorithm - no storage or order
blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;
// The schedule - defines order, locality; implies storage
blur_y.tile(x, y, xi, yi, 256, 32)
.vectorize(xi, 8).parallel(y);
blur_x.compute_at(blur_y, x).vectorize(x, 8);
return blur_y;
}
搜索空间巨大 $>10^{720}$ 种调度策略,采用基因算法
五个图像处理算法
组合性的赞歌!
GraphIt does not introduce any new optimizations. Instead, the DSL achieves competitive or better performance compared to other frameworks by generating efficient implementations of known combinations of optimizations, and finding previously unexplored combinations by searching through a much larger space of optimizations.
同样分离开算法语言与调度语言,使程序员免除考虑底层并行同步的问题
图的优化
struct
filter(func f)
apply(func f)
from(vertexset vset)
to(vertexset vset)
configApplyDirection(label, config)
configApplyParallelization(label, config, [grainsize], [direction]
configApplyNumSSG
configApplyNUMA
fuseFields({vec1, vec2, ...})
:融合多个数组进struct
fuseForLoop
fuseApplyFunctions
通过scoped labels和name nodes访问
#l1# for i in 1:10
#s1# edges.apply(func1);
end
#l2# for i in 1:10
#s1# edges.apply(func2);
end
schedule:
program->fuseForLoop("l1", "l2", "l3")
->fuseApplyFunctions("l3:l1:s1", "l3:l2:s1", "fusedFunc")
->configApplyDirection("l3:l1:s1", "DensePull");
用多维向量表示不同的优化组合,四维向量
\[\langle S, B, O, I\rangle:=\langle \text{SSG\_ID}, \text{BSG\_ID}, \text{OuterIter}, \text{InnerIter}\rangle\]每一个维度都用tag标记,比如configApplyDirection("s1","SparsePush")
的图迭代空间为</,/,O[src,SR,SA],I[dst,SR]>
,其中SR为serial,SA为sparse array
编译到C++,目前不支持多后端
两种设计DSL的方法
Halide | GraphIt | Chisel6+Firrtl7 | Spatial | |
---|---|---|---|---|
面向平台 | CPU/GPU | CPU | FPGA | FPGA |
分离算法与执行 | √ | √ | × | ×(隐性) |
IR | √ | × | √ | √ |
编译优化 | 大量 | 几乎无 | 部分 | 大量 |
Autotuning | √ | √ | × | √ |
GraphIt和Chisel更多算Metalanguage/Code Generator,而不是真正的compiler
更加详细关于DSL的内容请见领域特定语言。
Jonathan Ragan-Kelley (MIT), Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe, Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines, PLDI, 2013 ↩
Stencil code, https://en.wikipedia.org/wiki/Stencil_code ↩
Yunming Zhang (MIT), Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe, GraphIt: A High-Performance Graph DSL, OOPSLA, 2018 ↩
David Koeplinger (Stanford), Christos Kozyrakis, Kunle Olukotun, et al., Spatial: A Language and Compiler for Application Accelerators, PLDI, 2018 ↩
Tianqi Chen (UW), Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Arvind Krishnamurthy, et al., TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, OSDI, 2018 ↩
Jonathan Bachrach (UCB), Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, Krste Asanovic, et al., Chisel: Constructing Hardware in a Scala Embedded Language, DAC, 2012 ↩
Adam Izraelevitz (UCB), Jack Koenig, Patrick Li, Richard Lin, Angie Wang, Albert Magyar, Donggyu Kim, Colin Schmidt, Chick Markley, Jim Lawson, Jonathan Bachrach, Reusability is FIRRTL Ground: Hardware Construction Languages, Compiler Frameworks, and Transformations, ICCAD, 2017 ↩