Publications
ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines

Jinming Zhuang*, Shaojie Xiang*, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou
FPGA (ACM/SIGDA International Symposium on Field-Programmable Gate Arrays), 2025 (Best Paper Nominee) |
[abs] |
[bib] |
As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD's AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very long instruction word (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE's specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to figure out the parallelism within a kernel, manually partition it, and assign sub-tasks to different AIE cores to exploit parallelism. This significantly lowers programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance.
To address these limitations, we introduce ARIES, an agile multi-level intermediate representation (MLIR)-based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, both with and without FPGAs, facilitating holistic optimization and better portability across AIE device families. For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput under FP32, INT16, and INT8 data types on the Versal VCK190, respectively. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x, respectively. For the ResNet residual layer, ARIES achieves up to 22.58x speedup compared with the optimized SOTA work Riallto on the Ryzen-AI NPU. ARIES is open-sourced on GitHub:
https://github.com/arc-research-lab/Aries.
@inproceedings{zhuang2025aries,
title={ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines},
author={Zhuang, Jinming and Xiang, Shaojie and Chen, Hongzheng and Zhang, Niansong and Yang, Zhuoping and Mao, Tony and Zhang, Zhiru and Zhou, Peipei},
booktitle={Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays},
pages={92--102},
year={2025}
}
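The entry above describes mapping tile-level GEMM sub-tasks onto a 2D array of AIE cores without restructuring the kernel code. The plain-Python/NumPy sketch below is illustrative only: the core grid, the round-robin placement, and all function names are assumptions, not the ARIES API. It keeps a simple tile kernel unchanged and performs the tile-to-core mapping in a separate step.

import numpy as np

def gemm_tile(A, B, C, i, j, k, T):
    # One tile-level sub-task: accumulate a (T x T) output tile.
    C[i*T:(i+1)*T, j*T:(j+1)*T] += A[i*T:(i+1)*T, k*T:(k+1)*T] @ B[k*T:(k+1)*T, j*T:(j+1)*T]

def map_tiles_to_cores(M, N, T, grid=(4, 4)):
    # Hypothetical mapping step: assign each (i, j) output tile to a core on a
    # 2D array, round-robin, without touching the kernel above.
    tasks = {}
    for i in range(M // T):
        for j in range(N // T):
            core = (i % grid[0], j % grid[1])
            tasks.setdefault(core, []).append((i, j))
    return tasks

M = N = K = 64
T = 16
A, B = np.random.rand(M, K), np.random.rand(K, N)
C = np.zeros((M, N))
for core, tiles in map_tiles_to_cores(M, N, T).items():
    for (i, j) in tiles:              # on hardware, each core would run its tiles concurrently
        for k in range(K // T):
            gemm_tile(A, B, C, i, j, k, T)
assert np.allclose(C, A @ B)
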
Allo: A Programming Model for Composable Accelerator Design

Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang
PLDI (ACM SIGPLAN Conference on Programming Language Design and Implementation), 2024 |
[abs] |
[bib] |
Blog (Zhihu)
Special-purpose hardware accelerators are increasingly pivotal for sustaining performance improvements in emerging applications, especially as the benefits of technology scaling continue to diminish. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures in a productive manner. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. Despite the introduction of several new accelerator design languages (ADLs) aiming to enhance or replace HLS, their advantages are more evident in relatively simple applications with a single kernel. Existing ADLs prove less effective for realistic hierarchical designs with multiple kernels, even if the design hierarchy is flattened. In this paper, we introduce Allo, a composable programming model for efficient spatial accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data types, from the algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. This approach facilitates holistic optimizations that span across function boundaries. We conduct comprehensive experiments on commonly used HLS benchmarks and several realistic deep learning models. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in PolyBench. For the GPT2 model, the inference latency of the Allo-generated accelerator is 1.7x faster than the NVIDIA A100 GPU with 5.4x higher energy efficiency, demonstrating the capability of Allo to handle large-scale designs.
@article{chen2024allo,
title={Allo: A programming model for composable accelerator design},
author={Chen, Hongzheng and Zhang, Niansong and Xiang, Shaojie and Zeng, Zhichen and Dai, Mengjia and Zhang, Zhiru},
journal={Proceedings of the ACM on Programming Languages},
volume={8},
number={PLDI},
pages={593--620},
year={2024},
publisher={ACM New York, NY, USA}
}
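To make the decoupling idea above concrete, the sketch below separates an unchanged GEMM algorithm from a schedule object that records customization primitives and composes them bottom-up across functions. All names here (Schedule, split, pipeline, compose) are hypothetical stand-ins for illustration, not the published Allo API.

def gemm(A, B):
    # Algorithm specification stays plain and unchanged.
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

class Schedule:
    # Records customization primitives separately from the algorithm.
    def __init__(self, func):
        self.func, self.primitives = func, []
    def split(self, loop, factor):
        self.primitives.append(("split", loop, factor))
        return self
    def pipeline(self, loop):
        self.primitives.append(("pipeline", loop))
        return self
    def compose(self, callee):
        # Bottom-up composition: a caller pulls in a callee's customizations unchanged.
        self.primitives += [(callee.func.__name__, p) for p in callee.primitives]
        return self

def top(A, B):
    return gemm(A, B)

s_gemm = Schedule(gemm).split("j", 8).pipeline("k")
s_top = Schedule(top).compose(s_gemm)
print(s_top.primitives)   # [('gemm', ('split', 'j', 8)), ('gemm', ('pipeline', 'k'))]
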
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
ACM TRETS (ACM Transactions on Reconfigurable Technology and Systems), 2024; FCCM (IEEE International Symposium on Field-Programmable Custom Computing Machines) 2024 Journal Track |
[abs] |
[bib] |
Blog (Zhihu)
Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead.
This article investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on field-programmable gate arrays (FPGAs). Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.
To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT2) on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4× speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
@article{chen2024understanding,
title={Understanding the potential of fpga-based spatial acceleration for large language model inference},
author={Chen, Hongzheng and Zhang, Jiahao and Du, Yixiao and Xiang, Shaojie and Yue, Zichao and Zhang, Niansong and Cai, Yaohui and Zhang, Zhiru},
journal={ACM Transactions on Reconfigurable Technology and Systems},
volume={18},
number={1},
pages={1--29},
year={2024},
publisher={ACM New York, NY}
}
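The paper above builds an analytical model of a spatial LLM accelerator from the on-chip compute and memory resources of an FPGA. The snippet below is a back-of-the-envelope, roofline-style sketch in that spirit only; the FLOP/byte counts and the 1 TFLOP/s and 400 GB/s figures are placeholder assumptions, not the paper's model or measured numbers.

def layer_latency(flops, bytes_moved, peak_flops, mem_bw):
    # Latency is bounded by the slower of compute and off-chip memory traffic.
    return max(flops / peak_flops, bytes_moved / mem_bw)

# Rough figures for one decode step of a GPT2-scale layer (hidden size d = 768),
# assuming FP16 weights are streamed from off-chip memory once per token.
d = 768
flops = 2 * (4 * d * d + 2 * d * 4 * d)        # QKV/output projections + MLP, batch size 1
bytes_moved = 2 * (4 * d * d + 2 * d * 4 * d)  # 2 bytes per FP16 weight dominates traffic

peak_flops = 1.0e12    # placeholder: ~1 TFLOP/s of on-chip compute
mem_bw = 400e9         # placeholder: ~400 GB/s of HBM bandwidth

t = layer_latency(flops, bytes_moved, peak_flops, mem_bw)
bound = "memory" if bytes_moved / mem_bw > flops / peak_flops else "compute"
print(f"estimated per-layer decode latency: {t * 1e6:.1f} us ({bound}-bound)")
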
Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, Yida Wang
ASPLOS (ACM International Conference on Architectural Support for Programming Languages and Operating Systems), 2024 |
[abs] |
[bib] |
Amazon Science
Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at the price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some of the flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM).
In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language, Slapo, to decouple model execution from definition. Specifically, Slapo works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, Slapo progressively optimizes the model "as-needed" through high-level primitives, and thus preserves programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way using Slapo, we are able to improve training throughput by up to 2.92× on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.41× on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.
@inproceedings{chen2024slapo,
title={Slapo: A schedule language for progressive optimization of large deep learning model training},
author={Chen, Hongzheng and Yu, Cody Hao and Zheng, Shuai and Zhang, Zhen and Zhang, Zhiru and Wang, Yida},
booktitle={Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages={1095--1111},
year={2024}
}
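As a rough illustration of scheduling a PyTorch model without editing its definition, the sketch below wraps one submodule's forward pass with activation checkpointing through a tiny Schedule helper. The Schedule class and its checkpoint primitive are hypothetical stand-ins for the idea, not Slapo's actual API.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MLP(nn.Module):                 # model definition stays as-is
    def __init__(self, d=256):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d, 4 * d), nn.Linear(4 * d, d)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class Schedule:
    # Applies training optimizations to named submodules without editing the model code.
    def __init__(self, model):
        self.model = model
    def checkpoint(self, name):
        mod = dict(self.model.named_modules())[name]
        fwd = mod.forward
        # Wrap the submodule's forward with activation checkpointing (recompute in backward).
        mod.forward = lambda *xs: checkpoint(fwd, *xs, use_reentrant=False)
        return self

model = nn.Sequential(MLP(), MLP())
sch = Schedule(model).checkpoint("0")   # opt in progressively, one submodule at a time

x = torch.randn(8, 256, requires_grad=True)
model(x).sum().backward()               # still trains; activations of block "0" are recomputed
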
Formal Verification of Source-to-Source Transformations for HLS

Louis-Noël Pouchet, Emily Tucker, Niansong Zhang, Hongzheng Chen, Debjit Pal, Gabriel Rodríguez, Zhiru Zhang
FPGA (ACM/SIGDA International Symposium on Field-Programmable Gate Arrays), 2024 (Best Paper Award) |
[abs] |
[bib] |
High-level synthesis (HLS) can greatly facilitate the description of complex hardware implementations, by raising the level of abstraction up to a classical imperative language such as C/C++, usually augmented with vendor-specific pragmas and APIs. Despite productivity improvements, attaining high performance for the final design remains a challenge, and higher-level tools like source-to-source compilers have been developed to generate programs targeting HLS toolchains. These tools may generate highly complex HLS-ready C/C++ code, reducing the programming effort and enabling critical optimizations. However, whether these HLS-friendly programs are produced by a human or a tool, validating their correctness or exposing bugs otherwise remains a fundamental challenge. In this work we target the problem of efficiently checking the semantics equivalence between two programs written in C/C++ as a means to ensuring the correctness of the description provided to the HLS toolchain, by proving an optimized code version fully preserves the semantics of the unoptimized one. We introduce a novel formal verification approach that combines concrete and abstract interpretation with a hybrid symbolic analysis. Notably, our approach is mostly agnostic to how control-flow, data storage, and dataflow are implemented in the two programs. It can prove equivalence under complex bufferization and loop/syntax transformations, for a rich class of programs with statically interpretable control-flow. We present our techniques and their complete end-to-end implementation, demonstrating how our system can verify the correctness of highly complex programs generated by source-to-source compilers for HLS, and detect bugs that may elude co-simulation.
@inproceedings{pouchet2024formal,
title={Formal verification of source-to-source transformations for hls},
author={Pouchet, Louis-No{\"e}l and Tucker, Emily and Zhang, Niansong and Chen, Hongzheng and Pal, Debjit and Rodr{\'\i}guez, Gabriel and Zhang, Zhiru},
booktitle={Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays},
pages={97--107},
year={2024}
}
BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
Tianfeng Liu*, Yangrui Chen*, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, Chuanxiong Guo
NSDI (USENIX Symposium on Networked Systems Design and Implementation), 2023 |
[abs] |
[bib] |
Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training large graphs with billions of nodes and edges on GPUs. The main bottlenecks are the processes of preparing data for GPUs: subgraph sampling and feature retrieving. This paper proposes BGL, a distributed GNN training system designed to address the bottlenecks with a few key ideas. First, we propose a dynamic cache engine to minimize feature retrieving traffic. By co-designing the caching policy and the order of sampling, we find a sweet spot of low overhead and a high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems by 1.9x on average.
@inproceedings{liu2023bgl,
title={BGL: GPU-Efficient GNN training by optimizing graph data I/O and preprocessing},
author={Liu, Tianfeng and Chen, Yangrui and Li, Dan and Wu, Chuan and Zhu, Yibo and He, Jun and Peng, Yanghua and Chen, Hongzheng and Chen, Hongzhi and Guo, Chuanxiong},
booktitle={20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
pages={103--118},
year={2023}
}
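A toy version of the caching idea above: a fixed-capacity LRU feature cache plus a comparison between a random and a locality-preserving sampling order. The graph, cache size, and fetch function are invented for illustration and do not reflect BGL's actual engine or policy.

from collections import OrderedDict
import random

class FeatureCache:
    def __init__(self, capacity):
        self.capacity, self.store = capacity, OrderedDict()
        self.hits = self.misses = 0
    def get(self, node, fetch):
        if node in self.store:
            self.hits += 1
            self.store.move_to_end(node)   # LRU bookkeeping
            return self.store[node]
        self.misses += 1
        feat = fetch(node)                 # slow path: remote/CPU feature fetch
        self.store[node] = feat
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)
        return feat

# Co-design sketch: visiting seeds in a locality-preserving order (here, sorted by
# node id on a ring graph) raises the hit ratio versus a random order.
neighbors = {v: [(v + d) % 1000 for d in (-2, -1, 1, 2)] for v in range(1000)}
seeds = random.sample(range(1000), 500)
for order, label in [(seeds, "random order"), (sorted(seeds), "locality order")]:
    cache = FeatureCache(capacity=64)
    for s in order:
        for n in neighbors[s]:
            cache.get(n, fetch=lambda v: [0.0] * 16)
    print(label, "hit ratio:", round(cache.hits / (cache.hits + cache.misses), 2))
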
Accelerator Design with Decoupled Hardware Customizations: Benefits and Challenges
Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini, Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, Zhiru Zhang
DAC (ACM/IEEE Design Automation Conference), 2022 (Invited Paper) |
[abs] |
[bib]
The past decade has witnessed increasing adoption of high-level synthesis (HLS) to implement specialized hardware accelerators targeting either FPGAs or ASICs. However, current HLS programming models entangle algorithm specifications with hardware customization techniques, which lowers both the productivity and portability of the accelerator design. To tackle this problem, recent efforts such as HeteroCL propose to decouple algorithm definition from essential hardware customization techniques in compute, data type, and memory, increasing productivity, portability, and performance.
While the decoupling of the algorithm and customizations provides benefits to the compilation/synthesis process, they also create new hurdles for the programmers to productively debug and validate the correctness of the optimized design. In this work, using HeteroCL and realistic machine learning applications as case studies, we first explain the key advantages of the decoupled programming model brought to a programmer to rapidly develop high-performance accelerators. Using the same case studies, we will further show how seemingly benign usage of the customization primitives can lead to new challenges to verification. We will then outline the research opportunities and discuss some of our recent efforts as the first step to enable a robust and viable verification solution in the future.
@inproceedings{pal2022accelerator,
title={Accelerator design with decoupled hardware customizations: benefits and challenges},
author={Pal, Debjit and Lai, Yi-Hsiang and Xiang, Shaojie and Zhang, Niansong and Chen, Hongzheng and Casas, Jeremy and Cocchini, Pasquale and Yang, Zhenkun and Yang, Jin and Pouchet, Louis-No{\"e}l and others},
booktitle={Proceedings of the 59th ACM/IEEE Design Automation Conference},
pages={1351--1354},
year={2022}
}
HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs

Shaojie Xiang, Yi-Hsiang Lai, Yuan Zhou, Hongzheng Chen, Niansong Zhang, Debjit Pal, Zhiru Zhang
FPGA (ACM/SIGDA International Symposium on Field-Programmable Gate Arrays), 2022 |
[abs] |
[bib] |
To achieve high performance with FPGA-equipped heterogeneous compute systems, it is crucial to co-optimize data placement and compute scheduling to maximize data reuse and bandwidth utilization for both on- and off-chip memory accesses. However, optimizing the data placement for FPGA accelerators is a complex task. One must acquire in-depth knowledge of the target FPGA device and its associated memory system in order to apply a set of advanced optimizations. Even with the latest high-level synthesis (HLS) tools, programmers often have to insert many low-level vendor-specific pragmas and substantially restructure the algorithmic code so that the right data are accessed at the right loop level using the right communication schemes. These code changes can significantly compromise the composability and portability of the original program. To address these challenges, we propose HeteroFlow, an FPGA accelerator programming model that decouples the algorithm specification from optimizations related to orchestrating the placement of data across a customized memory hierarchy. Specifically, we introduce a new primitive named .to(), which provides a unified programming interface for specifying data placement optimizations at different levels of granularity: (1) coarse-grained data placement between host and accelerator, (2) medium-grained kernel-level data placement within an accelerator, and (3) fine-grained data placement within a kernel. We build HeteroFlow on top of the open-source HeteroCL DSL and compilation framework. Experimental results on a set of realistic benchmarks show that programs written in HeteroFlow can match the performance of extensively optimized manual HLS designs with much fewer lines of code.
@inproceedings{xiang2022heteroflow,
title={Heteroflow: An accelerator programming model with decoupled data placement for software-defined fpgas},
author={Xiang, Shaojie and Lai, Yi-Hsiang and Zhou, Yuan and Chen, Hongzheng and Zhang, Niansong and Pal, Debjit and Zhang, Zhiru},
booktitle={Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
pages={78--88},
year={2022}
}
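The .to() primitive named in the abstract can be pictured with the mock below; the primitive name comes from the paper, but the Tensor class, target strings, and options are invented for illustration and are not HeteroFlow's real syntax.

class Tensor:
    def __init__(self, name):
        self.name, self.placements = name, []
    def to(self, target, **opts):
        # Record where (and how) this buffer should live; the algorithm code is untouched.
        self.placements.append((target, opts))
        return self

A = Tensor("A")
A.to("FPGA", burst_len=256)                  # (1) coarse-grained: host -> accelerator
A.to("kernel:conv2", channel="FIFO")         # (2) medium-grained: between on-chip kernels
A.to("conv2:reuse_buffer", level="loop_j")   # (3) fine-grained: within a kernel

for target, opts in A.placements:
    print(f"place {A.name} at {target} with {opts}")
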
Krill: A Compiler and Runtime System for Concurrent Graph Processing

Hongzheng Chen, Minghua Shen, Nong Xiao, Yutong Lu
SC (International Conference for High Performance Computing, Networking, Storage and Analysis), 2021 |
[abs] |
[bib] |
As a large number of emerging graph applications spread across different domains, the need for processing massive concurrent graph jobs (CGJs) is increasing. However, existing graph processing systems designed for a single job cannot efficiently tackle multiple CGJs, where they suffer from interfering memory access patterns and inefficient property management. In this paper, we introduce Krill, a compiler and runtime system for processing concurrent graph jobs. We propose an SAP model, which decouples graph structure, algorithm, and property. In the compiler, we propose leveraging the property buffer to easily write and manage property data. In the runtime system, we propose a novel technique named graph kernel fusion to reduce memory accesses, which fuses all the jobs and processes them as a whole. Experimental results show our system significantly reduces the number of memory accesses for CGJs by more than 6x compared with the baseline, and achieves up to 6.76x speedup with 3.84x shorter response latency than GraphM, the state-of-the-art concurrent graph processing system.
@inproceedings{chen2021krill,
title={Krill: a compiler and runtime system for concurrent graph processing},
author={Chen, Hongzheng and Shen, Minghua and Xiao, Nong and Lu, Yutong},
booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
pages={1--16},
year={2021}
}
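A toy picture of the graph kernel fusion idea described above: two concurrent jobs' edge-update rules run inside one shared sweep over the adjacency structure, so the structure is read once rather than once per job. The jobs and graph are made up; this is not Krill's implementation.

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # shared structure (adjacency lists)

def count_out(prop, u, v):   # job 1: out-degree
    prop[u] += 1

def count_in(prop, u, v):    # job 2: in-degree
    prop[v] += 1

# Each concurrent graph job keeps its own property array and edge update rule.
jobs = {"out_degree": ([0] * 4, count_out), "in_degree": ([0] * 4, count_in)}

for u, outs in graph.items():                # one traversal of the structure...
    for v in outs:
        for prop, update in jobs.values():   # ...drives every job's property update
            update(prop, u, v)

for name, (prop, _) in jobs.items():
    print(name, prop)                        # out_degree [2, 1, 1, 1], in_degree [1, 1, 3, 0]
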
FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations

Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, Zhiru Zhang
FPGA (ACM/SIGDA International Symposium on Field-Programmable Gate Arrays), 2021 (Best Paper Nominee) |
[abs] |
[bib] |
Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are well suited for FPGAs, as their dominant computations are bitwise arithmetic and the memory requirement is also significantly reduced. However, compared to state-of-the-art compact convolutional neural network (CNN) models, BNNs tend to produce a much lower accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs has gradually become a major compute bottleneck, because it is conventionally excluded from binarization to avoid a large accuracy loss.
This work proposes FracBNN, which exploits fractional activations to substantially improve the accuracy of BNNs. Specifically, our approach employs a dual-precision activation scheme to compute features with up to two bits, using an additional sparse binary convolution. We further binarize the input layer using a novel thermometer encoding. Overall, FracBNN preserves the key benefits of conventional BNNs, where all convolutional layers are computed in pure binary MAC operations (BMACs). We design an efficient FPGA-based accelerator for our novel BNN model that supports the fractional activations. To evaluate the performance of FracBNN under a resource-constrained scenario, we implement the entire optimized network architecture on an embedded FPGA (Xilinx Ultra96 v2). Our experiments on ImageNet show that FracBNN achieves an accuracy comparable to MobileNetV2, surpassing the best-known BNN design on FPGAs with an increase of 28.9% in top-1 accuracy and a 2.5x reduction in model size. FracBNN also outperforms a recently introduced BNN model with an increase of 2.4% in top-1 accuracy while using the same model size. On the embedded FPGA device, FracBNN demonstrates the ability of real-time image classification.
@inproceedings{zhang2021fracbnn,
title={FracBNN: Accurate and FPGA-efficient binary neural networks with fractional activations},
author={Zhang, Yichi and Pan, Junhao and Liu, Xinheng and Chen, Hongzheng and Chen, Deming and Zhang, Zhiru},
booktitle={The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
pages={171--182},
year={2021}
}
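Thermometer encoding, which the paper uses to binarize the input layer, can be sketched as follows. The channel count and the evenly spaced thresholds here are illustrative defaults, not FracBNN's exact configuration.

import numpy as np

def thermometer_encode(img, n_channels=8):
    # uint8 image (H, W) -> binary (n_channels, H, W); brighter pixels light up more channels.
    thresholds = np.arange(1, n_channels + 1) * (256 / (n_channels + 1))
    return (img[None, :, :] >= thresholds[:, None, None]).astype(np.uint8)

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
enc = thermometer_encode(img)
print(enc[:, 1, 1])   # brightest pixel -> all channels are 1
print(enc[:, 0, 0])   # darkest pixel  -> all channels are 0
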
Entropy-Directed Scheduling for FPGA High-Level Synthesis
Minghua Shen, Hongzheng Chen*, Nong Xiao
IEEE TCAD (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems), 2020 |
[abs] |
[bib] |
High-level synthesis (HLS) is important for compiling an application design onto a field-programmable gate array (FPGA) but still faces challenges of balancing scalability and quality of results in the scheduling process. In this article, we propose an entropy-directed scheduling (EDS) algorithm that efficiently generates high-quality schedules for FPGA HLS. This article is novel in three ways. First, we make the first attempt to adopt entropy in the scheduling of HLS, which is an intuitive and robust measurement with lots of good analytic properties. Second, we creatively leverage the maximum entropy principle to describe the scheduling process, which is proved equivalent to the optimal solution in some particular cases. Third, we make EDS automatically analyze the input graph structure and leverage a three-stage scheduling process to obtain high-quality results. As a result, EDS has the lowest time complexity among existing scheduling algorithms and is flexible to solve both latency- and resource-constrained problems while satisfying other constraints. The experimental results show that for the latency-constrained scheduling problem, EDS reduces resource usage by up to 68% and is 294× faster than the force-directed scheduling algorithm. For the resource-constrained scheduling problem, EDS obtains near-optimal solutions with an average speedup of 16,410× compared with ILP. To the best of our knowledge, this is the first EDS algorithm for FPGA HLS.
@article{shen2019entropy,
title={Entropy-directed scheduling for FPGA high-level synthesis},
author={Shen, Minghua and Chen, Hongzheng and Xiao, Nong},
journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
volume={39},
number={10},
pages={2588--2601},
year={2019},
publisher={IEEE}
}
A Deep-Reinforcement-Learning-Based Scheduler for FPGA HLS
Hongzheng Chen, Minghua Shen
ICCAD (IEEE/ACM International Conference on Computer-Aided Design), 2019 |
[abs] |
[bib] |
As the most critical stage in FPGA HLS, scheduling depends heavily on heuristics due to their speed, flexibility, and scalability. However, designing heuristics easily involves human bias, which makes scheduling unpredictable in some specific cases. To solve the problem, we propose an efficient deep reinforcement learning (Deep-RL) based scheduler for FPGA HLS. It has the potential to minimize human involvement and learn to schedule by itself. The proposed scheduler consists of three steps. First, we design a novel state and action representation for constrained scheduling problems, which is the foundation of the learning task. Then, we leverage a training pipeline to train the policy network. Specifically, supervised learning is used to initialize the weights of the network and reinforcement learning is used to improve the performance, both of which make the Deep-RL based scheduler practical for HLS. At last, we compare our scheduler with the ASAP schedule and the optimal ILP schedule. Experimental results show that the proposed scheduler can reduce resource usage by up to 74% compared with the original ASAP schedule, and the gap to the optimal solution is very small. Notably, this is the first work leveraging reinforcement learning in HLS and has great potential to be integrated into different HLS systems.
@inproceedings{chen2019deep,
title={A deep-reinforcement-learning-based scheduler for fpga hls},
author={Chen, Hongzheng and Shen, Minghua},
booktitle={2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)},
pages={1--8},
year={2019},
organization={IEEE}
}
Workshops / Preprints
HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Hongzheng Chen*, Yingheng Wang*, Yaohui Cai*, Hins Hu*, Jiajie Li*, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang
arXiv:2506.07972, 2025 |
[abs] |
[bib] |
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
@article{chen2025heurigym,
title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
journal={arXiv preprint arXiv:2506.07972},
year={2025}
}
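The abstract introduces the Quality-Yield Index (QYI) but does not state its formula; the sketch below combines yield (pass rate) and average solution quality with a geometric mean purely to illustrate the shape of such a metric. This particular combination is an assumption, not the paper's definition.

def quality_yield_index(results):
    # results: list of (passed: bool, quality: float in [0, 1]) per problem instance.
    if not results:
        return 0.0
    yield_rate = sum(p for p, _ in results) / len(results)
    qualities = [q for p, q in results if p]
    avg_quality = sum(qualities) / len(qualities) if qualities else 0.0
    return (yield_rate * avg_quality) ** 0.5   # geometric mean of the two factors

runs = [(True, 0.9), (True, 0.7), (False, 0.0), (True, 0.8)]
print(round(quality_yield_index(runs), 3))     # 0.75 yield, 0.8 avg quality -> ~0.775
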
Allo: Catalyzing Accelerator Design and Programming for Machine Learning
Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhiru Zhang
C4ML@CGO (Compilers for Machine Learning Workshop at the International Symposium on Code Generation and Optimization), 2025 |
[abs] |
[bib] |
As the benefits of technology scaling diminish, specialized hardware accelerators are crucial for performance in emerging machine learning applications. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. While new accelerator design languages (ADLs) aim to enhance or replace HLS, they are typically more effective for simple applications with a single kernel, rather than for hierarchical designs with multiple kernels.
In the first part of this talk, we will introduce Allo, a composable programming model for efficient hardware accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data types from algorithm specification, and encapsulates them as a set of customization primitives, which enables verifiable stepwise optimizations. Allo also preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner, enabling both temporal and spatial composition.
We will then illustrate how Allo optimizes large-scale designs with two case studies. First, we develop a spatial accelerator architecture for large language models (LLMs) and prototype it on an AMD U280 FPGA, demonstrating higher energy efficiency than NVIDIA GPUs in generative inference settings. In addition, we deploy a convolutional neural network (CNN) design using Allo on the AMD Ryzen AI Engine, achieving substantial speedups over prior approaches.
@article{chen2024allo,
title={Allo: A programming model for composable accelerator design},
author={Chen, Hongzheng and Zhang, Niansong and Xiang, Shaojie and Zeng, Zhichen and Dai, Mengjia and Zhang, Zhiru},
journal={Proceedings of the ACM on Programming Languages},
volume={8},
number={PLDI},
pages={593--620},
year={2024},
publisher={ACM New York, NY, USA}
}
Uncovering Magic with Magic: Schedule Reconstruction from High-Performance Kernel Libraries
Hongzheng Chen
PLDI SRC (ACM SIGPLAN Conference on Programming Language Design and Implementation, Student Research Competition), 2024 (Bronze) |
[abs] |
[bib] |
To fully harness the power of hardware accelerators, high-performance kernels are essential for various application domains in deep learning and scientific computing. Many of these kernels are carefully crafted and optimized by hand using languages like C/C++, demanding significant engineering efforts. Recent developments in kernel libraries have embraced schedule languages that decouple program optimization (i.e., schedule) from algorithm specification. Users only need to write a few lines of schedule code to transform a basic program into an optimized one. However, this shift merely transfers the coding burden from writing optimized kernels to writing efficient schedules. While autoschedulers help generate optimized programs automatically, users have little control over the optimizations and their granularity remains opaque. To tackle this challenge, we introduce a novel research problem: schedule reconstruction, that is, reconstructing the schedule primitives from an optimized kernel implementation. Unlike existing approaches that focus on generating optimized kernels, our goal is to reveal the optimizations within these kernels by generating step-by-step schedule primitives. By providing an algorithm specification and an optimized kernel implementation, we propose a two-stage program synthesis technique to automatically generate the necessary schedule for this transformation. While this paper represents an initial exploration of schedule synthesis, our work demonstrates the potential of this idea to: (1) aid programmers in comprehending the intricacies of high-performance program optimizations, making it easier to debug and to experiment with different tradeoff combinations; and (2) potentially provide templates that enable compilers to automatically generate even more high-performance code than manually optimized ones.
@article{chen2024pldisrc,
title={Uncovering Magic with Magic: Schedule Reconstruction from High-Performance Kernel Libraries},
author={Chen, Hongzheng},
journal={ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) Student Research Competition (SRC)},
year={2024}
}
Structured Pruning is All You Need for Pruning CNNs at Initialization
Yaohui Cai, Weizhe Hua, Hongzheng Chen, G. Edward Suh, Christopher De Sa, Zhiru Zhang
arXiv:2203.02549, 2022 |
[abs] |
[bib]
Pruning is a popular technique for reducing the model size and computational cost of convolutional neural networks (CNNs). However, a slow retraining or fine-tuning procedure is often required to recover the accuracy loss caused by pruning. Recently, a new research direction on weight pruning, pruning-at-initialization (PAI), is proposed to directly prune CNNs before training so that fine-tuning or retraining can be avoided. While PAI has shown promising results in reducing the model size, existing approaches rely on fine-grained weight pruning which requires unstructured sparse matrix computation, making it difficult to achieve real speedup in practice unless the sparsity is very high. This work is the first to show that fine-grained weight pruning is in fact not necessary for PAI. Instead, the layerwise compression ratio is the main critical factor to determine the accuracy of a CNN model pruned at initialization. Based on this key observation, we propose PreCropping, a structured hardware-efficient model compression scheme. PreCropping directly compresses the model at the channel level following the layerwise compression ratio. Compared to weight pruning, the proposed scheme is regular and dense in both storage and computation without sacrificing accuracy. In addition, since PreCropping compresses CNNs at initialization, the computational and memory costs of CNNs are reduced for both training and inference on commodity hardware. We empirically demonstrate our approaches on several modern CNN architectures, including ResNet, ShuffleNet, and MobileNet for both CIFAR-10 and ImageNet.
@article{cai2022pai,
title={Structured pruning is all you need for pruning CNNs at initialization},
author={Cai, Yaohui and Hua, Weizhe and Chen, Hongzheng and Suh, G Edward and De Sa, Christopher and Zhang, Zhiru},
journal={arXiv preprint arXiv:2203.02549},
year={2022}
}
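A toy sketch of the PreCropping idea above: channel counts are shrunk at initialization according to layerwise keep ratios, yielding a smaller but still dense and regular model for both training and inference. The helper name and the ratios below are placeholders, not the paper's code or settings.

import torch.nn as nn

def precrop(channels, keep_ratios):
    # channels: per-layer output channel counts; keep_ratios: per-layer fraction to keep.
    return [max(1, int(c * r)) for c, r in zip(channels, keep_ratios)]

base_channels = [64, 128, 256, 512]
keep_ratios = [1.0, 0.75, 0.5, 0.5]          # layerwise compression ratios (placeholders)
cropped = precrop(base_channels, keep_ratios)

layers, in_ch = [], 3
for out_ch in cropped:
    layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU()]
    in_ch = out_ch
model = nn.Sequential(*layers)
print(cropped)   # [64, 96, 128, 256]: dense, regular layers, just narrower
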