I am a Ph.D. candidate in the Computer Systems Laboratory at Cornell University, advised by Prof. Zhiru Zhang. I received my B.E. in Computer Science with highest honors from Sun Yat-sen University.

I have been fortunate to intern at Google, NVIDIA, AWS, and ByteDance, contributing to projects in large-scale machine learning systems, including Slapo [ASPLOS’24] with AWS, BGL [NSDI’23] with ByteDance, Magellan [C4ML @ CGO’26] with Google, and Tawa [CGO’26] with NVIDIA. I have also closely collaborated with three major hardware vendors (Intel, AMD, and NVIDIA) on various compiler projects, including Tawa [CGO’26] with NVIDIA, Allo [PLDI’24] with AMD, and HeteroCL-MLIR [DAC’22] with Intel. My research is mainly supported by the SRC JUMP 2.0 ACE Center and has been recognized with three Best Paper nominations and a Best Paper Award at top-tier hardware conferences. I was also named a 2024 ML and Systems Rising Star and selected as a finalist for the 2025 Qualcomm Innovation Fellowship.

Research Highlights

My research focuses on building compilers, programming systems, and accelerators for large-scale machine learning workloads, with an emphasis on large language models (LLMs). In particular, I aim to build performant and scalable systems that let programmers productively harness heterogeneous hardware (GPUs, TPUs, NPUs) for emerging machine learning applications such as generative AI.

indicates projects where I am the project lead

Accelerator Programming Frameworks:

  • Tawa [CGO’26] introduces the first automatic warp-specialization compiler, generating efficient LLM kernels such as FlashAttention-3/4 on NVIDIA Hopper and Blackwell GPUs (see the sketch after this list). The proposed NVWS dialect has been merged upstream into OpenAI Triton.
  • Slapo [ASPLOS’24] is a distributed LLM pre-training framework deployed at AWS, designed to balance usability and performance. It has influenced the design of ByteDance’s veScale and Meta’s TorchTitan.
  • BGL [NSDI’23] is a production-scale GNN training framework used at ByteDance, reducing billion-node graph training time from weeks to days.
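To make the warp-specialization idea concrete, below is a standard tiled matmul written in vanilla Triton; it is a sketch of mine, not code from the Tawa paper, and the block sizes and no-masking simplification are my assumptions. The point is that the source stays unchanged: a compiler like Tawa can automatically split the producer work (the tl.load tile fetches) and the consumer work (the tl.dot tensor-core math) across warp groups so that memory movement and MMA overlap on Hopper/Blackwell, without the programmer ever writing warp-role logic by hand.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak,
                  stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)    # producer side: tile fetches a warp-specializing
        b = tl.load(b_ptrs)    # compiler can hand to a dedicated load warp group
        acc += tl.dot(a, b)    # consumer side: tensor-core MMA on another group
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

# Usage (assumes M, N, K are multiples of the block sizes, so no masking):
M = N = K = 1024
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)
grid = (M // 128, N // 128)
matmul_kernel[grid](a, b, c, M, N, K,
                    a.stride(0), a.stride(1),
                    b.stride(0), b.stride(1),
                    c.stride(0), c.stride(1),
                    BLOCK_M=128, BLOCK_N=128, BLOCK_K=64)
```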

Accelerator Design Languages: Allo [PLDI’24] / Dato [arXiv’25] is a Python- and MLIR-based programming language for efficient ML accelerator design. It has been adopted by 10+ universities and companies (Cornell, UCLA, UIUC, Brown, UofT, UVA, UChicago, UIC, Imperial, Tsinghua, SJTU, Intel, AMD, Microsoft). Allo integrates multiple supporting tools, including the PEQC [FPGA’24, 🏆 Best Paper Award] equivalence checker, the HeteroFlow [FPGA’22] dataflow programming framework, and the ARIES [FPGA’25, 🎗️ Best Paper Nominee] backend for AMD AI Engines / NPUs.
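For flavor, here is a minimal Allo-style sketch of the decoupled programming model described in the PLDI’24 paper: the algorithm is plain typed Python, and optimizations are applied afterward as separate, composable schedule primitives. The primitive names and build target below follow the paper's examples but should be read as assumptions rather than a verified snapshot of the current API.

```python
# Minimal Allo-style sketch (primitive names per the PLDI'24 paper;
# exact signatures are assumptions, not the verified current API).
import numpy as np
import allo
from allo.ir.types import int32

def gemm(A: int32[32, 32], B: int32[32, 32]) -> int32[32, 32]:
    C: int32[32, 32] = 0
    for i, j, k in allo.grid(32, 32, 32):
        C[i, j] += A[i, k] * B[k, j]
    return C

s = allo.customize(gemm)   # the algorithm above stays untouched
s.reorder("k", "j")        # loop transformation as a decoupled customization
s.pipeline("j")            # request pipelining of the j loop (HLS-style)
mod = s.build()            # default CPU (LLVM) target; FPGA backends differ

A = np.random.randint(0, 8, (32, 32)).astype(np.int32)
B = np.random.randint(0, 8, (32, 32)).astype(np.int32)
np.testing.assert_array_equal(mod(A, B), A @ B)  # check against NumPy
```

Because each customization is a standalone step on the schedule object, optimizations can be composed, reused across kernels, and verified (e.g., by PEQC) independently of the algorithm.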

Accelerator Architectures for ML:

ML for Systems:

News

Publications

Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover
IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2026 | [abs] | [bib]

Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device
Niansong Zhang, Wenbo Zhu, Courtney Golden, Dan Ilan, Hongzheng Chen, Christopher Batten, Zhiru Zhang
The International Symposium on Microarchitecture (MICRO), 2025 | [abs] | [bib] | News

🎗️ ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines
Jinming Zhuang*, Shaojie Xiang*, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2025 (Best Paper Nominee) | [abs] | [bib]

Allo: A Programming Model for Composable Accelerator Design
Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2024 | [abs] | [bib] | Blog (Zhihu)

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2024 (FCCM’24 Journal Track) | [abs] | [bib] | Blog (Zhihu)

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training
Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, Yida Wang
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 | [abs] | [bib] | Amazon Science

🏆 Formal Verification of Source-to-Source Transformations for HLS
Louis-Noël Pouchet, Emily Tucker, Niansong Zhang, Hongzheng Chen, Debjit Pal, Gabriel Rodríguez, Zhiru Zhang
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2024 (Best Paper Award) | [abs] | [bib]

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
Tianfeng Liu*, Yangrui Chen*, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, Chuanxiong Guo
USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023 | [abs] | [bib]

Accelerator Design with Decoupled Hardware Customizations: Benefits and Challenges
Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini, Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, Zhiru Zhang
ACM/IEEE Design Automation Conference (DAC), 2022 (Invited Paper) | [abs] | [bib]

HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs
Shaojie Xiang, Yi-Hsiang Lai, Yuan Zhou, Hongzheng Chen, Niansong Zhang, Debjit Pal, Zhiru Zhang
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2022 | [abs] | [bib]

Krill: A Compiler and Runtime System for Concurrent Graph Processing
Hongzheng Chen, Minghua Shen, Nong Xiao, Yutong Lu
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021 | [abs] | [bib]

🎗️ FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations
Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, Zhiru Zhang
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2021 (Best Paper Nominee) | [abs] | [bib]

Entropy-Directed Scheduling for FPGA High-Level Synthesis
Minghua Shen, Hongzheng Chen*, Nong Xiao
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2020 | [abs] | [bib]

A Deep-Reinforcement-Learning-Based Scheduler for FPGA HLS
Hongzheng Chen, Minghua Shen
IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019 | [abs] | [bib]

(* indicates equal contribution)

Workshops / Preprints

Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve
Hongzheng Chen, Alexander Novikov, Ngân (NV) Vũ, Hanna Alam, Zhiru Zhang, Aiden Grossman, Mircea Trofin, Amir Yazdanbakhsh
Compilers for Machine Learning Workshop at the International Symposium on Code Generation and Optimization (C4ML @ CGO), 2026 | [abs] | [bib]

Dato: A Task-Based Programming Model for Dataflow Accelerators
Shihan Fang*, Hongzheng Chen*, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang
arXiv:2509.06794, 2025 | [abs] | [bib]

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Hongzheng Chen*, Yingheng Wang*, Yaohui Cai*, Hins Hu*, Jiajie Li*, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang
The 5th Workshop on Mathematical Reasoning and AI at NeurIPS (Math-AI @ NeurIPS), 2025 | [abs] | [bib]

Allo: Catalyzing Accelerator Design and Programming for Machine Learning
Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhiru Zhang
Compilers for Machine Learning Workshop at the International Symposium on Code Generation and Optimization (C4ML @ CGO), 2025 | [abs] | [bib]

🥉 Uncovering Magic with Magic: Schedule Reconstruction from High-Performance Kernel Libraries
Hongzheng Chen
ACM SIGPLAN Conference on Programming Language Design and Implementation Student Research Competition (PLDI SRC), 2024 (Bronze) | [abs] | [bib]

Structured Pruning is All You Need for Pruning CNNs at Initialization
Yaohui Cai, Weizhe Hua, Hongzheng Chen, G. Edward Suh, Christopher De Sa, Zhiru Zhang
arXiv:2203.02549, 2022 | [abs] | [bib]

Education

Cornell University, US
Ph.D. in Computer Science
Aug. 2021 - Present
Thesis: Composable Programming Models for Accelerated Computing
Committee: Zhiru Zhang, Adrian Sampson, Mohamed Abdelfattah
Cumulative GPA: 4.0/4.0
Cornell University, US
M.S. in Computer Science
Aug. 2021 - Dec. 2024
Sun Yat-sen University, China
B.E. in Computer Science
Aug. 2017 - Jun. 2021
Overall GPA: 3.95/4.00 (Major GPA: 3.99/4.00)
Ranking: 1/188

Work Experience

Google DeepMind, Sunnyvale, CA, US
Student Researcher, Compiler Optimization Team
Mentors: Mircea Trofin and Amir Yazdanbakhsh
May 2025 - Dec. 2025
NVIDIA, Redmond, WA, US
Machine Learning Compiler Research Intern, Deep Learning Compiler Technology Team
Mentors: Bin Fan and Vinod Grover
May 2024 - Nov. 2024
Amazon Web Services (AWS), Santa Clara, CA, US
Applied Scientist Intern, Deep Engine-Science Team
Mentors: Cody Hao Yu, Shuai Zheng, and Yida Wang
Aug. 2022 - Apr. 2023
ByteDance AI Lab, Beijing, China
Research Intern, MLSys Team, Applied Machine Learning (AML)
Mentors: Jun He and Yibo Zhu
Aug. 2020 - May 2021

Teaching

Professional Service

Awards & Honors

Scholarship

  • Qualcomm Innovation Fellowship Finalist, Qualcomm, 2025
  • SenseTime Scholarship (awarded to 21 undergraduates in China), SenseTime, 2020
  • Chinese National Scholarship × 2 (Top 1%), Ministry of Education of PRC, 2018-2020
  • First-Prize Scholarship × 3 (Top 5%), Sun Yat-sen University, 2017-2020
  • Samsung Scholarship (Top 1%), Samsung Electronics, 2017-2018

Talks