NSDI’23, a conference focused on networked systems design and implementation, was held in April 2023 in Boston, MA. Thanks to my paper on a GNN training system that I worked on during my undergraduate internship at ByteDance, I was fortunate enough to receive a USENIX travel grant to attend the conference. While I did not have the chance to read all of the papers beforehand, this post summarizes the sessions I attended.
Bingyang Wu and Zili Zhang, Peking University; Zhihao Bai, Johns Hopkins University; Xuanzhe Liu and Xin Jin, Peking University
https://www.usenix.org/conference/nsdi23/presentation/wu
Changho Hwang, KAIST, Microsoft Research; KyoungSoo Park, KAIST; Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong, Microsoft Research
https://www.usenix.org/conference/nsdi23/presentation/hwang
The main idea is to reduce memory access overhead on the GPU.
Tianfeng Liu, Tsinghua University, Zhongguancun Laboratory, ByteDance; Yangrui Chen, The University of Hong Kong, ByteDance; Dan Li, Tsinghua University, Zhongguancun Laboratory; Chuan Wu, The University of Hong Kong; Yibo Zhu, Jun He, and Yanghua Peng, ByteDance; Hongzheng Chen, ByteDance, Cornell University; Hongzhi Chen and Chuanxiong Guo, ByteDance
This work is from my group at ByteDance, and I am fortunate to be part of the team. I think the optimizations we made are very intuitive. We mainly consider sampling-based GNN training. We found that not only does access to the graph structure contribute significantly to end-to-end performance, but fetching node features can also degrade it. Since node features are typically large and stored on remote CPU machines, we implemented a dynamic cache engine on the GPU to avoid repeated reads, together with a simple cache replacement policy for the features.
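To give a flavor of the caching idea, here is a minimal sketch (not the actual BGL cache engine) of a GPU-resident feature cache with a simple FIFO replacement policy; `fetch_from_cpu_store` is a hypothetical callback that reads features from the remote CPU feature store.

```python
# Minimal sketch of a GPU-side feature cache with FIFO replacement.
# This illustrates the idea only; it is not BGL's cache engine, and
# `fetch_from_cpu_store` is a hypothetical stand-in for reading features
# from remote CPU feature servers.
from collections import OrderedDict
import torch

class GPUFeatureCache:
    def __init__(self, num_slots, feat_dim, fetch_from_cpu_store):
        self.buf = torch.empty(num_slots, feat_dim, device="cuda")
        self.slot_of = OrderedDict()        # node_id -> slot, in insertion order
        self.free_slots = list(range(num_slots))
        self.fetch = fetch_from_cpu_store   # callable: LongTensor -> CPU feature rows

    def lookup(self, node_ids):
        """Return a [len(node_ids), feat_dim] GPU tensor of node features."""
        out = torch.empty(len(node_ids), self.buf.shape[1], device="cuda")
        misses, miss_pos = [], []
        for i, nid in enumerate(node_ids.tolist()):
            if nid in self.slot_of:
                out[i] = self.buf[self.slot_of[nid]]       # cache hit
            else:
                misses.append(nid)
                miss_pos.append(i)
        if misses:
            fetched = self.fetch(torch.tensor(misses)).to("cuda", non_blocking=True)
            out[miss_pos] = fetched
            for nid, row in zip(misses, fetched):          # insert with FIFO eviction
                if not self.free_slots:
                    _, victim = self.slot_of.popitem(last=False)
                    self.free_slots.append(victim)
                slot = self.free_slots.pop()
                self.buf[slot] = row
                self.slot_of[nid] = slot
        return out
```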
To ensure locality of the graph structure, we developed a partition algorithm that splits the graph across multiple machines. However, we ran into a tricky issue: locality and randomness can conflict, and changing the way we sample the graph can affect accuracy. To address this, we came up with a simple algorithm that generates BFS sequences (for locality) and shuffles them (for randomness), so that both are maintained.
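Below is a toy sketch of this locality/randomness trade-off, assuming the graph is given as an adjacency list: BFS walks produce locality-preserving blocks of training nodes, and shuffling at the block level restores randomness. This is only an illustration, not BGL's actual assignment algorithm.

```python
# Toy sketch: BFS sequences preserve locality within a block, and
# block-level shuffling keeps enough randomness for SGD. Not BGL's
# actual partition/assignment algorithm.
import random
from collections import deque

def bfs_blocks(adj, train_nodes, block_size):
    """Group training nodes into blocks of BFS-ordered node IDs."""
    visited, blocks, cur = set(), [], []
    for seed in train_nodes:
        if seed in visited:
            continue
        q = deque([seed])
        visited.add(seed)
        while q:
            u = q.popleft()
            if u in train_nodes:
                cur.append(u)
                if len(cur) == block_size:
                    blocks.append(cur)
                    cur = []
            for v in adj.get(u, []):
                if v not in visited:
                    visited.add(v)
                    q.append(v)
    if cur:
        blocks.append(cur)
    return blocks

def epoch_order(adj, train_nodes, block_size, seed=0):
    """Shuffle whole BFS blocks: locality inside a block, randomness across blocks."""
    blocks = bfs_blocks(adj, set(train_nodes), block_size)
    rng = random.Random(seed)
    rng.shuffle(blocks)
    return [n for b in blocks for n in b]
```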
Lastly, different stages may contend for resources and stall the pipeline, so we need to balance them. We formulate an ILP model and solve it optimally to allocate resources across the stages.
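The paper's actual ILP is more detailed; the sketch below (using PuLP, with made-up per-worker throughput numbers) only shows the flavor of such a formulation: allocate a fixed budget of workers so that the bottleneck stage runs as fast as possible.

```python
# Sketch of balancing pipeline stages with an ILP (PuLP). The per-worker
# throughputs are made-up numbers; this is not the paper's actual model.
import pulp

per_worker_tput = {"sample": 50.0, "feature_fetch": 20.0, "train": 80.0}  # items/s
total_workers = 16

prob = pulp.LpProblem("stage_balance", pulp.LpMaximize)
x = {s: pulp.LpVariable(f"workers_{s}", lowBound=1, cat="Integer")
     for s in per_worker_tput}
z = pulp.LpVariable("bottleneck_tput", lowBound=0)

prob += z                                          # objective: maximize bottleneck throughput
prob += pulp.lpSum(x.values()) <= total_workers    # worker budget
for s, t in per_worker_tput.items():
    prob += z <= t * x[s]                          # the pipeline runs at the slowest stage

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({s: int(v.value()) for s, v in x.items()}, "bottleneck:", z.value())
```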
More details can be found in our paper.
Jie You, Jae-Won Chung, and Mosharaf Chowdhury, University of Michigan
The authors claim to have identified a Pareto frontier between energy and performance that can be explored when retraining a model or searching for its best configuration. They even demonstrated their framework live during the presentation using a Stable Diffusion model.
While their work is intriguing, it does face a challenge: reducing energy consumption typically results in lower utilization and performance. Although energy consumption is an important issue, it remains to be seen whether this framework will be widely adopted for trading energy against performance.
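Zeus itself tunes knobs such as the GPU power limit and batch size; the rough sketch below only illustrates the kind of measurement loop behind such a trade-off curve, using pynvml. `train_one_epoch` is a hypothetical training callback, and changing the GPU power limit typically requires administrator privileges.

```python
# Rough sketch (not Zeus's implementation): sweep GPU power limits and
# record (time, energy) points of an energy/performance trade-off.
# `train_one_epoch` is a hypothetical callback supplied by the caller.
import time
import pynvml

def sweep_power_limits(train_one_epoch, power_limits_mw, gpu_index=0):
    """Return (power_limit_mW, epoch_time_s, energy_J) for each setting."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    points = []
    for limit in power_limits_mw:
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit)   # usually needs root
        e0 = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)   # millijoules
        t0 = time.time()
        train_one_epoch()
        elapsed = time.time() - t0
        energy_j = (pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) - e0) / 1e3
        points.append((limit, elapsed, energy_j))
    pynvml.nvmlShutdown()
    return points
```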
Jingrong Chen, Yongji Wu, and Shihan Lin, Duke University; Yechen Xu, Shanghai Jiao Tong University; Xinhao Kong, Duke University; Thomas Anderson, University of Washington; Matthew Lentz, Xiaowei Yang, and Danyang Zhuo, Duke University
Instead of using RPC as a library, this work proposes to use RPC as a system service.
Lei Zhang, Emory University and Princeton University; Zhiqiang Xie and Vaastav Anand, Max Planck Institute for Software Systems; Ymir Vigfusson, Emory University; Jonathan Mace, Max Planck Institute for Software Systems
Tammam Mustafa, MIT; Konstantinos Kallas, University of Pennsylvania; Pratyush Das, Purdue University; Nikos Vasilakis, Brown University
A parallel compiler for shell scripts.
Zongheng Yang, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, Frank Sifei Luan, and Gautam Mittal, UC Berkeley; Scott Shenker, UC Berkeley and ICSI; Ion Stoica, UC Berkeley
I heard many people discussing sky computing during the conference. I am not sure whether people will buy this idea; the engineering effort seems to outweigh the research novelty, and it places substantial requirements on both users and cloud providers.
John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA
Very interesting work. Many people have tried to reduce the bubbles in pipeline parallelism; this work argues that if we cannot eliminate them, we can embrace them instead. Since devices are idle during these bubbles, this approach can effectively reduce energy consumption. I also spoke with the Zeus author, who mentioned that they are exploring a similar idea: reducing energy consumption by slowing the fastest device down to match the slowest one.
I asked the author about usability, and they mentioned that the work is built on DeepSpeed, so there is no need to change the model.
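As a back-of-the-envelope illustration of my own (not from the talk): in a GPipe-style schedule with p pipeline stages and m micro-batches, the idle bubble fraction is (p - 1) / (m + p - 1), so with few micro-batches a sizable share of device time is idle.

```python
# My own back-of-the-envelope, not from the talk: fraction of idle
# "bubble" time in a GPipe-style schedule with p stages and m micro-batches.
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

print(bubble_fraction(p=8, m=16))  # ~0.30: almost a third of device time is idle
```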
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, and Manya Ghobadi, Massachusetts Institute of Technology; Zhihao Jia, Meta and CMU; Dheevatsa Mudigere and Ying Zhang, Meta; Anthony Kewitsch, Telescent
I missed this talk; I will update this section later.
Fan Lai, Yinwei Dai, Harsha V. Madhyastha, and Mosharaf Chowdhury, University of Michigan
https://www.usenix.org/conference/nsdi23/presentation/lai-fan
This work is an interesting exploration of transfer learning, specifically for models with slight architecture differences that make it difficult to reuse parameters from existing pre-trained models. The author proposes a graph matching approach to partially initialize the model, which could potentially reduce training time.
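ModelKeeper's matcher works on the model graph; the sketch below is a much cruder name-and-shape matcher in PyTorch, only to illustrate what partial warm-start initialization looks like.

```python
# Crude sketch of partial warm-start initialization: copy parameters from a
# pre-trained "parent" model wherever names and shapes match. ModelKeeper's
# real approach does structural graph matching; this only illustrates the idea.
import torch

@torch.no_grad()
def warm_start(new_model, parent_state_dict):
    own = new_model.state_dict()
    copied = 0
    for name, tensor in parent_state_dict.items():
        if name in own and own[name].shape == tensor.shape:
            own[name].copy_(tensor)
            copied += 1
    new_model.load_state_dict(own)
    return copied  # number of tensors reused; the rest keep their fresh init
```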
However, I am uncertain about the practicality of this approach. Most models on HuggingFace are quite regular and Transformer-based, so few people modify the architecture; they simply use a pre-trained checkpoint for downstream tasks. This approach might therefore be more useful for neural architecture search (NAS) than for traditional transfer learning.
Hong Zhang, University of Waterloo; Yupeng Tang and Anurag Khandelwal, Yale University; Ion Stoica, UC Berkeley
https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
The key goal is to achieve high goodput (i.e., the number of requests that meet their deadlines per unit time). The observation is that although request arrivals are unpredictable in the short term, they become predictable when aggregated over a longer window. SHEPHERD provides two algorithms to achieve high goodput in DNN serving. Note that this work mainly focuses on small DNNs that fit on a single device; for LLM serving, refer to AlpaServe, which has already been accepted to OSDI'23.
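To make the goodput metric concrete, here is my own small illustration (not SHEPHERD's code):

```python
# My own illustration of the goodput metric: completed requests count only
# if they finish before their deadline.
def goodput(requests, window_s):
    """requests: list of (finish_time_s, deadline_s) pairs within the window."""
    met = sum(1 for finish, deadline in requests if finish <= deadline)
    return met / window_s  # deadline-meeting requests per second
```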
Kshiteej Mahajan, University of Wisconsin - Madison; Ching-Hsiang Chu and Srinivas Sridharan, Facebook; Aditya Akella, UT Austin
Zeying Zhu, Boston University; Kan Wu, University of Wisconsin-Madison; Zaoxing Liu, Boston University
The key idea is to decompose a complicated pattern into multiple simple subpatterns, e.g., the 5-vertex house pattern into a triangle and a square. An LP solver finds the best decomposition, and approximate pattern matching is then performed on the subpatterns. Similar work appeared at ASPLOS'23 (DecoMine).
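Arya's estimators and LP-chosen decompositions are more sophisticated, but here is a toy sketch of the kind of sampling primitive such a system can build on: an approximate triangle count via edge sampling (the adjacency map is assumed to give each node a set of neighbors).

```python
# Toy sketch of sampling-based counting for a simple subpattern (triangles),
# the kind of primitive a decomposition-based estimator can build on. Arya's
# actual estimators are more sophisticated than this.
import random

def approx_triangles(edges, adj, num_samples, seed=0):
    """Unbiased triangle-count estimate: sample edges uniformly, count common
    neighbors of each sampled edge, and rescale. Each triangle has 3 edges,
    hence the division by 3. `adj` maps node -> set of neighbors."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        u, v = rng.choice(edges)
        total += len(adj[u] & adj[v])          # triangles closing this edge
    return total * len(edges) / (3 * num_samples)
```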
Junxue Zhang and Xiaodian Cheng, iSINGLab at Hong Kong University of Science and Technology and Clustar; Wei Wang, Clustar; Liu Yang, iSINGLab at Hong Kong University of Science and Technology and Clustar; Jinbin Hu and Kai Chen, iSINGLab at Hong Kong University of Science and Technology
They identify several important operators for cross-silo FL and implement them on an FPGA. They use Verilog to implement the on-chip controller, which is a hardened block. They use an FPGA, but not for its reconfigurability; when the design is simulated on an ASIC, it achieves even more speedup.
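If I recall correctly, much of the cost in cross-silo FL comes from partially homomorphic encryption such as Paillier, where modular exponentiation dominates. The toy Paillier sketch below is my own illustration of why hardware offload is attractive, not FLASH's actual operator set, and the tiny hard-coded primes are of course insecure.

```python
# Toy Paillier cryptosystem (my own illustration, not FLASH's operators).
# The expensive steps are the modular exponentiations in encrypt/decrypt,
# which is what makes hardware offload attractive. Tiny primes: insecure.
import math, random

p, q = 10007, 10009                    # toy primes, far too small for real use
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                   # valid because we pick g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2   # two modular exponentiations

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = encrypt(20), encrypt(22)
assert decrypt(a * b % n2) == 42       # additively homomorphic: Enc(x)*Enc(y) = Enc(x+y)
```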
Chieh-Jan Mike Liang, Microsoft Research; Zilin Fang, Carnegie Mellon University; Yuqing Xie, Tsinghua University; Fan Yang, Microsoft Research; Zhao Lucis Li, University of Science and Technology of China; Li Lyna Zhang, Mao Yang, and Lidong Zhou, Microsoft Research
https://www.usenix.org/conference/nsdi23/presentation/liang-chieh-jan