This is a guidance page for the MLSys resources I know of. I’ll give a brief introduction to each of them and list the links here. The resources include books, papers, and notes I wrote.

Books

AI System

This book is more about the hardware; I think it’s aimed a bit more at ECE students. I haven’t read it all yet, but you can find some useful topics here, such as the introduction to Nvidia GPUs, Tensor Cores, streaming multiprocessors, and how the GPU actually accelerates computation. It’s an intro-level course, so you don’t need any background. However, it includes a lot of history, which doesn’t seem that useful, so I think you can just go to the specific chapter you want instead of reading it from beginning to end.

Another reason I recommend this book is that it’s open source, and you can contribute to it. I’m actually one of the top 10 contributors. It’s also a really good project for practicing your Git skills in a real-world working environment.

The book isn’t very polished yet, so you may see some errors in the web version. The way I read it is to git clone the repo locally and use the VS Code Markdown preview to read the .md files.

Here is the link: https://github.com/chenzomi12/aisystem

AI Infrastructure

This is another book from the same author, and its content, I think, is closer to what we systems people actually do. It’s less about the hardware and focuses more on distributed training/inference systems.

Here is the link: https://github.com/chenzomi12/AIInfra

Papers

My notes for the papers will be here, but I created the blog site only recently, so some of my earlier notes aren’t on it yet. Still, I list the corresponding links below. If you see something interesting and it happens to have a link to my notes, you can take a look.

Inference System

Orca

This is the paper that Prof. Ma gave us. I don’t think anyone uses Orca itself anymore, but you can learn some basic concepts from it, such as iteration-level scheduling (continuous batching). Also, it’s one of the baselines that the vLLM paper compares against.

The paper: https://www.usenix.org/system/files/osdi22-yu.pdf

My notes: https://s-tanley.github.io/blogs/2025/05/15/Orca/

vLLM

It’s one of the most popular inference serving systems nowadays; the core idea is PagedAttention, which manages the KV cache in fixed-size blocks, much like virtual-memory paging in an OS (see the sketch after the links below).

The paper: https://arxiv.org/abs/2309.06180

My notes: https://s-tanley.github.io/blogs/2025/05/27/vLLM/
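
To give a flavor of the idea, here is a minimal, self-contained Python sketch of the block-table bookkeeping behind PagedAttention. The names (`BlockManager`, `BLOCK_SIZE`, and so on) are my own illustration, not the actual vLLM API, and a real system also handles block sharing, preemption, and swapping.

```python
# Sketch of PagedAttention-style bookkeeping (illustrative, not the real vLLM API):
# the KV cache is split into fixed-size blocks, and each sequence keeps a
# "block table" mapping its logical blocks to physical blocks, so sequences
# don't need contiguous memory and can grow one block at a time.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        """Allocate a new physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # all current blocks are full
            if not self.free_blocks:
                raise RuntimeError("out of KV-cache blocks; a real system would preempt or swap")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


if __name__ == "__main__":
    mgr = BlockManager(num_physical_blocks=8)
    for t in range(40):  # decode 40 tokens for one request
        mgr.append_token("req-0", t)
    print(mgr.block_tables["req-0"])  # 3 physical blocks cover 40 tokens (ceil(40/16))
    mgr.free("req-0")
```

The point is that memory is allocated block by block on demand rather than reserved up front for the maximum sequence length, which is where most of vLLM’s throughput gain comes from.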

SGLang

It’s the other most popular inference serving system, and I think many people now use SGLang instead of vLLM.

The paper: https://arxiv.org/abs/2312.07104

Music

ReaLchords

The paper: https://openreview.net/pdf?id=mUVydzrkgz

Pop Music Transformer (REMI)

The paper: https://arxiv.org/abs/2002.00212

Structured Multi-Track Accompaniment Arrangement

The paper: https://arxiv.org/abs/2310.16334

MoE

For now, these are papers recommended by another professor.

You can also see my notes here. I’ve also written a report, which I think is fairly concise.

MoE-Infinity

The paper: http://arxiv.org/abs/2401.14361

ProMoE

The paper: https://doi.org/10.48550/arXiv.2410.22134

TS-MoE

The paper: https://doi.org/10.48550/arXiv.2206.00277

DeepSpeed-FastGen

The paper: https://doi.org/10.48550/arXiv.2401.08671

DeepUM

The paper: https://doi.org/10.1145/3575693.3575736

GShard

The paper: https://doi.org/10.48550/arXiv.2006.16668

MC-SMoE

The paper: https://doi.org/10.48550/arXiv.2310.01334

Mu Li’s recommendations

Mu Li is a really great researcher, and he has published lots of useful resources online, which you can access on Bilibili or YouTube.

His paper-reading series is really useful; it was like my 101 course for doing research.

He also has a GitHub repo for this list.

I also casually take some notes when I watch the videos, but most of them are pretty rough; I don’t think even I will read them again. Anyway, I also created a repo for my notes, in case you are interested.