Reading Notes for FasterMoE

Created 2025-03-28 | Updated 2025-03-30 | Reading Paper


Summary

Abstract & Introduction & Background and Challenges

The opening is again a brief introduction to MoE, much the same as in other papers.

This paper is also about training, and it names three challenges:

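dynamic load imbalance

In the intro this is called "Dynamic expert selection". It is fairly self-explanatory: the experts that get selected are different each time.

inefficient synchronous execution mode

In the intro this is called "Inefficient synchronous operations". Experts have dependencies, so a worker needs data from other workers and has to wait for it.

congested all-to-all communication

In the intro this is called "Mismatch of model design and network topology". My reading is that current systems only look at the computation load when placing experts, and ignore the communication between experts.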
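To make the first challenge concrete, here is a minimal sketch of how uneven expert selection shows up as uneven load. The token-to-expert assignments and the `imbalance_ratio` metric are made up for illustration; they are not taken from the paper.

```python
from collections import Counter


def tokens_per_expert(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_ratio(loads):
    """Busiest expert's load divided by the mean load (1.0 means perfectly balanced)."""
    mean_load = sum(loads) / len(loads)
    return max(loads) / mean_load


# Hypothetical routing decisions for 8 tokens over 4 experts in two iterations.
iteration_a = [0, 0, 0, 0, 0, 1, 2, 3]  # expert 0 is the hot one
iteration_b = [0, 1, 2, 2, 2, 2, 2, 3]  # next time, expert 2 is the hot one

for name, assignment in (("a", iteration_a), ("b", iteration_b)):
    loads = tokens_per_expert(assignment, num_experts=4)
    print(name, loads, imbalance_ratio(loads))
```

In both iterations one expert receives five of the eight tokens, but it is a different expert each time, so a fixed placement cannot simply give the hot expert more resources in advance.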

From the abstract, it seems the paper is still mainly about solving communication problems.

The first part of the intro again spends a long time on background, and includes a figure:

[Figure: screenshot from the paper]

They propose a precise performance model that predicts latency offline, based on the MoE model and the system configuration.

There are three methods, each addressing one of the problems above: dynamic shadowing, a fine-grained smart scheduling strategy, and a congestion-avoiding expert selection strategy.

The contributions follow the classic pattern: a performance model, a roofline-like model, the three methods above, and finally a system that combines them all. Six contributions in total.
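As a reminder of what "roofline" refers to, below is a minimal sketch of the classic roofline estimate, where attainable throughput is capped either by peak compute or by bandwidth times arithmetic intensity. This is the textbook form, not the paper's roofline-like model, and every hardware number is a hypothetical placeholder.

```python
def attainable_flops(peak_flops, bandwidth, flops, bytes_moved):
    """Classic roofline: min(peak compute, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved  # FLOPs performed per byte moved
    return min(peak_flops, bandwidth * intensity)


def estimated_time(peak_flops, bandwidth, flops, bytes_moved):
    """Lower-bound execution time implied by the roofline."""
    return flops / attainable_flops(peak_flops, bandwidth, flops, bytes_moved)


# Hypothetical device: 10 TFLOP/s peak compute, 100 GB/s bandwidth.
PEAK = 10e12
BW = 100e9

# Assumed compute-heavy workload: 2 TFLOP of work moving 1 GB of data.
compute_bound = estimated_time(PEAK, BW, flops=2e12, bytes_moved=1e9)

# Assumed data-heavy workload: 1 GFLOP of work moving 1 GB of data.
bandwidth_bound = estimated_time(PEAK, BW, flops=1e9, bytes_moved=1e9)

print(compute_bound, bandwidth_bound)
```

The estimate is equivalent to taking the larger of the pure compute time and the pure data-movement time, which is why it is useful for telling which of the two limits a given workload.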

[Figure: screenshot from the paper]

Yet another diagram of the transformer block structure.

[Figure: screenshot from the paper]

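This paper again says that several experts can be selected. For now I still feel that a token can only select one expert; maybe what the paper means is that different tokens within a sequence use different experts. I am a bit confused here.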
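For reference, in the commonly used top-k gating formulation a single token is sent to the k highest-scoring experts and their outputs are combined as a weighted sum. The sketch below assumes that formulation; the scalar "experts", the gate scores, and the renormalization over the selected experts are illustrative choices, and these notes do not confirm that FasterMoE uses exactly this form.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the chosen k
    return [(i, probs[i] / norm) for i in top]


def moe_output(token, gate_scores, experts, k):
    """Weighted sum of the outputs of the k experts selected for ONE token."""
    return sum(w * experts[i](token) for i, w in top_k_gate(gate_scores, k))


# Toy scalar experts, purely hypothetical.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]

# One token, one set of gate scores, two experts selected.
print(top_k_gate([2.0, 1.0, -1.0], k=2))
print(moe_output(3.0, [2.0, 1.0, -1.0], experts, k=2))
```

With k=1 this reduces to the "one token, one expert" picture; with k=2 the same token is processed by two experts, which is separate from different tokens in a sequence picking different experts.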

The three challenges are then described again in more detail.

[Figure: screenshot from the paper]

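Figure 4 is mainly about the first challenge, the imbalanced assignment problem.

The second is communication. We generally prefer to make things as asynchronous as possible, but all-to-all communication carries dependencies, so it is hard to make asynchronous.

The paper goes over the third one again, but I still did not fully understand it. The only thing I took away is that expert assignment does have room for optimization.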

Performance Modeling

Model-Guided Optimization Approaches

Thoughts

Author: Stanley Zheng

Link: https://s-tanley.github.io/blogs/2025/03/28/FasterMoE/

Copyright Notice: All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
