The All-to-All Communication Mechanism in MoE

Created 2025-04-07 | Updated 2025-04-07 | Research Blogs

This post introduces the All-to-All communication mechanism and explains why MoE models need it.

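I. What Is All-to-All?

In a distributed Mixture-of-Experts (MoE) model, All-to-All is the communication operation
that moves token data between GPUs so that each token reaches the experts that should process it.

Every GPU holds a share of the input tokens, while the experts are spread across different GPUs.
The gate network decides which experts should process each token, so tokens have to be sent dynamically to the GPUs that host their target experts.

This is exactly the All-to-All pattern: every GPU sends data to every other GPU and also receives data from every other GPU.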
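
To make the pattern concrete, here is a minimal single-process sketch that simulates All-to-All with plain Python lists. The function name `all_to_all` and the nested-list layout are assumptions made for illustration; a real system would call a collective from a communication library instead. The point it shows is that All-to-All is a transpose of the per-rank send buckets.

```python
def all_to_all(send):
    """Simulate All-to-All among n ranks inside one process.

    send[i][j] is the list of items rank i sends to rank j.
    Returns recv, where recv[j][i] is what rank j received from rank i.
    """
    n = len(send)
    assert all(len(row) == n for row in send), "each rank needs one bucket per peer"
    return [[list(send[i][j]) for i in range(n)] for j in range(n)]


# Three ranks; the chunk "i->j" travels from rank i to rank j.
send = [[[f"{i}->{j}"] for j in range(3)] for i in range(3)]
recv = all_to_all(send)
print(recv[2])  # [['0->2'], ['1->2'], ['2->2']]
```

Applying the same exchange to `recv` returns the original `send`, which is why the combine step of an MoE layer can reuse the same operation to bring results home.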

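II. Why Do MoE Models Need All-to-All?

1. Experts are local, but tokens are global

Each expert's parameters are local: they live on one particular GPU.
Tokens, however, are partitioned by data parallelism and spread across all GPUs.
The gate result for any token may point to an expert on any GPU.

A token therefore has to be sent across devices to the expert it selected, and this is what produces All-to-All communication.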
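
As a rough illustration of how gate decisions turn into traffic, the sketch below builds the matrix of token counts each GPU must send to each other GPU. The function `send_counts` and the placement rule (expert `e` lives on GPU `e // experts_per_gpu`) are assumptions for this example, and routing is simplified to top-1.

```python
def send_counts(assignments, num_gpus, experts_per_gpu):
    """Count how many tokens each GPU must send to each GPU.

    assignments[g] lists the expert id chosen (top-1) for each token held by GPU g.
    Assumed placement: expert e lives on GPU e // experts_per_gpu.
    Returns counts, where counts[src][dst] is the number of tokens src sends to dst.
    """
    counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src, experts in enumerate(assignments):
        for e in experts:
            counts[src][e // experts_per_gpu] += 1
    return counts


# 2 GPUs with 2 experts each: experts 0 and 1 on GPU 0, experts 2 and 3 on GPU 1.
assignments = [[0, 3, 2, 3], [1, 1, 2, 0]]
counts = send_counts(assignments, num_gpus=2, experts_per_gpu=2)
print(counts)  # [[1, 3], [3, 1]]
```

The off-diagonal entries are the tokens that cross devices. Because the assignments change with every batch, so does this matrix.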

2. Forward and backward each need two All-to-All calls

Forward:
- Dispatch (token -> expert)
- Combine (expert -> the token's original device)
Backward:
- Gradient dispatch (token grad -> expert)
- Gradient combine (expert grad -> token)

That is four All-to-All calls in total for each MoE layer in one training step.
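
The forward half of this sequence can be sketched in a single process as follows. This is a simplified illustration under stated assumptions, not a real implementation: one expert per GPU (expert `e` lives on GPU `e`), top-1 routing, and the hypothetical helpers `all_to_all` and `moe_forward`. The backward pass would perform the same two exchanges on gradients and is not shown.

```python
def all_to_all(send):
    """recv[j][i] is what rank j received from rank i (a transpose of send)."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


def moe_forward(tokens, expert_of, expert_fns):
    """tokens[g]: values held by GPU g; expert_of[g][t]: expert chosen for token t."""
    n = len(tokens)
    # Dispatch: bucket each token by destination GPU and remember its local position.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for g in range(n):
        for pos, (x, e) in enumerate(zip(tokens[g], expert_of[g])):
            send[g][e].append((pos, x))
    recv = all_to_all(send)  # All-to-All #1: token -> expert

    # Expert computation on the GPU that hosts the expert.
    out = [[[(pos, expert_fns[e](x)) for pos, x in bucket] for bucket in recv[e]]
           for e in range(n)]

    back = all_to_all(out)  # All-to-All #2: expert -> the token's original device
    # Combine: put each result back at the token's original position.
    result = [[None] * len(tokens[g]) for g in range(n)]
    for g in range(n):
        for bucket in back[g]:
            for pos, y in bucket:
                result[g][pos] = y
    return result


tokens = [[1, 2], [3, 4]]
expert_of = [[0, 1], [1, 0]]
expert_fns = [lambda x: x * 10, lambda x: x + 100]
result = moe_forward(tokens, expert_of, expert_fns)
print(result)  # [[10, 102], [103, 40]]
```

Carrying the local position along with each token is one simple way to restore the original order after the combine step; real systems typically keep index tensors for the same purpose.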

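III. Why Not Send Each Token to the Right GPU from the Start?

1. The gate is computed dynamically

The gate network takes as input the hidden state produced by the layer before the MoE layer.
Only when the forward pass reaches the MoE layer can we compute which experts each token should go to.
At the start of the model, therefore, there is no way to know in advance which GPU hosts a token's target experts.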
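
A small sketch of why routing cannot be known ahead of time: the gate below is a plain linear layer followed by softmax and top-k selection, so its output depends entirely on the hidden state it is given. The function `gate_top_k` and the toy weights are assumptions for illustration; real gates often add noise, capacity limits, or auxiliary losses.

```python
import math


def gate_top_k(hidden, gate_weights, k):
    """Return the k experts with the highest gate probability for one token.

    hidden: the token's hidden state (list of floats).
    gate_weights: one weight vector per expert (a linear gate without bias).
    """
    logits = [sum(h * w for h, w in zip(hidden, w_e)) for w_e in gate_weights]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [v / total for v in exps]
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    return top, [probs[e] for e in top]


# Four experts, hidden size 2 (toy values).
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(gate_top_k([2.0, 0.5], gate_weights, k=2)[0])   # [2, 0]
print(gate_top_k([-1.0, 3.0], gate_weights, k=2)[0])  # [1, 2]
```

The two hidden states select different experts under the same gate weights, and the hidden state itself only exists once the preceding layers have run.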

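2. Pre-distribution is unbalanced and complex

Attempting to pre-distribute tokens can leave GPU load unbalanced.
It would require building a global routing table ahead of time, which complicates the system.
It also gives up the flexibility of runtime All-to-All routing, which moves each token to wherever its expert is without a precomputed plan.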

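IV. Why Does All-to-All Block Computation?

All-to-All is a collective communication: every participating device must take part, and no device has its complete input until its peers have sent their data.
Expert computation cannot start before the communication finishes, which creates a strong synchronization dependency.
As a result, All-to-All usually "blocks" the expert forward or backward computation that follows it.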

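V. Why Is All-reduce Easier to Overlap with Computation?

Communication type	Pattern	Data regularity	Depends on gate?	Blocks computation?
All-reduce	Regular communication (gradient aggregation)	Fixed size	No	No, can be overlapped
All-to-All	Dynamic communication (token routing)	Uneven	Yes	Yes, strong dependency

All-reduce uses ring or tree algorithms and can be split into many small communication steps that run in parallel with computation. In data-parallel training, for example, the gradients of one layer can be reduced while the backward pass is still computing gradients for earlier layers.
All-to-All traffic is uneven, and the expert computation that follows cannot start until the result arrives, so it is hard to overlap.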

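VI. Examples of System-Level Optimizations

Lina:

Uses tensor partitioning to split large tensors into micro-ops.
Uses priority scheduling so that All-to-All gets bandwidth first, while All-reduce runs when the link is idle.
Pipelines communication with expert computation.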
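
The benefit of splitting a tensor and pipelining it can be estimated with a back-of-the-envelope timing model. The sketch below is an idealized two-stage pipeline, not Lina's scheduler: the function `pipelined_time` and the millisecond figures are hypothetical, chunks are assumed to split evenly, and per-chunk launch overhead is ignored.

```python
def pipelined_time(comm_ms, compute_ms, num_chunks):
    """Idealized two-stage pipeline: chunk i computes while chunk i+1 is in flight.

    comm_ms and compute_ms are totals for the whole tensor. Both are assumed
    to split evenly across chunks, with no per-chunk launch overhead.
    """
    c = comm_ms / num_chunks
    e = compute_ms / num_chunks
    # First chunk's transfer, then the slower stage paces the middle chunks,
    # then the last chunk's expert computation.
    return c + (num_chunks - 1) * max(c, e) + e


print(pipelined_time(4.0, 6.0, num_chunks=1))  # 10.0 (no overlap)
print(pipelined_time(4.0, 6.0, num_chunks=4))  # 7.0
```

In this model the total approaches the larger of the two costs as the number of chunks grows. In practice, launch overhead and poor bandwidth utilization for small messages limit how finely a tensor can usefully be split.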

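NetMoE:

Does not optimize the communication itself. Instead, it dynamically adjusts sample placement so that the token distribution better matches the expert distribution.
This reduces cross-node traffic and relieves All-to-All pressure at the system level.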
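
The idea of matching samples to experts can be illustrated with a toy search. This is a brute-force sketch under strong assumptions, not NetMoE's actual formulation or solver: exactly one sample per GPU, a known table of where each sample's tokens are routed, and a hypothetical function `best_placement` that tries every permutation.

```python
from itertools import permutations


def best_placement(dest_counts):
    """Pick the sample-to-GPU placement that keeps the most tokens local.

    dest_counts[s][g]: number of tokens in sample s routed to experts on GPU g.
    Exactly one sample is placed on each GPU (toy assumption).
    Returns (placement, remote_tokens), where placement[s] is the GPU for sample s.
    """
    n = len(dest_counts)
    total = sum(sum(row) for row in dest_counts)
    best = max(permutations(range(n)),
               key=lambda p: sum(dest_counts[s][p[s]] for s in range(n)))
    local = sum(dest_counts[s][best[s]] for s in range(n))
    return list(best), total - local


dest_counts = [[1, 5, 2], [6, 1, 1], [2, 2, 4]]
print(best_placement(dest_counts))  # ([1, 0, 2], 9)
```

With the default placement (sample `s` on GPU `s`) only 6 of the 24 tokens stay local, so 18 cross devices; the chosen placement halves that to 9. Brute force is only workable for tiny cases like this one.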

FasterMoE:

Proposes Dynamic Shadowing, which replicates hot experts to reduce communication load.
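
The effect of replicating a hot expert can be shown by counting the tokens that must leave their GPU. The sketch below is a simplified illustration, not FasterMoE's algorithm for deciding when to shadow: the function `remote_tokens` is hypothetical, routing is top-1, and a shadowed expert is assumed to be available on every GPU.

```python
def remote_tokens(assignments, expert_gpu, shadowed=()):
    """Count tokens that must be sent to another GPU.

    assignments[g]: expert ids chosen by the tokens on GPU g (top-1).
    expert_gpu[e]: home GPU of expert e.
    shadowed: experts replicated on every GPU, so their tokens stay local.
    """
    remote = 0
    for g, experts in enumerate(assignments):
        for e in experts:
            if e not in shadowed and expert_gpu[e] != g:
                remote += 1
    return remote


# Expert 0 is "hot": most tokens on every GPU choose it.
assignments = [[0, 0, 1], [0, 0, 0], [0, 2, 0]]
expert_gpu = [0, 1, 2]
print(remote_tokens(assignments, expert_gpu))                # 6
print(remote_tokens(assignments, expert_gpu, shadowed={0}))  # 1
```

Replication is not free: the replica's parameters have to be shipped to the other GPUs and its gradients aggregated afterwards, a cost this count ignores. Shadowing pays off only when the token traffic saved outweighs that cost.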

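VII. Summary

All-to-All is an unavoidable communication operation in distributed MoE models, for three reasons: tokens are distributed, experts are distributed, and the gate is dynamic.
Current system optimizations focus mainly on communication scheduling, expert replication, and data redistribution.
The key to understanding why All-to-All is necessary is the dynamic mapping between tokens and experts.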

Author: Stanley Zheng

Link: https://s-tanley.github.io/blogs/2025/04/07/MoE 中 All-to-All 通信机制/

Copyright Notice: All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
