AllReduce & Bucketing

Created 2025-04-07 | Updated 2025-04-07 | Research Blogs


This post explains what AllReduce and Bucketing are, and how the two relate to each other.

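1. What is AllReduce?

AllReduce is a collective communication operation used in distributed training
to synchronize tensors (usually gradients) across multiple GPUs (workers).

A typical flow:

Each GPU independently computes its own gradient tensor (e.g. grad).
All GPUs run AllReduce, which sums or averages their tensors so that every GPU ends up with the same global gradient.
Each GPU updates the model parameters with this synchronized gradient.

AllReduce is the key mechanism that keeps model replicas in sync during data-parallel training.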
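
The three steps above can be sketched in plain Python. This is only a single-process simulation under stated assumptions: `all_reduce_mean` is a hypothetical stand-in for a real collective, and the gradient values, learning rate, and initial parameters are made up for illustration.

```python
from typing import List


def all_reduce_mean(per_worker: List[List[float]]) -> List[List[float]]:
    """Simulate AllReduce(mean): every worker ends up with the same averaged tensor."""
    num_workers = len(per_worker)
    # Element-wise sum across workers, then divide by the worker count.
    averaged = [sum(vals) / num_workers for vals in zip(*per_worker)]
    # Each worker receives its own copy of the identical result.
    return [list(averaged) for _ in range(num_workers)]


# Step 1: each worker computes a local gradient (hypothetical values).
local_grads = [
    [1.0, 2.0, 3.0],  # worker 0
    [3.0, 2.0, 1.0],  # worker 1
    [2.0, 2.0, 2.0],  # worker 2
]

# Step 2: AllReduce leaves the same averaged gradient on every worker.
synced = all_reduce_mean(local_grads)

# Step 3: every worker applies the same SGD update, so the replicas stay identical.
lr = 0.5
params = [[10.0, 10.0, 10.0] for _ in local_grads]
params = [
    [p - lr * g for p, g in zip(worker_params, worker_grad)]
    for worker_params, worker_grad in zip(params, synced)
]
print(params[0])
```

Because every worker starts from the same parameters and applies the same synchronized gradient, the replicas never drift apart.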

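2. Why does AllReduce become a performance bottleneck?

A model has many parameters, and therefore many gradient tensors.
If every tensor is all-reduced on its own, the number of communication calls becomes very large.
Small messages cannot saturate the available bandwidth, and every call pays a fixed startup latency, so frequent launches add significant overhead.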
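
A rough way to see the effect is a simplified cost model in which one collective costs a fixed startup latency plus message size divided by bandwidth. The sketch below assumes that model; the latency, bandwidth, and tensor sizes are hypothetical numbers chosen for illustration, not measurements, and real AllReduce algorithms have more detailed cost formulas.

```python
def allreduce_time(num_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Simplified cost of one collective: fixed startup latency plus transfer time."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


# Hypothetical numbers, for illustration only.
LATENCY_S = 20e-6      # assumed 20 microseconds of startup cost per call
BANDWIDTH = 10e9       # assumed 10 GB/s effective bandwidth
NUM_TENSORS = 1000     # many small gradient tensors
TENSOR_BYTES = 4096    # 4 KiB each

# One AllReduce per tensor: the startup latency is paid 1000 times.
separate = NUM_TENSORS * allreduce_time(TENSOR_BYTES, LATENCY_S, BANDWIDTH)

# One AllReduce over all the data: the startup latency is paid once.
fused = allreduce_time(NUM_TENSORS * TENSOR_BYTES, LATENCY_S, BANDWIDTH)

print(f"separate: {separate * 1e3:.3f} ms, fused: {fused * 1e3:.3f} ms")
```

Under these assumed numbers the per-tensor variant is dominated almost entirely by startup latency, while the fused variant is dominated by actual data transfer.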

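3. What is Bucketing?

Bucketing is a strategy for making AllReduce communication more efficient:
several small tensors are packed into one large "bucket", and a single AllReduce is run on the whole bucket.

Core idea: Batch Small Reduces into One Large Reduce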

Example:

Without bucketing:
- grad1 → AllReduce
- grad2 → AllReduce
- grad3 → AllReduce
With bucketing:
- [grad1, grad2, grad3] → packed into one bucket → a single AllReduce
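
The example can be made concrete with a small simulation: flatten the gradients into one buffer, reduce once, then split the result back into the original shapes. This is a minimal single-process sketch; `all_reduce_sum`, `flatten`, and `unflatten` are hypothetical helpers written for this post, not functions from any library.

```python
from typing import List

calls = {"n": 0}  # counts how many simulated AllReduce calls were issued


def all_reduce_sum(per_worker: List[List[float]]) -> List[float]:
    """Simulated AllReduce(sum) over one flat buffer per worker."""
    calls["n"] += 1
    return [sum(vals) for vals in zip(*per_worker)]


def flatten(tensors: List[List[float]]) -> List[float]:
    """Pack several small tensors into one contiguous bucket."""
    return [x for t in tensors for x in t]


def unflatten(flat: List[float], sizes: List[int]) -> List[List[float]]:
    """Split a reduced bucket back into tensors of the original sizes."""
    out, offset = [], 0
    for n in sizes:
        out.append(flat[offset:offset + n])
        offset += n
    return out


# Two workers, each holding grad1, grad2, grad3 (hypothetical values).
worker0 = [[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]]
worker1 = [[10.0], [20.0, 30.0], [40.0, 50.0, 60.0]]
sizes = [len(t) for t in worker0]

# Without bucketing: one AllReduce per tensor.
calls["n"] = 0
unbucketed = [all_reduce_sum([g0, g1]) for g0, g1 in zip(worker0, worker1)]
calls_unbucketed = calls["n"]

# With bucketing: pack, reduce once, unpack.
calls["n"] = 0
reduced_bucket = all_reduce_sum([flatten(worker0), flatten(worker1)])
bucketed = unflatten(reduced_bucket, sizes)
calls_bucketed = calls["n"]

print(calls_unbucketed, calls_bucketed)
```

Both paths produce the same reduced gradients; only the number of communication calls differs.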

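4. Advantages of Bucketing

Advantage	Explanation
Fewer communication launches	Many small tensors trigger AllReduce frequently, and every call adds startup latency
Higher bandwidth utilization	Large messages get closer to the theoretical bandwidth limit, so transfers are more efficient
Easier communication/computation overlap	A bucket can start communicating as soon as its tensors are ready, while the rest of the backward pass is still running, which improves overall throughput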
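
The overlap benefit in the last row can be illustrated with a toy timeline. The sketch assumes communication runs on a single serialized channel that works concurrently with computation; `finish_time` is a hypothetical helper, and the millisecond values are invented for illustration.

```python
from typing import List


def finish_time(ready_times: List[float], comm_times: List[float]) -> float:
    """Time at which the last bucket's AllReduce completes.

    ready_times[i]: when all gradients of bucket i have been computed.
    comm_times[i]:  how long the AllReduce for bucket i takes.
    Communication is serialized on one channel but overlaps with compute.
    """
    channel_free = 0.0
    for ready, comm in zip(ready_times, comm_times):
        # A bucket starts once it is ready AND the channel is idle.
        start = max(ready, channel_free)
        channel_free = start + comm
    return channel_free


# Hypothetical backward pass: 4 buckets, each becomes ready 10 ms after the
# previous one, and each takes 8 ms to all-reduce.
ready = [10.0, 20.0, 30.0, 40.0]
comm = [8.0, 8.0, 8.0, 8.0]

# With overlap: each bucket is reduced while later gradients are still being computed.
overlapped = finish_time(ready, comm)

# Without overlap: wait for the whole backward pass, then reduce everything.
sequential = ready[-1] + sum(comm)

print(overlapped, sequential)
```

In this toy setup only the last bucket's communication is exposed; the earlier buckets are hidden behind the remaining backward computation.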

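5. Practical use in PyTorch and other systems

PyTorch DDP (DistributedDataParallel):

Bucketing is enabled by default.
The bucket_cap_mb parameter controls the bucket size (25 MiB by default).
As soon as all gradients in a bucket are ready, DDP launches one AllReduce for that bucket.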
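
To give a feel for how a size cap turns a parameter list into buckets, here is a simplified greedy sketch. It is not DDP's actual implementation, which applies additional rules; `assign_buckets` and the parameter names and sizes are hypothetical. Parameters are walked in reverse order on the assumption that gradients become ready roughly back to front during the backward pass.

```python
from typing import List, Tuple


def assign_buckets(param_sizes: List[Tuple[str, int]], bucket_cap_mb: float) -> List[List[str]]:
    """Greedy sketch: walk parameters in reverse order and start a new bucket
    whenever adding the next parameter would exceed the size cap."""
    cap_bytes = int(bucket_cap_mb * 1024 * 1024)
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in reversed(param_sizes):
        if current and current_bytes + nbytes > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


MIB = 1024 * 1024
# Hypothetical model: (parameter name, gradient size in bytes), in forward order.
param_sizes = [
    ("embed.weight", 20 * MIB),
    ("layer1.weight", 10 * MIB),
    ("layer1.bias", 1 * MIB),
    ("layer2.weight", 10 * MIB),
    ("layer2.bias", 1 * MIB),
    ("head.weight", 5 * MIB),
]

buckets = assign_buckets(param_sizes, bucket_cap_mb=25)
for index, names in enumerate(buckets):
    print(index, names)
```

In real PyTorch code the cap is simply passed to the wrapper, e.g. `DistributedDataParallel(model, bucket_cap_mb=25)`; the bucket assignment itself is handled internally.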

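DeepSpeed, Megatron, etc.:

These systems often design multi-level buckets, for example partitioned by tensor type (weights/biases) or by position (layer-wise).
Some also combine bucketing with streaming-style scheduling and asynchronous communication.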

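6. Bucketing and MoE

In MoE training, expert parameters (expert weights) usually do not need AllReduce, because each expert is local to the GPU that hosts it.
Non-expert parameters (Attention, LayerNorm, etc.) are still shared across workers and must be synchronized with AllReduce.
Bucketing applies equally to the communication of these parameters.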
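
One way to picture this split is to partition the parameter list before bucketing, so that only the shared parameters ever enter a bucket. The sketch below assumes a hypothetical naming convention in which expert parameters contain ".experts." in their name; `split_params` and all parameter names are made up for illustration and do not come from any particular framework.

```python
from typing import Callable, List, Tuple


def split_params(
    param_names: List[str], is_expert: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Separate shared parameters (need AllReduce) from expert-local ones (do not)."""
    shared = [n for n in param_names if not is_expert(n)]
    expert_local = [n for n in param_names if is_expert(n)]
    return shared, expert_local


# Hypothetical parameter names for one MoE layer.
names = [
    "layers.0.attn.qkv.weight",
    "layers.0.norm.weight",
    "layers.0.moe.gate.weight",
    "layers.0.moe.experts.3.fc1.weight",
    "layers.0.moe.experts.3.fc2.weight",
]

# Assumed convention: anything under ".experts." is expert-local.
shared, expert_local = split_params(names, lambda n: ".experts." in n)

print("bucket and all-reduce:", shared)
print("keep local:", expert_local)
```

Only the `shared` list would then be handed to the bucketing logic; the expert-local gradients stay on their own GPU.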

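7. Summary

AllReduce is the core communication operation that keeps parameters synchronized in data-parallel training.
Bucketing is a key optimization for AllReduce efficiency, and it is especially useful when there are many tensors and many of them are small.
The two work together: Bucketing makes AllReduce more efficient, and AllReduce is the operation that Bucketing targets.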

Author: Stanley Zheng

Link: https://s-tanley.github.io/blogs/2025/04/07/AllReduce & Bucketing/

Copyright Notice: All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
