### What this PR does / why we need it?
With dynamic EPLB, the expert mapping table and expert weights were not
actually updated, so accuracy stayed correct but load balancing had no
effect. This PR fixes the bug.
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
1. Add EPLB CI to check changes to the EPLB feature.
2. Add validation of EPLB parameters.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Tested with Qwen on A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
This PR adds support for redundant experts in the EPLB.
Key points:
- Use global_num_experts = num_experts + num_redundant_experts
consistently.
- Backward compatible when num_redundant_experts=0.
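A minimal sketch of the counting rule above, not the code from this PR. The helper names and the even split of physical expert slots across EP ranks are assumptions made for illustration.

```python
def global_expert_count(num_experts, num_redundant_experts=0):
    """Total number of physical expert slots across all ranks."""
    if num_experts <= 0 or num_redundant_experts < 0:
        raise ValueError("num_experts must be > 0 and num_redundant_experts >= 0")
    # With num_redundant_experts == 0 this reduces to num_experts,
    # which is the backward-compatible case.
    return num_experts + num_redundant_experts


def local_expert_count(global_num_experts, ep_size):
    """Slots per rank. Assumption: slots are split evenly across EP ranks."""
    if global_num_experts % ep_size != 0:
        raise ValueError("global_num_experts must be divisible by ep_size")
    return global_num_experts // ep_size
```

For example, 128 logical experts with 16 redundant experts on 16 ranks would give 144 physical slots, 9 per rank, under the even-split assumption.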
Tested on a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying the router logits shape and that requests succeed.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: yechao237 <yechao20180411@gmail.com>
### What this PR does / why we need it?
1. Record the expert map without dynamic EPLB.
2. Add export PYTHONOPTIMIZE=1 when using dynamic EPLB.
3. Update the EPLB doc.
### Does this PR introduce any user-facing change?
### How was this patch tested?
Qwen3_moe in A3.
- vLLM version: v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
This PR primarily focuses on two key changes:
1. Adjusts internal interface calls to optimize the interaction logic
between related modules.
2. Exposes an interface that allows users to select the EPLB algorithm,
enabling more flexible configuration based on specific usage scenarios.
These changes aim to enhance the usability of the system while ensuring
the stability of internal operations. Relevant unit tests have been
updated to cover the modified logic.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
When using dynamic EPLB, execution blocks on NZ tensors. We fix this
problem by cloning the source tensor and the receive tensor.
### Does this PR introduce any user-facing change?
### How was this patch tested?
Qwen3_moe in A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel balancing algorithm: FlashLB.
## Motivation
1. The default algorithm adopts a two-stage greedy strategy:
a. Replica allotment: Determine the number of expert replicas by
minimizing the maximum load per replica (Min Max Replica, MMR).
b. Replica placement: Distribute replicas across devices by repeatedly
assigning the heaviest replica to the least loaded device (Longest
Processing Time First, LPT).
However, this sequential process lacks inter-stage collaborative
optimization, often leading to suboptimal load balancing. For example,
in the simple case shown in the figure below: given 8 logical experts
with hotness values of 600, 560, 120, 120, 20, 10, 10, 10, and 2
replicas allocated per device across 8 devices, the EPLB algorithm
yields a maximum per-device hotness of 232, while our proposed FlashLB
algorithm can reduce this value to 205.
2. The default algorithm relies on the averaged expert hotness over a
fixed time window for optimization. While this provides a coarse
approximation of the hotness distribution, it fails to capture
oscillatory deviations and temporal correlations of expert hotness
observed across iterations in real-world scenarios, limiting
optimization quality.
3. The default algorithm periodically regenerates the expert placement
table. However, it generates the table for each individual layer, and
the new table does not account for correlations with the previous one;
these two factors collectively lead to nearly full-scale expert
reassignment.
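The two-stage greedy strategy described in item 1 can be sketched as follows. This is an illustrative reconstruction, not the default EPLB code: the function names, the tie-breaking (lowest index first), and the per-device slot limit are assumptions. With these choices the example from item 1 gives a maximum per-device hotness of 232.

```python
def allot_replicas(hotness, total_slots):
    """Stage 1 (MMR): give each extra replica to the expert whose
    per-replica load is currently the highest."""
    if total_slots < len(hotness):
        raise ValueError("need at least one slot per expert")
    counts = [1] * len(hotness)
    for _ in range(total_slots - len(hotness)):
        hottest = max(range(len(hotness)), key=lambda e: hotness[e] / counts[e])
        counts[hottest] += 1
    return counts


def place_replicas_lpt(hotness, counts, num_devices, slots_per_device):
    """Stage 2 (LPT): place the heaviest remaining replica on the least
    loaded device that still has a free slot."""
    replicas = sorted(
        (hotness[e] / counts[e] for e in range(len(hotness)) for _ in range(counts[e])),
        reverse=True,
    )
    loads = [0.0] * num_devices
    free = [slots_per_device] * num_devices
    for load in replicas:
        device = min(
            (d for d in range(num_devices) if free[d] > 0),
            key=lambda d: loads[d],
        )
        loads[device] += load
        free[device] -= 1
    return loads


hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = allot_replicas(hotness, total_slots=16)
loads = place_replicas_lpt(hotness, counts, num_devices=8, slots_per_device=2)
print(counts, max(loads))  # [5, 5, 1, 1, 1, 1, 1, 1] 232.0
```

Because stage 1 fixes the replica counts before stage 2 sees the devices, the placement cannot trade a replica of one hot expert for a better packing, which is the gap the joint optimization targets.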
## FlashLB Algorithm Principle
1. Joint Optimization
FlashLB achieves joint optimization of replica allotment and placement
through group-based decision-making. Each group gradually determines the
replica count and placement for a subset of experts, ensuring that the
expected inter-device load balance (considering both deployed and
pending expert replicas) is holistically optimized. To attain superior
load balancing, FlashLB employs tree search to expand the solution space
while integrating pruning and precompilation techniques for
acceleration, thereby delivering load balancing that is both
high-quality and practically efficient.
2. Multi-Shot Enhancement
FlashLB partitions each profiling interval (e.g., 1024 iterations) into
consecutive smaller sub-intervals (e.g., 16 iterations), each capturing
independent hotness measurements. It then performs multi-shot
optimization to co-optimize these sub-intervals simultaneously—enabling
adaptation to time-variant expert hotness while enhancing robustness.
3. Incremental Adjustment
To reduce the overhead of frequent expert re-deployment, FlashLB
introduces an incremental adjustment scheme operating at both
inter-layer and intra-layer levels:
a. Inter-Layer: Hotness variations are tracked at the layer level. Only
layers with fluctuations exceeding a predefined threshold trigger
re-computation of expert placement, avoiding unnecessary redeployment
for stable layers;
b. Intra-Layer (Optional): A lightweight incremental LPT algorithm
(LPT-Incremental) is applied. Instead of recomputing full placement for
all experts in a layer, it selectively adjusts only the hottest experts
or those with replica count changes, further reducing migration
overhead.
This incremental strategy significantly reduces adjustment costs while
maintaining balanced performance across layers and devices.
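As a rough illustration of the multi-shot split and the inter-layer trigger described above, the sketch below uses plain Python lists. The function names, the change metric (half the L1 distance between normalized hotness distributions), and the threshold value are assumptions, not FlashLB's actual definitions.

```python
def split_into_shots(per_iteration_hotness, shot_len):
    """Sum per-iteration hotness rows into consecutive sub-intervals ("shots")."""
    shots = []
    for start in range(0, len(per_iteration_hotness), shot_len):
        window = per_iteration_hotness[start:start + shot_len]
        # zip(*window) iterates over experts; sum the tokens each one received.
        shots.append([sum(column) for column in zip(*window)])
    return shots


def layers_to_rebalance(prev_hotness, curr_hotness, threshold):
    """Return the layers whose hotness distribution moved by more than threshold."""
    selected = []
    for layer, (prev, curr) in enumerate(zip(prev_hotness, curr_hotness)):
        prev_total = sum(prev) or 1
        curr_total = sum(curr) or 1
        # Hypothetical metric: half the L1 distance between the two
        # normalized distributions, which lies in [0, 1].
        change = 0.5 * sum(
            abs(p / prev_total - c / curr_total) for p, c in zip(prev, curr)
        )
        if change > threshold:
            selected.append(layer)
    return selected
```

For instance, `layers_to_rebalance([[10, 10], [10, 10]], [[10, 10], [20, 0]], 0.2)` selects only layer 1, so the stable layer 0 would keep its current placement.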
## Co-author:
Co-authored-by: Skywalker-EP <173723846@qq.com>
- vLLM version: v0.10.2
- vLLM main:
9607d5eb44
---------
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: zhanghw0354 <zhanghaiwencmss@139.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: rjg-lyh <83491835+rjg-lyh@users.noreply.github.com>
Co-authored-by: weichen <132029610+Pr0Wh1teGivee@users.noreply.github.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
### Motivation
Currently, dynamic expert load balancing is stop-the-world.
Asynchronous expert load balancing would avoid the following
problems:
Host-bound latency:
EPLB involves many CPU operations, such as running the EPLB algorithm,
creating P2P ops, and converting the log2phy expert map, which take a
long time on the CPU (~1 s).
Communication latency: Transfers are expensive without NVLink. Because
the weights of one expert may be transferred to multiple new positions,
a single expert can require N send/recv operations, which results in
long latency. In our tests, batch_isend_irecv took more than 100 ms to
transfer the weights of 16 experts on an Ascend A2 server.
SwiftBalancer no longer stops the world: in our tests on NPU it costs
1~2 ms per layer while improving decode latency by 5 ms-8 ms with
ep_size = 64.
The following updates have been made:
1. Expert distribution recording at lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python
operations.
3. A new EPLB algorithm that rebalances fewer experts with almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record the expert distribution during forward. We use
expert_token_num after dispatch instead of topk_ids, which gives a much
smaller tensor and reduces the cost of HBM recording and the add
operator.
2. All-gather the expert distribution. All-gather is used instead of
all-reduce because it has less traffic volume.
3. Wake up the EPLB worker process with the expert distribution when
num_iterations is reached. Run the EPLB algorithm in the EPLB worker.
4. Generate the P2P send/recv ops and perform other operations, such as
the log2phy conversion, that take a long time on the CPU.
5. Launch ibatch_send_recv in async_stream before forward.
6. After forward, wait for ibatch_send_recv to finish, then update the
expert map and expert weights.
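The control flow of the steps above can be sketched in a single process with a background thread. This is only a stand-in for the real design: the class and method names are hypothetical, the thread replaces the EPLB worker process, `rebalance` replaces the EPLB algorithm, and the all-gather and P2P weight transfer are omitted.

```python
import queue
import threading


def rebalance(hotness):
    """Stand-in for the EPLB algorithm: order experts by descending hotness."""
    return sorted(range(len(hotness)), key=lambda e: -hotness[e])


class AsyncRebalancer:
    def __init__(self, num_experts, num_iterations):
        self.hotness = [0] * num_experts
        self.num_iterations = num_iterations
        self.step = 0
        self.expert_map = list(range(num_experts))
        self._requests = queue.Queue()
        self._results = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Runs off the critical path, so the forward loop is never blocked
        # by the CPU-heavy planning work.
        while True:
            self._results.put(rebalance(self._requests.get()))

    def record(self, expert_token_num):
        """Accumulate per-expert token counts; wake the worker periodically."""
        for expert, tokens in enumerate(expert_token_num):
            self.hotness[expert] += tokens
        self.step += 1
        if self.step % self.num_iterations == 0:
            self._requests.put(self.hotness[:])
            self.hotness = [0] * len(self.hotness)

    def apply_pending(self, timeout=None):
        """After forward: swap in a finished plan. Returns True if one was applied."""
        try:
            if timeout is None:
                self.expert_map = self._results.get_nowait()
            else:
                self.expert_map = self._results.get(timeout=timeout)
        except queue.Empty:
            return False
        return True
```

In a serving loop, `record` would be called once per forward pass and `apply_pending()` without a timeout right after it, so an unfinished plan simply leaves the current expert map in place.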
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan <yuanjl19@smail.nju.edu.cn>
Co-authored-by: qmkakaxi <wjh1594260677@qq.com>
Co-authored-by: Skywalker-EP <173723846@qq.com>
- vLLM version: v0.10.2
- vLLM main:
567939953b
---------
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>