# Feature: FlashLB algorithm
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel load balancing algorithm: FlashLB.
1. The default algorithm adopts two separate sub-procedures to optimize
expert replication and placement independently:
a. **Expert Replica Allotment Sub-procedure** : Determines the number of
replicas for all experts. At each step, it greedily adds one more
replica to the expert with the highest per-replica load, aiming to
minimize load skew at the expert replica granularity (Min Max Replica,
MMR).
b. **Expert Replica Placement Sub-procedure** : Distributes all replicas
across devices. First, it sorts the generated replicas in descending
order of hotness, then iteratively places the currently hottest replica
onto the device with the lowest cumulative load and available slots.
However, this simplistic combination of two separate procedures lacks
synergy and often leads to sub-optimal load balancing (a runnable sketch
of this two-phase heuristic follows this list). For example, in the
simple scenario illustrated below, given 8 logical experts with hotness
values [600, 560, 120, 120, 20, 10, 10, 10] and 2 replica slots per
device across 8 devices, the default EPLB algorithm results in a maximum
per-device hotness of 232 (peak-to-average load ratio 1.28), while our
proposed FlashLB algorithm reduces this value to 205 (peak-to-average
load ratio 1.13).
<figure><img
src="https://github.com/user-attachments/assets/b9b10fab-651e-4524-9942-adbca8d044a4"
width="90%"></figure>
2. The default algorithm simply aggregates hotness measurements across
the entire profiling window. While this provides a coarse approximation
of the hotness distribution, it fails to capture the time-phased
variations and temporal correlations in expert hotness (both within and
between experts) across iterations—phenomena that have been observed in
real-world scenarios. Such single-point hotness estimation degrades the
solution quality of the load balancing algorithm.
3. The default algorithm regularly recalculates updated expert placement
results for all layers without discrimination. Considering that
excessive expert updates can impact Service Level Objectives (SLOs),
such full-scale redeployment leads to excessively high adjustment
overhead, which negatively affects end-to-end performance.
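For concreteness, here is a minimal Python sketch of the default
two-phase heuristic described in point 1 (the function name and data
layout are ours, not the actual EPLB implementation). On the 8-expert
example above it reproduces the quoted peak device load of 232:

```python
def default_eplb(hotness, num_devices, slots_per_device):
    """Greedy two-phase heuristic: MMR replica allotment, then
    hottest-first placement onto the least-loaded device."""
    total_slots = num_devices * slots_per_device
    replicas = [1] * len(hotness)  # every expert starts with one replica
    # Phase 1 (Min Max Replica): give each spare slot to the expert
    # whose per-replica load is currently highest.
    for _ in range(total_slots - len(hotness)):
        e = max(range(len(hotness)), key=lambda i: hotness[i] / replicas[i])
        replicas[e] += 1
    # Phase 2: list all replicas, sorted by per-replica hotness, descending.
    shards = sorted(
        ((hotness[e] / replicas[e], e)
         for e in range(len(hotness)) for _ in range(replicas[e])),
        reverse=True)
    loads = [0.0] * num_devices
    free = [slots_per_device] * num_devices
    placement = [[] for _ in range(num_devices)]
    for load, e in shards:
        # place on the least-loaded device that still has a free slot
        d = min((d for d in range(num_devices) if free[d] > 0),
                key=lambda d: loads[d])
        placement[d].append(e)
        loads[d] += load
        free[d] -= 1
    return placement, loads

hotness = [600, 560, 120, 120, 20, 10, 10, 10]
_, loads = default_eplb(hotness, num_devices=8, slots_per_device=2)
print(max(loads))  # 232.0, i.e. a 1.28 peak-to-average ratio
```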
## FlashLB Algorithm Principle
### 1. Joint Optimization of Replica Allotment and Placement
FlashLB achieves joint optimization of replica allotment and placement
through a novel tree search approach, combined with carefully designed
efficient pruning and lightweight look-ahead estimation. We partition
all experts into several subsets and, for each subset, hierarchically
determine the optimal replica count and placement. Throughout the
search, the pruning and look-ahead estimation keep the process focused
on the globally expected inter-device load balance (considering both
deployed and unexplored experts) while ensuring sufficient computational
efficiency. Additionally, precompilation techniques are employed for
acceleration, delivering load balancing that is both high-quality and
practically efficient.
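The exact tree search is not spelled out above, so the following is only
an illustrative sketch, under our own assumptions, of the core idea:
branch jointly on the replica count and placement of each expert, and
prune branches using a look-ahead score that optimistically spreads the
not-yet-placed hotness. The name `flashlb_sketch` and the beam-style
pruning are ours; the real FlashLB works hierarchically on expert
subsets with stronger pruning and precompilation.

```python
def flashlb_sketch(hotness, num_devices, slots_per_device, beam=8):
    """Illustrative joint search: handle experts hottest-first, branch on
    the replica count, place shards greedily, and keep only the `beam`
    states with the best look-ahead score (optimistic final peak load)."""
    order = sorted(range(len(hotness)), key=lambda e: -hotness[e])
    # state: (per-device loads, per-device free slots)
    states = [((0.0,) * num_devices, (slots_per_device,) * num_devices)]
    for idx, e in enumerate(order):
        remaining = sum(hotness[x] for x in order[idx + 1:])
        candidates = []
        for loads, slots in states:
            free_total = sum(slots)
            reserved = len(order) - idx - 1  # one slot per unplaced expert
            for r in range(1, free_total - reserved + 1):
                l, s = list(loads), list(slots)
                for _ in range(r):  # greedy placement of the r shards
                    d = min((d for d in range(num_devices) if s[d] > 0),
                            key=lambda d: l[d])
                    l[d] += hotness[e] / r
                    s[d] -= 1
                # look-ahead: assume the remaining hotness spreads evenly,
                # an optimistic estimate of the final peak load
                score = max(max(l), (sum(l) + remaining) / num_devices)
                candidates.append((score, tuple(l), tuple(s)))
        candidates.sort(key=lambda c: c[0])  # prune: keep `beam` best states
        states = [(l, s) for _, l, s in candidates[:beam]]
    return min(max(l) for l, _ in states)  # best achieved peak device load
```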
### 2. Multi-Episode Enhancement
Instead of performing full-duration averaging like the default
algorithm, FlashLB partitions each profiling interval (e.g., 1024
iterations) into multiple consecutive smaller episodes (e.g., 16
iterations). This preserves hotness fluctuation and correlation
information. It then constructs a multi-objective optimization problem
to co-optimize these episodes simultaneously, enabling adaptability to
interleaved hotness patterns and improving statistical robustness.
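A small sketch of the episode construction and one possible
scalarization of the multi-objective problem; `episode_len`, the array
shapes, and the max-over-episodes scoring are our assumptions, not
necessarily the objective FlashLB ships with:

```python
import numpy as np

def episode_hotness(trace, episode_len=16):
    """Split a per-iteration hotness trace of shape (iters, experts) into
    consecutive episodes and sum within each, instead of collapsing the
    whole profiling window into one averaged vector."""
    iters, experts = trace.shape
    n = iters // episode_len
    return trace[:n * episode_len].reshape(n, episode_len, experts).sum(axis=1)

def multi_episode_score(per_device_loads):
    """Score a candidate placement by its worst peak-to-average ratio
    across episodes, so every episode is balanced simultaneously.
    per_device_loads has shape (episodes, devices)."""
    peaks = per_device_loads.max(axis=1)
    means = per_device_loads.mean(axis=1)
    return float((peaks / means).max())
```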
### 3. Layer-wise Cherry-Picking Redeployment
To reduce the overhead of frequent expert redeployment, FlashLB
introduces a cherry-picking redeployment scheme. During each algorithmic
decision cycle, it tracks the load balance degree of all layers in real
time and triggers expert placement updates only for those layers whose
peak-to-average ratio exceeds a predefined threshold. This avoids
unnecessary redeployment of stable layers, significantly reducing
adjustment overhead and thereby improving end-to-end performance.
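A minimal sketch of the trigger logic (the threshold value and the
function name are illustrative, not the shipped defaults):

```python
def layers_to_redeploy(layer_device_loads, threshold=1.15):
    """Return only the layers whose current peak-to-average device load
    exceeds the threshold; stable layers keep their existing placement."""
    picked = []
    for layer, loads in enumerate(layer_device_loads):
        mean = sum(loads) / len(loads)
        if mean > 0 and max(loads) / mean > threshold:
            picked.append(layer)
    return picked
```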
## Co-author:
Co-authored-by: Skywalker-EP 173723846@qq.com
This PR mainly introduces two key optimizations for load balancing
scheduling:
1. **Add per-step heat collection function**:
Support real-time collection of per-step heat information during model
inference. This enables more fine-grained load balancing decisions by
taking per-step heat as the optimization target, improving scheduling
accuracy for dynamic and fluctuating workloads (see the sketch after
this list).
2. **Update FlashLB algorithm**:
Upgrade the FlashLB scheduling logic to better adapt to multi-stage heat
distribution scenarios. The improved algorithm can comprehensively
perceive and utilize multi-stage heat characteristics, achieving more
stable and efficient load balancing under complex expert deployment and
dynamic traffic patterns.
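As a rough illustration of point 1 (the class and method names are
hypothetical, not the actual vLLM-Ascend API), a per-step collector
keeps one load vector per forward step rather than a single aggregated
counter:

```python
import torch

class StepHeatCollector:
    """Keep one expert-load vector per forward step so the scheduler can
    see fluctuations that a window-level sum would average away."""
    def __init__(self, num_experts: int):
        self.num_experts = num_experts
        self.steps: list[torch.Tensor] = []

    def record(self, expert_token_num: torch.Tensor) -> None:
        # expert_token_num[i] = tokens routed to expert i in this step
        self.steps.append(expert_token_num.detach().clone())

    def window(self) -> torch.Tensor:
        # shape (steps, experts): the raw material for episode building
        return torch.stack(self.steps)
```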
---------
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
Signed-off-by: xuzewei28 <xuzewei2@h-partners.com>
Co-authored-by: xuzewei28 <xuzewei2@h-partners.com>
### What this PR does / why we need it?
To intuitively show the effect of the EPLB algorithm, we print the
expert heat before and after EPLB.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
To analyze the overhead of the dynamic EPLB adjustment framework in
detail, we added the time consumed by the adjustment to the information
printed in profiling mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
#6043 deleted the forward_before phase of dynamic EPLB. Currently, only
the end-to-end precision is monitored in the UT, and no log is printed
at the key places; as a result, EPLB not taking effect went
unintercepted.
1. Add the forward_before function back.
2. Delete unnecessary logs and add key logs.
3. Add warm-up for algorithm 3.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

#### The conversation is normal.
> Okay, the user is asking, "What is deep learning?" I need to explain
> this in a clear and concise way. Let me start by recalling what I know
> about deep learning. It's a subset of machine learning, right? So first,
> I should mention that it's part of machine learning, which itself is a
> branch of AI. Then, the key aspect of deep learning is the use of neural
> networks with multiple layers. These are called deep neural networks.
>
> Wait, I should define neural networks first. Maybe start with the
> basics. A neural network is inspired by the human brain, with layers of
> nodes (neurons) that process data. But deep learning specifically refers
> to networks with many layers—hence "deep." So the term "deep" comes from
> the number of layers.
>
> I should explain how deep learning works. It involves training these
> networks on large datasets, allowing them to automatically learn
> features from the data. Unlike traditional machine learning, where you
> might have to manually extract features, deep learning models can do
> this automatically. That's a key point. For example, in image
> recognition, a deep learning model can learn to detect edges, shapes,
> and then more complex patterns without human intervention.
>
> Applications are important too. The user might want to know where deep
> learning is used. Common examples include image and speech recognition,
> natural language processing, autonomous vehicles, and recommendation
> systems. Maybe mention specific technologies like self-driving cars
> using computer vision or virtual assistants like Siri or Alexa
- vLLM version: v0.15.0
- vLLM main:
13397841ab
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
1. Incorporate the EPLB warm-up into the profile run.
2. Reuse the same gather buffer.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
qwen3-235b AIME baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
With EPLB, the OOM issue does not occur:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts, because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move the EPLB config into a dict in the additional config.
6. Depends on PR 5817.
### Does this PR introduce _any_ user-facing change?
Before this PR:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`
After this PR:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`
### How was this patch tested?
#### Test: qwen3-235b EPLB with num_redundant_experts=16
Without PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
With PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Add log info for the MOE_load imbalance ratio.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Fix OOM of the deepseek-eplb nightly test.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
1. Add EPLB CI to check changes to the EPLB feature.
2. Add parameter checking for EPLB params.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Qwen in A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
1. Record the expert map without dynamic EPLB.
2. Add `export PYTHONOPTIMIZE=1` when using dynamic EPLB.
3. Change the EPLB doc.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Qwen3_moe in A3.
- vLLM version: v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### Motivation
Currently, dynamic expert balancing stops the world. Asynchronous expert
load balancing would be better, avoiding the following problems:
Host-bound latency: EPLB involves many CPU operations, such as running
the EPLB algorithm, creating p2p ops, and log2phy expert conversion,
which take a long CPU time (around 1s).
Communication latency: transfer time is costly without NVLink. Because
the weight of an expert may be transferred to multiple new positions,
one expert needs N send/recv operations, which results in long latency.
We measured that batch_isend_irecv costs more than 100ms to transfer the
weights of 16 experts on an Ascend A2 server.
SwiftBalancer no longer stops the world: in our tests on NPU, it costs
1~2ms per layer while saving 5ms-8ms of decode latency with ep_size =
64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python
operations.
3. A new EPLB algorithm that rebalances fewer experts with almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record the expert distribution during forward. We use
expert_token_num after dispatch instead of topk_ids, so we get a much
smaller tensor shape, reducing the cost of HBM recording and the add
operator.
2. All-gather the expert distribution. We use all-gather instead of
all-reduce because it has a lower traffic volume.
3. Wake up the EPLB worker process with the expert distribution when
num_iterations is reached, and run the EPLB algorithm in the EPLB
worker.
4. Generate the p2p send/recv ops and other operators, such as log2phy
conversion, that would otherwise cost long CPU time, in the EPLB worker.
5. Launch ibatch_send_recv on an async stream before forward.
6. After forward, wait for ibatch_send_recv to finish, then update the
expert map and expert weights (a sketch follows this list).
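As a rough illustration of steps 5 and 6, using the batch_isend_irecv
API mentioned above (the side-stream handling, the `(tensor, peer)` spec
lists, and the function names are our assumptions; the real
implementation targets Ascend NPU streams):

```python
import torch
import torch.distributed as dist

def launch_expert_transfer(send_specs, recv_specs, side_stream):
    """Before forward: enqueue all expert-weight sends/recvs as one batch
    on a side stream so the transfer overlaps with computation.
    send_specs / recv_specs: lists of (tensor, peer_rank)."""
    ops = [dist.P2POp(dist.isend, t, peer) for t, peer in send_specs]
    ops += [dist.P2POp(dist.irecv, t, peer) for t, peer in recv_specs]
    with torch.cuda.stream(side_stream):  # NPU stream on Ascend
        return dist.batch_isend_irecv(ops)

def finish_expert_transfer(reqs):
    """After forward: wait for the batched transfer to complete; the
    caller then swaps in the new expert map and expert weights."""
    for req in reqs:
        req.wait()
```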
### Co-author
Co-authored-by: raindaywhu raindaywhu@163.com
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com
- vLLM version: v0.10.2
- vLLM main:
567939953b
---------
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>