Commit Graph

21 Commits

Author SHA1 Message Date
SILONG ZENG
4e53c1d900 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6) (#6001)
### What this PR does / why we need it?
Convert the following files under `vllm_ascend/` to ruff format:
| File Path |
| :--- |
| `vllm_ascend/eplb/adaptor/abstract_adaptor.py` |
| `vllm_ascend/eplb/adaptor/vllm_adaptor.py` |
| `vllm_ascend/eplb/core/eplb_device_transfer_loader.py` |
| `vllm_ascend/eplb/core/eplb_utils.py` |
| `vllm_ascend/eplb/core/eplb_worker.py` |
| `vllm_ascend/eplb/core/policy/policy_abstract.py` |
| `vllm_ascend/eplb/core/policy/policy_default_eplb.py` |
| `vllm_ascend/eplb/core/policy/policy_factory.py` |
| `vllm_ascend/eplb/core/policy/policy_flashlb.py` |
| `vllm_ascend/eplb/core/policy/policy_random.py` |
| `vllm_ascend/eplb/core/policy/policy_swift_balancer.py` |
| `vllm_ascend/eplb/eplb_updator.py` |
| `vllm_ascend/eplb/utils.py` |
| `vllm_ascend/model_loader/netloader/executor/elastic_load.py` |
| `vllm_ascend/model_loader/netloader/executor/netloader_pg.py` |
| `vllm_ascend/model_loader/netloader/interaction/elastic.py` |
| `vllm_ascend/model_loader/netloader/load.py` |
| `vllm_ascend/model_loader/netloader/netloader.py` |
| `vllm_ascend/model_loader/netloader/utils.py` |
| `vllm_ascend/patch/platform/__init__.py` |
| `vllm_ascend/patch/platform/patch_balance_schedule.py` |
| `vllm_ascend/patch/platform/patch_ec_connector.py` |
| `vllm_ascend/patch/platform/patch_mamba_config.py` |
| `vllm_ascend/patch/platform/patch_multiproc_executor.py` |
| `vllm_ascend/patch/platform/patch_sched_yield.py` |


- vLLM version: v0.13.0
- vLLM main: 2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
LI SHENGYONG
8210a62a44 [EPLB][Bugfix]Reduce unnecessary video memory usage (#6020)
### What this PR does / why we need it?
1. Incorporate the EPLB warm-up into the profile run.
2. Reuse the same gather buffer (a sketch follows below).
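A minimal sketch of the buffer-reuse idea, assuming a hypothetical `GatherBuffer` helper (not the actual vllm-ascend code): allocate once at profile-run size, then hand out views instead of allocating per gather.

```python
import math
import torch

class GatherBuffer:
    """Hypothetical helper: one backing buffer, allocated once (e.g.
    during the profile run) and reused for every layer's expert-weight
    gather instead of allocating a fresh tensor per call."""

    def __init__(self) -> None:
        self._buf: torch.Tensor | None = None

    def get(self, shape: tuple[int, ...], dtype: torch.dtype,
            device: str = "cpu") -> torch.Tensor:
        numel = math.prod(shape)
        if (self._buf is None or self._buf.numel() < numel
                or self._buf.dtype != dtype):
            self._buf = torch.empty(numel, dtype=dtype, device=device)
        return self._buf[:numel].view(shape)
```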

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen3-235B AIME baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

With EPLB enabled, the OOM issue does not occur:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main: 2c24bc6996

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-23 14:21:13 +08:00
LI SHENGYONG
83de5385b4 [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897)
### What this PR does / why we need it?
1. Rename dynamic_ep to default_eplb.
2. Rename dynamic_ep_v2 to swift_balancer.
3. Discard the function compose_expert_update_info_bipartite.

- vLLM version: v0.13.0
- vLLM main: bde38c11df

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-19 05:47:40 +00:00
LI SHENGYONG
9fed2636cb [EPLB][Nightly][Bugfix] Get expert from moe layer only (#5908)
### What this PR does / why we need it?
1. If the model has dense layers, the current code attempts to obtain the
routing experts of the dense layers as well, which causes an error. Fix
this by skipping the dense layers when collecting the routing experts
(see the sketch below).
2. Directly outputting the global_expert_map from the function affects
the performance of DeepSeek V3.2.
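A minimal sketch of the skip, assuming hypothetical model attributes (decoder layers exposing `mlp.experts` only on MoE layers); the real vllm-ascend adaptor differs:

```python
def collect_routed_experts(model):
    """Collect routed-expert modules layer by layer, skipping dense
    layers that expose no `experts` attribute (hypothetical names)."""
    experts_per_layer = {}
    for idx, layer in enumerate(model.model.layers):
        mlp = getattr(layer, "mlp", None)
        if mlp is None or not hasattr(mlp, "experts"):
            continue  # dense layer: no routed experts to collect
        experts_per_layer[idx] = mlp.experts
    return experts_per_layer
```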
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

DeepSeek V3.1 conversation is normal.

#### aime precision test (dsv3.1)
Baseline without EPLB:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 66.67 |

With EPLB:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 70.00 |

- vLLM version: v0.13.0
- vLLM main: 11b6af5280

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-19 09:23:28 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depends on PR 5817.

### Does this PR introduce _any_ user-facing change?

Before this PR:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

After this PR:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### Test: Qwen3-235B EPLB with num_redundant_experts=16

Without PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

With PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main: 45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
LI SHENGYONG
ecf2fa482e [EPLB][Bugfix] Get expert map from layers (#5817)
### What this PR does / why we need it?
The initialization method of expert_map used by the eplb module is
different from that used by the fused_moe module. This PR deletes the
expert_map initialization method used by the eplb module to make the
initialization methods consistent.

#### Before bugfix

```
self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63], device='npu:1', dtype=torch.int32)

self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32)
```

### How was this patch tested?

#### Qwen3-235B W8A8 AIME
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-14 09:16:51 +08:00
Mercykid-bash
29e2f9a43e Bugfix: Align expert map shapes with redundant experts in EPLB adjustment (#5285)
#### Overview
This PR fixes a shape mismatch bug between `expert_placement_map` and
`log2phy_expert_map` when **redundant experts** are enabled in the
vLLM-Ascend platform. The issue occurred during the initialization of
expert maps and their updates via EPLB (Expert Load Balancer)
adjustment, leading to potential tensor shape errors and incorrect
expert routing in distributed MoE deployments.

#### Key Changes
1. **Unify expert map shape calculation logic**
- Ensure the shape of `expert_placement_map` and `log2phy_expert_map`
strictly aligns with the total number of experts (including redundant
experts) during initialization.
- Update the shape adjustment logic in EPLB dynamic update process to
match the initial expert map dimensions.

2. **Add shape consistency checks**
- Add assertion statements to verify the shape consistency of the two
maps after initialization and EPLB adjustment, preventing silent shape
mismatches in subsequent operations (a sketch follows this list).
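A hedged sketch of the kind of check described, with hypothetical names; both maps are sized by the physical expert count (logical plus redundant):

```python
def check_expert_map_shapes(expert_placement_map, log2phy_expert_map,
                            num_logical_experts, num_redundant_experts):
    """Assert both maps cover every physical expert slot, so later EPLB
    updates cannot silently drift apart (hypothetical shape layout)."""
    num_physical = num_logical_experts + num_redundant_experts
    assert expert_placement_map.shape[-1] == num_physical, (
        f"placement map covers {expert_placement_map.shape[-1]} slots, "
        f"expected {num_physical}")
    assert log2phy_expert_map.shape[-1] == num_physical, (
        "log2phy map must stay aligned with the placement map")
```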

#### Impact
- Resolves tensor shape errors when using redundant experts with EPLB on
Ascend platform.
- Ensures correct expert routing and load balancing for MoE models with
redundant expert configurations.
- No breaking changes to existing functionality; compatible with
non-redundant expert deployments.

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-06 17:22:36 +08:00
LI SHENGYONG
bdc721d35a [smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (#5521)
### What this PR does / why we need it?
The float kernel of moe_init_routing_v2 in the dispatch allgather
operation does not support tensor format for active_expert_range; it
only supports int.
PR 5311 unified the variables local_num_experts and
self.local_num_experts by using self.local_num_experts consistently,
which caused the subsequent integer-typed parameter to be converted to a
tensor (see the illustration below).
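A hedged illustration of the fix (values hypothetical): coerce the count back to a plain Python int before building active_expert_range.

```python
import torch

# After PR 5311, the count could arrive as a 0-d tensor:
local_num_experts = torch.tensor(8)
ep_rank = 1

# The float kernel of moe_init_routing_v2 only accepts ints here:
n = int(local_num_experts)
active_expert_range = [ep_rank * n, (ep_rank + 1) * n]  # [8, 16]
```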

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
| dataset | metric | ground truth | measured |
| :--- | :--- | :--- | :--- |
| gsm8k | exact_match,strict-match | 0.89 | 0.8939 |
| gsm8k | exact_match,flexible-extract | 0.85 | 0.856 |
| ceval-valid | acc,none | 0.84 | 0.8373 |

Model parameters:
`{'pretrained': 'Qwen/Qwen3-30B-A3B', 'tensor_parallel_size': 2, 'dtype': 'auto', 'trust_remote_code': False, 'max_model_len': 4096, 'gpu_memory_utilization': 0.6, 'enable_expert_parallel': True}`

- vLLM version: v0.13.0
- vLLM main: 45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-31 09:19:04 +08:00
LI SHENGYONG
f81cf694b2 [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (#5311)
### What this PR does / why we need it?
Unify the loading logic for expert_map and log2phy.
1. The map generated when redundant experts are enabled is incorrect.
The community's map-generation function only accepts the number of
global experts. When we pass in the number of logical experts plus
redundant experts, the local expert ID of the last card indexes an
expert ID that does not exist. Now we ensure that every index points to
a real, existing expert ID and that each expert can be accessed.
Moreover, when redundant experts are not enabled, the output of our
function remains consistent with the community's function.
2. The map we generate is based on the length of the physical experts,
but in reality we only need the length of the logical experts. Since we
pad it later anyway, we can simply generate a map with the length of the
logical experts.
3. Unify the initialization logic across different scenarios and
simplify the code for fused_moe.

**Before refactoring**

-   map path is not None:

expert map: get_rank_placement_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.

log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.

-   map path is None:

expert map: determine_expert_map from _'vllm.layer'_, which does not
support the redundant experts of vllm-ascend.
log2phy: determine_default_log2phy_map from _'eplb_utils.py'_, which
does not support the redundant experts of vllm-ascend either.

**Refactoring**
- eplb_utils.py
  - init_eplb_config
    - generate placement
    - generate expert map
    - generate log2phy
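A toy sketch of the improved generation described in point 1 above (round-robin placement, with redundant slots wrapped back onto real logical IDs via modulo); hypothetical code, not the actual init_eplb_config:

```python
import numpy as np

def build_rank_expert_map(num_logical: int, num_redundant: int,
                          ep_size: int, rank: int) -> np.ndarray:
    """Map logical expert id -> local slot on `rank` (-1 if absent).
    Redundant slots wrap onto real ids, so every index points at an
    expert that actually exists and each expert stays reachable."""
    num_physical = num_logical + num_redundant
    per_rank = num_physical // ep_size
    expert_map = np.full(num_logical, -1, dtype=np.int32)
    start = rank * per_rank
    for local_id in range(per_rank):
        expert_map[(start + local_id) % num_logical] = local_id
    return expert_map

# ep size 16, 256 experts, 16 redundant: rank 15 also hosts a replica of
# expert 255; with 0 redundant experts it matches the community layout.
print(build_rank_expert_map(256, 16, 16, 15))
```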

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Expert Mapping Test Generation:

```
ep size: 16, num of experts: 256, num of redundant experts: 16
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  1  2  3  4  5  6  7  8
  9 10 11 12 13 14 15 16]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

ep size: 16, num of experts: 256, num of redundant experts: 0
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
```

DeepSeek R1 baseline:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |

DeepSeek R1 with EPLB:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |


- vLLM version: release/v0.13.0
- vLLM main: 5fbfa8d9ef

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-29 09:26:14 +08:00
wangxiyuan
492173cf89 [Misc] Cleanup useless print and logger (#5220)
1. Remove useless print statements.
2. Use the vLLM logger (see the sketch below).
3. Demote noisy INFO logs to DEBUG level.
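A small sketch of the pattern, assuming vLLM's `vllm.logger.init_logger` helper:

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

def report_expert_map_update(expert_map) -> None:
    # Before: print(f"expert map updated: {expert_map}")
    # After: routed through the vLLM logger, demoted from INFO to DEBUG.
    logger.debug("expert map updated: %s", expert_map)
```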

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-22 11:28:26 +08:00
Mercykid-bash
84b9d38e28 BugFix: Resolve PolicyFlashlb warm up function attribute error (#4741)
## Description
Fix the AttributeError caused by incorrect invocation of the warm-up
function in the FlashLB algorithm:
1. **Root Cause**: The warm-up function for FlashLB is defined outside
the `PolicyFlashlb` class (not a class method), but the code incorrectly
attempted to call it via the `PolicyFlashlb` class instance.
2. **Key Fix**: Clarify the invocation rule for FlashLB: when selecting
the FlashLB algorithm, the warm-up function must be called in advance to
precompile and warm up the algorithm (invoked as a standalone function),
instead of calling it through the `PolicyFlashlb` class.
3. **Impact**: Resolve the runtime error when using FlashLB, ensure the
algorithm pre-compilation/warm-up process works as expected, and avoid
attribute missing exceptions.
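A minimal sketch of the invocation rule with hypothetical names (the real module layout may differ): the warm-up helper lives at module level, so it must be called standalone rather than through the class.

```python
# policy_flashlb.py -- hypothetical layout illustrating the rule

def flashlb_warm_up() -> None:
    """Module-level warm-up: precompile/warm the FlashLB kernels."""
    ...

class PolicyFlashlb:
    def rebalance_experts(self, expert_hotness):
        ...

# Correct: warm up via the standalone function, then use the policy.
flashlb_warm_up()
policy = PolicyFlashlb()
# policy.flashlb_warm_up()  # AttributeError: PolicyFlashlb has no such method
```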

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
2025-12-12 14:55:26 +08:00
wangxiyuan
b89763f1ed [CI] speed up ut (#4901)
Avoid model downloads to speed up the UT tests.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 18:45:43 +08:00
LI SHENGYONG
019c7ded91 eplb redundant expert bugfix (#4291)
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the expert_map path, users do not need to configure
init_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.


- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:24:35 +08:00
欧派果奶我还要
c848da0687 [Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223)
### What this PR does / why we need it?
Fix the nightly multi-node EPLB tests by adjusting the dynamic_eplb gate
check in vllm_ascend/eplb/core/eplb_utils.py (see the sketch below).
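A hedged sketch of the kind of gate the fix adjusts (only the DYNAMIC_EPLB variable name comes from the test setup; the parsing is hypothetical):

```python
import os

def dynamic_eplb_enabled() -> bool:
    # Accept common truthy spellings so "DYNAMIC_EPLB=true" takes effect.
    return os.getenv("DYNAMIC_EPLB", "0").strip().lower() in ("1", "true", "yes")
```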
### Does this PR introduce _any_ user-facing change?
no

- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 21:31:58 +08:00
offline893
14ca1e5cb2 [CI]Fix oom of deepseek-eplb nigtly test. (#3884)
### What this PR does / why we need it?
Fix the OOM in the deepseek-eplb nightly test.

- vLLM version: v0.11.0rc3
- vLLM main: 83f478bb19

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-30 10:18:07 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1. Add an EPLB CI job to catch regressions in the EPLB feature.
2. Add parameter checking for the EPLB params (a sketch follows below).
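A hedged sketch of the sort of parameter check added (config keys taken from the EPLB commits above; the concrete checks are hypothetical):

```python
def check_eplb_params(additional_config: dict) -> None:
    """Reject obviously invalid EPLB settings before engine start-up."""
    if not additional_config.get("dynamic_eplb", False):
        return
    if additional_config.get("num_iterations_eplb_update", 1) <= 0:
        raise ValueError("num_iterations_eplb_update must be positive")
    if additional_config.get("init_redundancy_expert", 0) < 0:
        raise ValueError("init_redundancy_expert cannot be negative")
```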
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen in A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
yechao237
4750d45d86 [BugFix]Support redundant experts in EPLB (#3473)
This PR adds support for redundant experts in the EPLB. 

Key points: 
- Use global_num_experts = num_experts + num_redundant_experts
consistently.
- Backward compatible when num_redundant_experts=0. 

Tested on a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying the router logits shape and successful requests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: yechao237 <yechao20180411@gmail.com>
2025-10-18 00:09:16 +08:00
Mercykid-bash
ecb1713dfc Bugfix: Expose the user policy type interface (#3336)
This PR primarily focuses on two key changes:
1. Adjusts internal interface calls to optimize the interaction logic
between related modules.
2. Exposes an interface that allows users to select the EPLB algorithm,
enabling more flexible configuration based on specific usage scenarios.

These changes aim to enhance the usability of the system while ensuring
the stability of internal operations. Relevant unit tests have been
updated to cover the modified logic.
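A hedged usage sketch (the `eplb_policy_type` key is hypothetical; `additional_config` is vLLM's engine argument for platform-specific options, as in the commits above):

```python
from vllm import LLM

# Pick the EPLB algorithm at start-up via the exposed policy interface.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    additional_config={
        "dynamic_eplb": True,
        "eplb_policy_type": 2,  # hypothetical key selecting a registered policy
    },
)
```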

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
2025-10-11 16:28:57 +08:00
offline893
82b6c846ca [BugFix]Fix eplb problems when using dynamic eplb. (#3364)
### What this PR does / why we need it?
When using dynamic EPLB, the weight transfer can block on NZ-format
tensors. We fix this by cloning the source and receive tensors (a sketch
follows below).
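A hedged sketch of the clone-before-transfer idea (function and call sites hypothetical): copy into fresh contiguous buffers so the P2P ops never touch NZ-format storage.

```python
import torch
import torch.distributed as dist

def exchange_expert_weight(weight: torch.Tensor, peer: int) -> torch.Tensor:
    """Swap one expert's weight with `peer` inside an initialized
    process group, cloning both sides to plain contiguous buffers."""
    send_buf = weight.clone().contiguous()   # detach from NZ-format storage
    recv_buf = torch.empty_like(send_buf)
    reqs = dist.batch_isend_irecv([
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ])
    for req in reqs:
        req.wait()
    return recv_buf
```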

### Does this PR introduce any user-facing change?

### How was this patch tested?
Qwen3_moe in A3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-11 14:04:02 +08:00
Mercykid-bash
29c173ab48 FlashLB algorithm (#3042)
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel balancing algorithm: FlashLB.

## Motivation
1. The default algorithm adopts a two-stage greedy strategy: 
a. Replica allotment: Determine the number of expert replicas by
minimizing the maximum load per replica (Min Max Replica, MMR).
b. Replica placement: Distribute replicas across devices by repeatedly
assigning the heaviest replica to the least loaded device (Longest
Processing Time First, LPT).

However, this sequential process lacks inter-stage collaborative
optimization, often leading to suboptimal load balancing. For example,
consider the following simple case (reproduced in the code sketch after
this list): given 8 logical experts with hotness values of 600, 560,
120, 120, 20, 10, 10, 10, and 2 replicas allocated per device across 8
devices, the EPLB algorithm yields a maximum per-device hotness of 232,
while our proposed FlashLB algorithm reduces this value to 205.

2. The default algorithm relies on the averaged expert hotness over a
fixed time window for optimization. While this provides a coarse
approximation of the hotness distribution, it fails to capture
oscillatory deviations and temporal correlations of expert hotness
observed across iterations in real-world scenarios, limiting
optimization quality.

3. The default algorithm periodically regenerates the expert placement
table. However, it generates the table for each individual layer, and
the new table does not account for correlations with the previous one;
these two factors collectively lead to nearly full-scale expert
reassignment.
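A toy reproduction of the default two-stage greedy on the numbers from point 1 (MMR then LPT, as described above; not the actual EPLB implementation). It reproduces the stated maximum per-device hotness of 232:

```python
import heapq

hotness = [600, 560, 120, 120, 20, 10, 10, 10]
num_devices, slots_per_device = 8, 2
total_replicas = num_devices * slots_per_device

# Stage a (MMR): grant extra replicas to the expert whose per-replica
# load is currently the highest, until all replica slots are used.
counts = [1] * len(hotness)
for _ in range(total_replicas - len(hotness)):
    i = max(range(len(hotness)), key=lambda e: hotness[e] / counts[e])
    counts[i] += 1

# Stage b (LPT): place replicas heaviest-first onto the least-loaded
# device that still has a free slot (2 slots per device here).
replicas = sorted((hotness[e] / counts[e]
                   for e in range(len(hotness))
                   for _ in range(counts[e])), reverse=True)
heap = [(0.0, d, slots_per_device) for d in range(num_devices)]
heapq.heapify(heap)
full = []
for load in replicas:
    while True:
        dev_load, d, free = heapq.heappop(heap)
        if free:
            break
        full.append((dev_load, d, free))
    heapq.heappush(heap, (dev_load + load, d, free - 1))

print(max(l for l, _, _ in heap + full))  # 232.0 -- vs. 205 for FlashLB
```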

## FlashLB Algorithm Principle
1. Joint Optimization
FlashLB achieves joint optimization of replica allotment and placement
through group-based decision-making. Each group gradually determines the
replica count and placement for a subset of experts, ensuring that the
expected inter-device load balance (considering both deployed and
pending expert replicas) is holistically optimized. To attain superior
load balancing, FlashLB employs tree search to expand the solution space
while integrating pruning and precompilation techniques for
acceleration, thereby delivering load balancing that is both
high-quality and practically efficient.

2. Multi-Shot Enhancement
FlashLB partitions each profiling interval (e.g., 1024 iterations) into
consecutive smaller sub-intervals (e.g., 16 iterations), each capturing
independent hotness measurements. It then performs multi-shot
optimization to co-optimize these sub-intervals simultaneously—enabling
adaptation to time-variant expert hotness while enhancing robustness.

3. Incremental Adjustment
To reduce the overhead of frequent expert re-deployment, FlashLB
introduces an incremental adjustment scheme operating at both
inter-layer and intra-layer levels:
a. Inter-Layer: Hotness variations are tracked at the layer level. Only
layers with fluctuations exceeding a predefined threshold trigger
re-computation of expert placement, avoiding unnecessary redeployment
for stable layers;
b. Intra-Layer (Optional): A lightweight incremental LPT algorithm
(LPT-Incremental) is applied. Instead of recomputing full placement for
all experts in a layer, it selectively adjusts only the hottest experts
or those with replica count changes, further reducing migration
overhead.

This incremental strategy significantly reduces adjustment costs while
maintaining balanced performance across layers and devices.

## Co-author:

Co-authored-by: Skywalker-EP <173723846@qq.com>

- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: zhanghw0354 <zhanghaiwencmss@139.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: rjg-lyh <83491835+rjg-lyh@users.noreply.github.com>
Co-authored-by: weichen <132029610+Pr0Wh1teGivee@users.noreply.github.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
2025-09-23 10:27:14 +08:00
offline893
76844eec78 Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently, dynamic expert balancing stops the world. Asynchronous expert
load balancing is preferable and avoids the following problems:

Host-bound latency: EPLB involves many CPU operations, such as running
the EPLB algorithm, creating P2P ops, and converting log2phy experts,
which together can take a long time on the CPU (~1 s).
Communication latency: the transfer time is high without NVLink. Because
the weight of one expert may be transferred to multiple new positions, a
single expert needs N send/recv calls, resulting in long latency. We
measured batch_isend_irecv taking more than 100 ms to transmit 16
experts' weights on an Ascend A2 server.

SwiftBalancer no longer stops the world: in our tests on NPU it costs
1-2 ms per layer while saving 5-8 ms of decode latency at ep_size = 64.
The following updates have been made:
1. Expert-distribution recording with lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python
operators.
3. A new EPLB algorithm that rebalances fewer experts with almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves (a code sketch follows the list):
![SwiftBalancer workflow](https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed)
1. Record the expert distribution during forward. We use expert_token_num
after dispatch instead of topk_ids, so the recorded tensor is much
smaller, reducing the cost of HBM recording and the add operator.
2. All-gather the expert distribution. All-gather is used instead of
all-reduce because it generates less traffic.
3. When num_iterations arrives, wake up the EPLB worker process with the
expert distribution and run the EPLB algorithm in the worker.
4. Generate the P2P send/recv ops in the worker, along with other
long-running CPU work such as log2phy conversion.
5. Launch batch_isend_irecv on an async stream before forward.
6. After forward, wait for batch_isend_irecv to finish, then update the
expert map and expert weights.
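A hedged sketch of steps 5 and 6 (names hypothetical; `torch.npu` stream APIs are assumed analogous to `torch.cuda`): the weight transfer is launched on a side stream before forward, overlaps with it, and the new placement is applied afterwards.

```python
import torch
import torch.distributed as dist

def forward_with_overlapped_eplb(model, batch, p2p_ops, side_stream,
                                 apply_new_placement):
    """Hypothetical flow: async expert-weight transfer overlapped with
    the forward pass, then the expert map/weights update (step 6)."""
    with torch.npu.stream(side_stream):           # torch.cuda.stream off Ascend
        reqs = dist.batch_isend_irecv(p2p_ops)    # step 5: async send/recv
    output = model(batch)                         # forward overlaps the transfer
    for req in reqs:
        req.wait()                                # step 6: wait for transfer
    apply_new_placement()                         # then swap map + weights
    return output
```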
### Co-author
Co-authored-by: raindaywhu <raindaywhu@163.com>
Co-authored-by: njuyuan <yuanjl19@smail.nju.edu.cn>
Co-authored-by: qmkakaxi <wjh1594260677@qq.com>
Co-authored-by: Skywalker-EP <173723846@qq.com>


- vLLM version: v0.10.2
- vLLM main: 567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00