Dynamic Expert Load Balance with Zero-like-overhead (#2956)

### Motivation

Currently, dynamic expert balancing stops the world. Asynchronous expert load balancing avoids the following problems:

- Host-bound latency: EPLB involves many CPU operations, such as the EPLB algorithm, creating p2p ops, and log2phy expert conversion, which take a long CPU time (~1s).
- Communication latency: Transfer time is high without NVLink. Because the weights of one expert may be transferred to multiple new positions, a single expert can require N send/recv operations, which results in long latency. We measured that batch_isend_irecv costs more than 100ms to transmit the weights of 16 experts on an Ascend A2 server.

SwiftBalancer no longer stops the world. In our tests on NPU, it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64.

The following updates have been made:

1. Expert distribution recording with lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python operators.
3. A new EPLB algorithm that rebalances fewer experts with almost the same effect.

### Proposed Change

We will gradually migrate the EPLB logic to the vLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246

The overall workflow involves:

<img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" />

1. Record the expert distribution during the forward pass. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor and reduces the cost of HBM recording and the add operator.
2. All-gather the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume.
3. Wake up the eplb worker process with the expert distribution when num_iterations is reached. Run the eplb algorithm in the eplb worker.
4. Generate the p2p send/recv ops and run other operators such as log2phy, which take a long CPU time.
5. Launch ibatch_send_recv in async_stream before forward.
6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights.

### Co-author

Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com

- vLLM version: v0.10.2
- vLLM main: 567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
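
The six-step workflow in this commit message can be illustrated with a small scheduling simulation. This is only a sketch under stated assumptions, not the code added by this PR: `toy_rebalance`, `run`, and the event names are hypothetical, a thread pool stands in for the eplb worker process, and the "rebalancing" is a trivial sort by load rather than the real EPLB algorithm. It only shows how the two iteration counters could interact.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy values for illustration; the documented examples use 400 and 30.
NUM_ITERATIONS_EPLB_UPDATE = 4
NUM_WAIT_WORKER_ITERATIONS = 2


def toy_rebalance(load):
    """Hypothetical stand-in for the EPLB algorithm: order experts by load."""
    return sorted(range(len(load)), key=lambda e: -load[e])


def run(num_steps, tokens_per_step):
    """Simulate forward iterations with an asynchronous rebalancing worker."""
    num_experts = len(tokens_per_step[0])
    load = [0] * num_experts              # accumulated per-expert token counts
    expert_map = list(range(num_experts))
    pending = None                        # future holding the worker's result
    wake_step = None
    events = []
    with ThreadPoolExecutor(max_workers=1) as worker:
        for step in range(1, num_steps + 1):
            # The worker result is consumed only after the wait window elapsed.
            ready = (pending is not None
                     and step - wake_step >= NUM_WAIT_WORKER_ITERATIONS)
            new_map = pending.result() if ready else None
            if ready:
                # Step 5: the weight transfer would be launched before forward.
                events.append((step, "launch_transfer"))

            # Step 1: forward pass records how many tokens each expert received.
            counts = tokens_per_step[(step - 1) % len(tokens_per_step)]
            for expert, n in enumerate(counts):
                load[expert] += n

            if ready:
                # Step 6: after forward, apply the new expert map.
                expert_map = new_map
                pending = None
                events.append((step, "update_map"))

            if pending is None and step % NUM_ITERATIONS_EPLB_UPDATE == 0:
                # Steps 2-4: hand the recorded load to the worker, off the hot path.
                pending = worker.submit(toy_rebalance, list(load))
                wake_step = step
                load = [0] * num_experts
                events.append((step, "wake_worker"))
    return expert_map, events


if __name__ == "__main__":
    final_map, log = run(10, [[1, 5, 2, 0]])
    print(final_map)
    for entry in log:
        print(entry)
```

In this sketch the forward loop never waits on the worker unless the wait window is too short, which is the role `num_wait_worker_iterations` plays in the description above.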
2025-09-17 10:36:43 +08:00
parent ae758dda05
commit 76844eec78
30 changed files with 2891 additions and 47 deletions
--- a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
+++ b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
@@ -0,0 +1,45 @@
+# Swift Balancer
+
+## Overview
+Expert rebalancing is essential when serving MoE models. However, changing the expert placement dynamically with a stop-the-world update degrades TTFT and TPOT.
+Asynchronous expert load balancing is a better choice.
+SwiftBalancer supports dynamic expert load balancing with near-zero-overhead expert movement.
+
+## Design
+
+![img.png](images/eplb_img.png)
+
+The overall workflow involves:
+1. Record the expert distribution during the forward pass. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor and reduces the cost of HBM
+   recording and the add operator.
+2. All-gather the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume.
+3. Wake up the eplb worker process with the expert distribution when num_iterations is reached. Run the eplb algorithm in the eplb worker.
+4. Generate the p2p send/recv ops and run other operators such as log2phy, which take a long CPU time.
+5. Launch ibatch_send_recv in async_stream before forward.
+6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights.
+
+Our profiling shows that expert transfer is hidden in the bubble between forward iterations. The CPU time of the eplb algorithm and other Python operators such as log2phy
+is hidden as well, because they run in the eplb worker process.
+
+## Config Params
+
+Currently, Swift Balancer reduces TPOT by 5 ms with an EP size of 64, while expert movement costs less than 2 ms per layer.
+
+The following parameters are added for eplb:
+- "dynamic_eplb": true -- enable dynamic eplb.
+- "num_iterations_eplb_update": 400 -- number of forward iterations after which eplb begins.
+- "gate_eplb": true -- eplb updates only once; false by default.
+- "num_wait_worker_iterations": 30 -- number of forward iterations to wait for the eplb worker to finish its CPU task. In our tests, the default value of 30 covers most cases.
+- "expert_map_record_path" -- when dynamic eplb completes, save the current expert load heatmap to the specified path.
+- "init_redundancy_expert" -- specify the redundant experts at initialization.
+
+## Examples
+### Dynamic eplb
+Enable dynamic eplb and specify the trigger rounds.
+--additional-config '{ "dynamic_eplb":true,"num_iterations_eplb_update":400, "gate_eplb":true, "num_wait_worker_iterations":30}'
+### Record expert map for static eplb
+Specify the path for the static eplb initialization file.
+--additional-config '{ "expert_map_record_path": "/xx/xx.json", "init_redundancy_expert": 16, "dynamic_eplb":true,"num_iterations_eplb_update":400, "gate_eplb":true, "num_wait_worker_iterations":30}'
+### Static eplb
+If an expert map has been recorded, enable static eplb with the expert map path.
+--additional-config '{ "expert_map_path": "/xx/xx.json"}'
--- a/docs/source/user_guide/feature_guide/images/eplb_img.png
+++ b/docs/source/user_guide/feature_guide/images/eplb_img.png
--- a/docs/source/user_guide/feature_guide/index.md
+++ b/docs/source/user_guide/feature_guide/index.md
@@ -10,4 +10,5 @@ quantization
 sleep_mode
 structured_output
 lora
+eplb_swift_balancer
 :::
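
The `--additional-config` values documented in eplb_swift_balancer.md are hand-written JSON, where a missing quote is easy to introduce. As a minimal sketch, the argument can instead be generated from Python keyword arguments so that the payload is always valid JSON. The helper name `additional_config_arg` is hypothetical and not part of vLLM or this PR; only the option names are taken from the documentation above.

```python
import json
import shlex


def additional_config_arg(**options):
    """Build a shell-safe --additional-config argument from keyword options."""
    # json.dumps guarantees well-formed JSON (quotes, commas, true/false).
    payload = json.dumps(options, separators=(",", ":"))
    # shlex.quote wraps the payload so the shell passes it as one argument.
    return "--additional-config " + shlex.quote(payload)


if __name__ == "__main__":
    print(additional_config_arg(
        dynamic_eplb=True,
        num_iterations_eplb_update=400,
        gate_eplb=True,
        num_wait_worker_iterations=30,
    ))
```

Splitting the result with `shlex.split` and loading the second element with `json.loads` gives a quick round-trip check before launching the server.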