# Expert Load Balance (EPLB)
## Overview
Expert load balancing is essential for serving MoE models efficiently. Redistributing experts during inference normally requires stop-the-world operations, which hurt TTFT (Time To First Token) and TPOT (Time Per Output Token). SwiftBalancer performs expert load balancing asynchronously, moving experts with near-zero overhead so that service continues uninterrupted.
## Motivation
Dynamic expert balancing previously stopped the world. Asynchronous expert load balancing avoids two problems:

- Host-bound latency: EPLB involves many CPU operations, such as running the EPLB algorithm, creating P2P ops, and converting the logical-to-physical (`log2phy`) expert map, which together can take on the order of 1 s.
- Communication latency: weight transfer is expensive without a high-speed interconnect such as NVLink. An expert's weights may be transferred to multiple new positions, requiring N send/recv operations per expert and therefore long latency. In our tests, `batch_isend_irecv` took more than 100 ms to transfer the weights of 16 experts on an Ascend A2 server.

SwiftBalancer no longer stops the world: in our NPU tests it costs 1-2 ms per layer while saving 5-8 ms of decode latency at `ep_size = 64`. It introduces:

1. Expert distribution recording at lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python operations.
3. A new EPLB algorithm that rebalances fewer experts while achieving almost the same effect.

## Design
The EPLB logic is being gradually migrated to the vLLM community as a generalized design; see the RFC at https://github.com/vllm-project/vllm/issues/22246.

<img width="801" height="302" alt="SwiftBalancer workflow" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" />

The overall workflow:

1. Record the expert distribution during the forward pass. We use `expert_token_num` after dispatch instead of `topk_ids`, which yields a much smaller tensor and reduces the cost of HBM recording and the add operator.
2. All-gather the expert distributions. All-gather is used instead of all-reduce because it produces less traffic here.
3. After the configured number of iterations, wake the EPLB worker process with the gathered distribution and run the EPLB algorithm in that worker.
4. Generate the P2P send/recv ops in the worker as well, since this and other operations such as the `log2phy` conversion take significant CPU time.
5. Launch the batched isend/irecv on an async stream before the forward pass.
6. After the forward pass, wait for the batched send/recv to finish, then update the expert map and expert weights.
## EPLB Effects
- Reduced Latency: Dynamically balances expert loads to minimize TTFT and TPOT by distributing workloads evenly across experts.
- Enhanced Throughput: Optimizes NPU utilization, increasing token generation speed under high-concurrency scenarios.
- Zero-Overhead Movement: Expert redistribution occurs asynchronously without interrupting ongoing inference requests.
- Adaptive Scaling: Automatically adjusts to workload fluctuations while maintaining stable performance.
- Fault Tolerance: Redundant expert placement ensures system resilience during hardware failures.
## Supported Scenarios
### Models
DeepSeek-V3 / V3.1 / R1, Qwen3-MoE
### MoE Quantization Types
W8A8-dynamic
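The W8A8-dynamic path assumes an already-quantized checkpoint. As a hedged sketch (the model path is a placeholder, and the `--quantization ascend` flag is taken from the vllm-ascend quantization guide; verify it for your version), such a model would be served like this before adding the EPLB options shown below:
```shell
# Hedged sketch: assumes a pre-quantized W8A8 checkpoint and the
# "ascend" quantization backend; combine with the EPLB flags shown below.
vllm serve /path/to/Qwen3-235B-A22B-W8A8 \
  --quantization ascend \
  --tensor-parallel-size 16 \
  --enable-expert-parallel
```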
## How to Use EPLB
### Dynamic EPLB
Dynamic EPLB requires setting the environment variable `export PYTHONOPTIMIZE=1` so that the EPLB worker can obtain the context of the vLLM process (included in the example below). Enable dynamic balancing through `--additional-config`, and adjust `num_iterations_eplb_update` and `num_wait_worker_iterations` to match your workload patterns.
```shell
export PYTHONOPTIMIZE=1
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --additional-config '{
    "dynamic_eplb": true,
    "num_iterations_eplb_update": 400,
    "num_wait_worker_iterations": 30
  }'
```
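For fluctuating traffic, the tuning guidance under Critical Considerations below suggests a shorter update interval (100-200 iterations). A hedged variant of the command above; the value 150 is illustrative, not a recommendation from this guide:
```shell
# Illustrative tuning for fluctuating traffic: shorter update interval,
# wait iterations kept at the recommended minimum of 30.
export PYTHONOPTIMIZE=1
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --additional-config '{
    "dynamic_eplb": true,
    "num_iterations_eplb_update": 150,
    "num_wait_worker_iterations": 30
  }'
```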
### Static EPLB
#### Initial Setup (Record Expert Map)
Generate the initial expert distribution map using `expert_map_record_path`. This creates a baseline configuration for future deployments.
```shell
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --additional-config '{
    "expert_map_record_path": "/path/to/eplb.json",
    "init_redundancy_expert": 16,
    "num_iterations_eplb_update": 400,
    "num_wait_worker_iterations": 30
  }'
```
#### Subsequent Deployments (Use Recorded Map)
Load the pre-recorded expert map via `expert_map_path` for consistent performance. This avoids recalculating distributions at runtime.
```shell
vllm serve Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --additional-config '{
    "expert_map_path": "/path/to/eplb.json"
  }'
```
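Before pointing `expert_map_path` at a recorded file, it is worth sanity-checking the JSON, as Critical Considerations below also advises. A minimal sketch using `jq`; the path is a placeholder, and the map's internal schema depends on the vllm-ascend version, so this only checks well-formedness and the top-level shape:
```shell
# Fail fast if the recorded map is not well-formed JSON.
jq empty /path/to/eplb.json && echo "eplb.json parses cleanly"
# Peek at the top-level structure (schema varies by version).
jq type /path/to/eplb.json
```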
## Critical Considerations
1. Parameter Tuning:
   - `num_iterations_eplb_update`: higher values (e.g., 400+) suit stable workloads; lower values (e.g., 100-200) suit fluctuating traffic.
   - `num_wait_worker_iterations`: should be ≥ 30 to avoid premature balancing during startup.
   - `init_redundancy_expert`: must match the tensor-parallel size (e.g., 16 for 16 NPUs) to ensure sufficient redundancy.
2. Hardware Requirements:
   - Ensure that all NPUs have identical memory capacity and compute capability.
   - Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).
3. Model Compatibility:
   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22B) are compatible.
   - Verify that the model architecture supports dynamic expert routing via `--enable-expert-parallel`.
4. Gating Configuration:
   - When `gate_eplb=true`, validate that the gating mechanism can handle expert movement without routing errors.
   - Test with synthetic workloads before production deployment.
5. Monitoring & Validation:
   - Track the expert load balance ratio, P99 TTFT, average TPOT, and NPU utilization.
   - Monitor runtime metrics during serving to detect imbalances (see the sketch after this list).
   - Always verify the expert map JSON structure before loading it (validate with `jq` or similar tools).
6. Startup Behavior:
   - Initial requests may see higher latency during the first balancing cycle (typically 1-2 minutes).
   - Avoid sudden traffic spikes during the warm-up phase.
7. Common Pitfalls:
   - A `tensor-parallel-size` that does not match the actual NPU count causes resource underutilization.
   - Using `expert_map_path` without generating the map first causes runtime errors.
   - Setting `init_redundancy_expert` larger than the number of available NPUs causes system failure.
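As a concrete instance of the monitoring advice in item 5 above, a minimal sketch; it assumes the default server port 8000 and vLLM's Prometheus-style `/metrics` endpoint, and the exact metric names may vary by version:
```shell
# Spot-check serving latency from the metrics endpoint
# (default port 8000; exact metric names vary by vLLM version).
curl -s http://localhost:8000/metrics \
  | grep -E "time_to_first_token|time_per_output_token"
```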