[Feat] Load balance of tokens across experts in dummy_run (#3184)

### What this PR does / why we need it?
During the dummy run the input data is a special case: `topk_ids` is all
zeros, so the majority of tokens are distributed on DP0TP0, which leaves
insufficient available KV cache on DP0TP0.
This PR changes the `topk_ids` of the dummy_run input from all zeros to
random values.
This is a naive implementation of expert load balancing, intended to
avoid accumulating too many tokens on a single rank.
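
The effect of replacing all-zero `topk_ids` with random ones can be sketched in plain Python. This is only an illustration, not the code in this patch: the patch itself calls `torch.randint_like(topk_ids, 0, global_num_experts)`, while the sketch below uses the standard `random` module as a stand-in. The helper names and the sizes (`NUM_TOKENS`, `TOP_K`, `NUM_EXPERTS`) are hypothetical.

```python
import random
from collections import Counter

# Hypothetical sizes chosen for illustration only.
NUM_TOKENS, TOP_K, NUM_EXPERTS = 1024, 8, 16


def dummy_topk_ids(num_tokens, top_k, num_experts, balanced, rng):
    """Build a num_tokens x top_k table of expert ids for a dummy run."""
    if not balanced:
        # Old behaviour: every slot points at expert 0.
        return [[0] * top_k for _ in range(num_tokens)]
    # New behaviour: uniform ids in [0, num_experts), upper bound exclusive,
    # mirroring the bounds passed to torch.randint_like in the patch.
    # As with randint_like, one token may draw the same expert twice.
    return [[rng.randrange(num_experts) for _ in range(top_k)]
            for _ in range(num_tokens)]


def tokens_per_expert(topk_ids, num_experts):
    """Count how many (token, slot) assignments each expert receives."""
    counts = Counter(e for row in topk_ids for e in row)
    return [counts.get(e, 0) for e in range(num_experts)]


rng = random.Random(1024)
zeros_load = tokens_per_expert(
    dummy_topk_ids(NUM_TOKENS, TOP_K, NUM_EXPERTS, False, rng), NUM_EXPERTS)
random_load = tokens_per_expert(
    dummy_topk_ids(NUM_TOKENS, TOP_K, NUM_EXPERTS, True, rng), NUM_EXPERTS)

print("max load, all-zero ids:", max(zeros_load))
print("max load, random ids:  ", max(random_load))
```

With all-zero ids every assignment lands on expert 0, so whichever rank hosts that expert absorbs the whole dummy batch. With random ids the assignments are spread across all experts.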

### How was this patch tested?
model: DeepSeek-v3-w8a8
```bash
vllm serve DeepSeek-v3-w8a8 \
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 2 \
    --tensor-parallel-size 8 \
    --quantization ascend \
    --seed 1024 \
    --enforce-eager \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --disable-log-stats \
    --max-num-seqs 18 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --additional-config \
    '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}' 
```

After enabling load balance:

Available memory: **2728672256** -> **6771544064**
KV Cache size: **38144** -> **95232** tokens


- vLLM version: v0.11.0

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
MengLong Chen
2025-10-10 09:00:07 +08:00
committed by GitHub
parent 60b7c936c5
commit 6ae75933da
2 changed files with 10 additions and 6 deletions

@@ -264,6 +264,10 @@ class AscendFusedMoE(FusedMoE):
quantized_x_for_share, dynamic_scale_for_share = None, None
forward_context = get_forward_context()
# Load balancing for token distribution among experts in dummy_run
# TODO: The community only considers load balancing when DP > 1.
# This approach may overlook some extreme scenarios.
enable_force_load_balance = forward_context.in_profile_run
forward_context = get_forward_context()

@@ -217,6 +217,12 @@ class AscendW8A8DynamicFusedMoEMethod:
e_score_correction_bias=e_score_correction_bias,
global_num_experts=global_num_experts)
# this is a naive implementation for experts load balance so as
# to avoid accumulating too much tokens on a single rank.
# currently it is only activated when doing profile runs.
if enable_force_load_balance:
topk_ids = torch.randint_like(topk_ids, 0, global_num_experts)
if self.use_aclgraph:
moe_comm_method = get_forward_context().moe_comm_method
return moe_comm_method.fused_experts(
@@ -232,12 +238,6 @@ class AscendW8A8DynamicFusedMoEMethod:
expert_map=expert_map,
dynamic_eplb=self.dynamic_eplb)
# this is a naive implementation for experts load balance so as
# to avoid accumulating too much tokens on a single rank.
# currently it is only activated when doing profile runs.
if enable_force_load_balance:
topk_ids = torch.randint_like(topk_ids, 0, global_num_experts)
topk_weights = topk_weights.to(x.dtype)
moe_comm_method = get_forward_context().moe_comm_method