[Feat] Load balance of tokens across experts in dummy_run (#3184)

### What this PR does / why we need it?

Because of the special input data used during the dummy run, the majority of tokens end up on DP0TP0, which leaves insufficient available KV cache on DP0TP0. This PR changes the `topk_ids` of the dummy_run input from all zeros to random values. It is a naive implementation of expert load balancing, intended to avoid accumulating too many tokens on a single rank.

### How was this patch tested?

model: DeepSeek-v3-w8a8

```bash
vllm serve DeepSeek-v3-w8a8 \
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 2 \
    --tensor-parallel-size 8 \
    --quantization ascend \
    --seed 1024 \
    --enforce-eager \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --disable-log-stats \
    --max-num-seqs 18 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --additional-config \
    '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
```

Results after enabling load balance:

- Available memory: **2728672256** -> **6771544064**
- KV Cache size: **38144** -> **95232** tokens

- vLLM version: v0.11.0

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
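To illustrate why all-zero `topk_ids` concentrate load, here is a minimal sketch in plain Python. It is not the vllm-ascend code: the expert count, rank count, token count, and the contiguous expert-to-rank mapping are all assumptions chosen for illustration, and `random.Random.randrange` stands in for `torch.randint_like(topk_ids, 0, global_num_experts)`, which likewise samples from the half-open range `[0, global_num_experts)`.

```python
import random
from collections import Counter

# Hypothetical sizes, chosen only for illustration.
NUM_EXPERTS = 16   # stands in for global_num_experts
NUM_RANKS = 4      # assumed number of expert-parallel ranks
TOP_K = 2          # experts selected per token
NUM_TOKENS = 1024  # tokens in the dummy batch


def tokens_per_rank(topk_ids, num_experts, num_ranks):
    """Count token-to-expert assignments per rank.

    Assumes experts are split into contiguous, equally sized groups,
    one group per rank (an assumption, not taken from the PR).
    """
    experts_per_rank = num_experts // num_ranks
    counts = Counter()
    for row in topk_ids:
        for expert_id in row:
            counts[expert_id // experts_per_rank] += 1
    return [counts.get(rank, 0) for rank in range(num_ranks)]


# Dummy-run input before the change: every token selects expert 0.
all_zero_ids = [[0] * TOP_K for _ in range(NUM_TOKENS)]

# Dummy-run input after the change: expert ids drawn uniformly at random.
# As with randint_like, a token may draw the same expert twice.
rng = random.Random(1024)
random_ids = [[rng.randrange(NUM_EXPERTS) for _ in range(TOP_K)]
              for _ in range(NUM_TOKENS)]

skewed = tokens_per_rank(all_zero_ids, NUM_EXPERTS, NUM_RANKS)
balanced = tokens_per_rank(random_ids, NUM_EXPERTS, NUM_RANKS)

print("all zeros:", skewed)    # [2048, 0, 0, 0]: everything lands on rank 0
print("random:   ", balanced)  # roughly even across the four ranks
```

With all-zero ids, every assignment goes to the rank that hosts expert 0; with random ids, the same number of assignments is spread across all ranks.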
2025-10-10 09:00:07 +08:00
parent 60b7c936c5
commit 6ae75933da
2 changed files with 10 additions and 6 deletions
--- a/vllm_ascend/ops/common_fused_moe.py
+++ b/vllm_ascend/ops/common_fused_moe.py
@@ -264,6 +264,10 @@ class AscendFusedMoE(FusedMoE):
        quantized_x_for_share, dynamic_scale_for_share = None, None

        forward_context = get_forward_context()
+
+        # Load balancing for token distribution among experts in dummy_run
+        # TODO: The community only considers load balancing when DP > 1.
+        # This approach may overlook some extreme scenarios.
        enable_force_load_balance = forward_context.in_profile_run

        forward_context = get_forward_context()
--- a/vllm_ascend/quantization/w8a8_dynamic.py
+++ b/vllm_ascend/quantization/w8a8_dynamic.py
@@ -217,6 +217,12 @@ class AscendW8A8DynamicFusedMoEMethod:
            e_score_correction_bias=e_score_correction_bias,
            global_num_experts=global_num_experts)

+        # this is a naive implementation for experts load balance so as
+        # to avoid accumulating too much tokens on a single rank.
+        # currently it is only activated when doing profile runs.
+        if enable_force_load_balance:
+            topk_ids = torch.randint_like(topk_ids, 0, global_num_experts)
+
        if self.use_aclgraph:
            moe_comm_method = get_forward_context().moe_comm_method
            return moe_comm_method.fused_experts(
@@ -232,12 +238,6 @@ class AscendW8A8DynamicFusedMoEMethod:
                expert_map=expert_map,
                dynamic_eplb=self.dynamic_eplb)

-        # this is a naive implementation for experts load balance so as
-        # to avoid accumulating too much tokens on a single rank.
-        # currently it is only activated when doing profile runs.
-        if enable_force_load_balance:
-            topk_ids = torch.randint_like(topk_ids, 0, global_num_experts)
-
        topk_weights = topk_weights.to(x.dtype)

        moe_comm_method = get_forward_context().moe_comm_method
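
The hunks in `w8a8_dynamic.py` move the `enable_force_load_balance` block from after the `if self.use_aclgraph:` branch to before it. Since that branch returns early, one reading of the diff is that the randomization previously never ran on the aclgraph path. The sketch below shows only that control-flow difference; the function names and the list-based `topk_ids` are hypothetical and not part of vllm-ascend.

```python
import random


def randomize_before_branch(topk_ids, use_fast_path, force_balance,
                            num_experts, rng):
    """Ordering after the move: randomize first, then branch."""
    if force_balance:
        topk_ids = [rng.randrange(num_experts) for _ in topk_ids]
    if use_fast_path:
        return topk_ids  # the early return sees the randomized ids
    return topk_ids


def randomize_after_branch(topk_ids, use_fast_path, force_balance,
                           num_experts, rng):
    """Ordering before the move: the early return skips randomization."""
    if use_fast_path:
        return topk_ids  # returns the untouched ids
    if force_balance:
        topk_ids = [rng.randrange(num_experts) for _ in topk_ids]
    return topk_ids


zeros = [0] * 64
rng = random.Random(1024)

old_order = randomize_after_branch(zeros, True, True, 8, rng)
new_order = randomize_before_branch(zeros, True, True, 8, rng)

print(old_order == zeros)  # True: the fast path never randomized the ids
print(any(new_order))      # True: the fast path now receives random ids
```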