shared_experts+router_experts merge all_reduce (improve TTOP by 5 ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts no longer performs its own
all_reduce inside the MLP; instead, the all_reduce is deferred until the
shared_experts and router_experts outputs have been combined, so a single
all_reduce covers both.
In both prefill and decode, merging the shared_experts and router_experts
all_reduce into one collective yields a benefit.
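The merged-reduce idea can be sketched as below. This is a minimal illustration using `torch.distributed`, not the actual vLLM-Ascend implementation; the function name `moe_forward_merged_allreduce` and the expert callables are hypothetical, and the `all_reduce` hook is parameterized only so the sketch can run outside a distributed setup.

```python
import torch
import torch.distributed as dist


def moe_forward_merged_allreduce(hidden_states, shared_experts, router_experts,
                                 all_reduce=None):
    """Combine shared- and routed-expert partial outputs, then reduce once.

    Hypothetical sketch: each rank holds a partial sum for both expert
    paths, and a single collective replaces the two per-path all_reduce
    calls the unmerged code would issue.
    """
    if all_reduce is None:
        all_reduce = dist.all_reduce  # default to the real collective

    # Shared experts produce a partial output WITHOUT the usual
    # per-module all_reduce at the end of the MLP.
    shared_out = shared_experts(hidden_states)

    # Routed experts likewise produce an unreduced partial output.
    routed_out = router_experts(hidden_states)

    # Merge first, then issue one all_reduce over the combined tensor.
    out = shared_out + routed_out
    all_reduce(out)  # in-place sum across tensor-parallel ranks
    return out
```

With tensor parallelism this halves the number of collectives per MoE layer on the critical path, which is where the latency saving comes from.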
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main: 977180c912
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
```diff
@@ -3,9 +3,10 @@ export TASK_QUEUE_ENABLE=1
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 source /usr/local/Ascend/nnal/atb/set_env.sh
 export ASCEND_LAUNCH_BLOCKING=0
 export VLLM_VERSION=0.9.0
+export VLLM_VERSION=0.9.1

 nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
     --served-model-name auto \
     --quantization ascend \
     --trust-remote-code \
     --distributed-executor-backend=mp \
```
```diff
@@ -21,7 +21,8 @@ for concurrency in "${concurrency_array[@]}"; do
     python /mnt/deepseek/vllm/benchmarks/benchmark_serving.py \
         --backend vllm \
         --trust-remote-code \
-        --model /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
+        --model auto \
+        --tokenizer /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
         --dataset-name random \
         --random-input-len 4096 \
         --random-output-len 1536 \
```