shared_experts+router_experts merge all_reduce (improve TPOT by 5 ms) (#1395)

### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts skips the all_reduce
inside its MLP and waits until both shared_experts and router_experts
have finished; a single all_reduce is then applied to their combined output.
This helps in both prefill and decode whenever shared_experts and
router_experts would otherwise each perform their own all_reduce.
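
The idea rests on the linearity of a sum all_reduce: reducing the two partial outputs separately and then adding them gives the same result as adding them locally and reducing once. The following is a minimal sketch of that equivalence, assuming a sum reduction and simulating ranks with plain Python lists. All names in it (`all_reduce`, `forward_separate`, `forward_merged`) are hypothetical and are not taken from the code changed by this PR.

```python
# Toy simulation of tensor-parallel ranks with plain Python lists.
# Hypothetical names throughout; this is not the vllm-ascend implementation.

def all_reduce(per_rank):
    """Simulated sum all_reduce: every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def add(a, b):
    """Elementwise addition of two vectors."""
    return [x + y for x, y in zip(a, b)]

# Partial expert outputs on 2 ranks, hidden size 3 (made-up values).
shared_partial = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
routed_partial = [[10.0, 0.0, -1.0], [2.0, 4.0, 6.0]]

def forward_separate(shared, routed):
    # Baseline: each expert branch is reduced on its own (2 collectives).
    s = all_reduce(shared)
    r = all_reduce(routed)
    return [add(a, b) for a, b in zip(s, r)], 2

def forward_merged(shared, routed):
    # Merged: add the partial outputs locally, then reduce once (1 collective).
    combined = [add(a, b) for a, b in zip(shared, routed)]
    return all_reduce(combined), 1

out_sep, calls_sep = forward_separate(shared_partial, routed_partial)
out_merged, calls_merged = forward_merged(shared_partial, routed_partial)
```

Both paths produce the same per-rank result, but the merged path issues one collective instead of two, which is where the latency saving would be expected to come from.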
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main: 977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
ttanzhiqiang
2025-07-10 12:07:05 +08:00
committed by GitHub
parent 997f156a51
commit 60519c71bd
5 changed files with 32 additions and 7 deletions

@@ -21,7 +21,8 @@ for concurrency in "${concurrency_array[@]}"; do
 python /mnt/deepseek/vllm/benchmarks/benchmark_serving.py \
 --backend vllm \
 --trust-remote-code \
---model /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
+--model auto \
+--tokenizer /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
 --dataset-name random \
 --random-input-len 4096 \
 --random-output-len 1536 \