shared_experts+router_experts merge all_reduce (improves TPOT by 5ms) (#1395)

### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts no longer performs an
all_reduce inside its MLP; instead, the all_reduce is deferred until both
shared_experts and router_experts have finished, so their combined output is
reduced in a single operation.
In both prefill and decode, merging the shared_experts and router_experts
all_reduce into a single collective yields a latency benefit.
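
A minimal sketch of the before/after pattern (the `moe_forward_*` functions and
expert callables are hypothetical, not the actual vllm-ascend code; an
initialized tensor-parallel process group is assumed). The merge is correct
because a sum all_reduce is linear, so reducing the two partial outputs once is
equivalent to reducing each branch separately:

```python
import torch.distributed as dist

def moe_forward_separate(hidden_states, shared_experts, router_experts):
    # Baseline: each branch all_reduces its partial result independently,
    # costing two collectives per MoE layer.
    shared_out = shared_experts(hidden_states)
    dist.all_reduce(shared_out)
    routed_out = router_experts(hidden_states)
    dist.all_reduce(routed_out)
    return shared_out + routed_out

def moe_forward_merged(hidden_states, shared_experts, router_experts):
    # Merged: shared_experts skips the all_reduce in its MLP; the partial
    # sums are added locally and reduced once, halving the all_reduce count
    # on this path in both prefill and decode.
    shared_out = shared_experts(hidden_states)
    routed_out = router_experts(hidden_states)
    out = shared_out + routed_out
    dist.all_reduce(out)
    return out
```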
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main: 977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Commit 60519c71bd (parent 997f156a51) by ttanzhiqiang, committed by GitHub on 2025-07-10 12:07:05 +08:00
5 changed files with 32 additions and 7 deletions

examples/run_dp_attention_etp16.sh

@@ -3,9 +3,10 @@ export TASK_QUEUE_ENABLE=1
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 source /usr/local/Ascend/nnal/atb/set_env.sh
 export ASCEND_LAUNCH_BLOCKING=0
-export VLLM_VERSION=0.9.0
+export VLLM_VERSION=0.9.1
 nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
+ --served-model-name auto \
 --quantization ascend \
 --trust-remote-code \
 --distributed-executor-backend=mp \

examples/run_dp_attention_etp16_benmark.sh

@@ -21,7 +21,8 @@ for concurrency in "${concurrency_array[@]}"; do
 python /mnt/deepseek/vllm/benchmarks/benchmark_serving.py \
 --backend vllm \
 --trust-remote-code \
---model /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
+--model auto \
+--tokenizer /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
 --dataset-name random \
 --random-input-len 4096 \
 --random-output-len 1536 \