shared_experts+router_experts merge all_reduce (improve TTOP by 5 ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts no longer performs its own
all_reduce inside the MLP; instead, the all_reduce is deferred until the
shared_experts and router_experts outputs have been combined, so a single
all_reduce covers both.
In both prefill and decode, merging the shared_experts and router_experts
all_reduce into one collective yields a benefit.
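The merged-reduce idea can be sketched as below. This is a minimal illustration using `torch.distributed`, not the actual vLLM-Ascend implementation; the function name `moe_forward_merged_allreduce` and the expert callables are hypothetical, and the `all_reduce` hook is parameterized only so the sketch can run outside a distributed setup.

```python
import torch
import torch.distributed as dist


def moe_forward_merged_allreduce(hidden_states, shared_experts, router_experts,
                                 all_reduce=None):
    """Combine shared- and routed-expert partial outputs, then reduce once.

    Hypothetical sketch: each rank holds a partial sum for both expert
    paths, and a single collective replaces the two per-path all_reduce
    calls the unmerged code would issue.
    """
    if all_reduce is None:
        all_reduce = dist.all_reduce  # default to the real collective

    # Shared experts produce a partial output WITHOUT the usual
    # per-module all_reduce at the end of the MLP.
    shared_out = shared_experts(hidden_states)

    # Routed experts likewise produce an unreduced partial output.
    routed_out = router_experts(hidden_states)

    # Merge first, then issue one all_reduce over the combined tensor.
    out = shared_out + routed_out
    all_reduce(out)  # in-place sum across tensor-parallel ranks
    return out
```

With tensor parallelism this halves the number of collectives per MoE layer on the critical path, which is where the latency saving comes from.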
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main: 977180c912
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
```diff
@@ -3,9 +3,10 @@ export TASK_QUEUE_ENABLE=1
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 source /usr/local/Ascend/nnal/atb/set_env.sh
 export ASCEND_LAUNCH_BLOCKING=0
 export VLLM_VERSION=0.9.0
+export VLLM_VERSION=0.9.1

 nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
     --served-model-name auto \
     --quantization ascend \
     --trust-remote-code \
     --distributed-executor-backend=mp \
```
```diff
@@ -21,7 +21,8 @@ for concurrency in "${concurrency_array[@]}"; do
     python /mnt/deepseek/vllm/benchmarks/benchmark_serving.py \
         --backend vllm \
         --trust-remote-code \
-        --model /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
+        --model auto \
+        --tokenizer /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
         --dataset-name random \
         --random-input-len 4096 \
         --random-output-len 1536 \
```