### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts no longer performs its own all_reduce inside the MLP; instead, a single all_reduce is issued after both shared_experts and router_experts have finished. In both prefill and decode, merging the shared_experts and router_experts all_reduce into one call reduces communication and yields a performance gain.
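A minimal sketch (plain Python, with a hypothetical `all_reduce_sum` helper simulating the collective) of why merging the two calls is safe: all-reduce is an elementwise sum across ranks, so reducing the locally summed shared + routed outputs once gives the same result as reducing each separately, at half the communication cost.

```python
def all_reduce_sum(per_rank_tensors):
    """Simulate an all-reduce: every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank_tensors)]
    return [list(total) for _ in per_rank_tensors]

# Hypothetical per-rank partial outputs (2 ranks, 2 elements each).
shared = [[1.0, 2.0], [3.0, 4.0]]   # shared_experts output on rank0, rank1
routed = [[0.5, 0.5], [1.5, 1.5]]   # router_experts output on rank0, rank1

# Unmerged: two all-reduce calls (one inside the shared-expert MLP),
# then add the reduced results on each rank.
unmerged = [
    [s + r for s, r in zip(sr, rr)]
    for sr, rr in zip(all_reduce_sum(shared), all_reduce_sum(routed))
]

# Merged: add locally first, then a single all-reduce afterwards.
merged = all_reduce_sum(
    [[s + r for s, r in zip(sr, rr)] for sr, rr in zip(shared, routed)]
)

assert merged == unmerged  # identical results, one collective instead of two
```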
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```bash
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
```
- vLLM version: v0.9.1
- vLLM main: 977180c912
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
```bash
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1

nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
    --served-model-name auto \
    --quantization ascend \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96 &> run.log &
disown
```