[Doc] Upgrade the multi-node tutorial model to deepseek-v3.1-w8a8 (#2553)

### What this PR does / why we need it?
Upgrade the multi-node tutorial model to `deepseek-v3.1-w8a8`.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main: de02b07db4

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Author: Li Wang (committed by GitHub)
Date: 2025-08-27 14:16:44 +08:00
Parent: 2bfbf9b9b3
Commit: 516e14ae6a


@@ -57,7 +57,7 @@ hccn_tool -i 0 -ping -g address 10.20.0.20
 ```
 ## Run with docker
-Assume you have two Atlas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3-w8a8` quantitative model across multi-node.
+Assume you have two Atlas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantitative model across multi-node.
 ```{code-block} bash
 :substitutions:
@@ -107,6 +107,7 @@ Before launch the inference server, ensure the following environment variables a
 nic_name="xxxx"
 local_ip="xxxx"
+export VLLM_USE_MODELSCOPE=True
 export HCCL_IF_IP=$local_ip
 export GLOO_SOCKET_IFNAME=$nic_name
 export TP_SOCKET_IFNAME=$nic_name
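The `nic_name`/`local_ip` placeholders in the hunk above must be filled in per node. As an illustration only (not part of the tutorial's commands), a minimal Python sketch for guessing a host's primary IP address, assuming standard kernel routing behavior:

```python
import socket

def primary_ip() -> str:
    """Best-effort guess of the host's primary IP address.

    A UDP connect toward a public address sends no packets, but lets us
    read back the local address the kernel would route from. Falls back
    to loopback when the host has no default route.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))  # no traffic is actually sent
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()

print(primary_ip())
```

On Ascend hosts the matching NIC name and IP should still be cross-checked with `hccn_tool`, as shown earlier in the tutorial.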
@@ -115,9 +116,9 @@ export OMP_PROC_BIND=false
 export OMP_NUM_THREADS=100
 export HCCL_BUFFSIZE=1024
-# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
+# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
 # If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
-vllm serve /root/.cache/ds_v3 \
+vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
 --host 0.0.0.0 \
 --port 8004 \
 --data-parallel-size 4 \
@@ -126,7 +127,7 @@ vllm serve /root/.cache/ds_v3 \
 --data-parallel-rpc-port 13389 \
 --tensor-parallel-size 4 \
 --seed 1024 \
---served-model-name deepseek_v3 \
+--served-model-name deepseek_v3.1 \
 --enable-expert-parallel \
 --max-num-seqs 16 \
 --max-model-len 32768 \
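As a sanity check on the parallel layout in the serve command above, a quick sketch using the flag values from the diff, assuming one NPU per rank:

```python
# Flag values taken from the serve command in the diff above.
data_parallel_size = 4
tensor_parallel_size = 4
nodes = 2
npus_per_node = 8  # Atlas 800 A2 (64G*8)

total_ranks = data_parallel_size * tensor_parallel_size
# The deployment covers both nodes exactly: 4 * 4 = 16 = 2 * 8 NPUs.
assert total_ranks == nodes * npus_per_node
print(f"{total_ranks} ranks across {nodes} nodes")
```

This is why the tutorial pairs `--data-parallel-size 4` with `--tensor-parallel-size 4` for two 8-NPU nodes.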
@@ -146,6 +147,7 @@ vllm serve /root/.cache/ds_v3 \
 nic_name="xxx"
 local_ip="xxx"
+export VLLM_USE_MODELSCOPE=True
 export HCCL_IF_IP=$local_ip
 export GLOO_SOCKET_IFNAME=$nic_name
 export TP_SOCKET_IFNAME=$nic_name
@@ -155,7 +157,7 @@ export OMP_NUM_THREADS=100
 export VLLM_USE_V1=1
 export HCCL_BUFFSIZE=1024
-vllm serve /root/.cache/ds_v3 \
+vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
 --host 0.0.0.0 \
 --port 8004 \
 --headless \
@@ -167,7 +169,7 @@ vllm serve /root/.cache/ds_v3 \
 --tensor-parallel-size 4 \
 --seed 1024 \
 --quantization ascend \
---served-model-name deepseek_v3 \
+--served-model-name deepseek_v3.1 \
 --max-num-seqs 16 \
 --max-model-len 32768 \
 --max-num-batched-tokens 4096 \
@@ -187,7 +189,7 @@ Once your server is started, you can query the model with input prompts:
 curl http://{ node0 ip:8004 }/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
-"model": "/root/.cache/ds_v3",
+"model": "deepseek_v3.1",
 "prompt": "The future of AI is",
 "max_tokens": 50,
 "temperature": 0
@@ -198,7 +200,8 @@ curl http://{ node0 ip:8004 }/v1/completions \
 For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
 ```shell
-vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
+export VLLM_USE_MODELSCOPE=true
+vllm bench serve --model vllm-ascend/DeepSeek-V3.1-W8A8 --served-model-name deepseek_v3.1 \
 --dataset-name random --random-input-len 128 --random-output-len 128 \
 --num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1
 ```
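Back-of-the-envelope numbers for the benchmark run above, derived only from the flag values (actual wall-clock time also depends on server latency):

```python
# Flag values from the `vllm bench serve` command above.
num_prompts = 200
request_rate = 1            # requests per second
input_len = output_len = 128

total_tokens = num_prompts * (input_len + output_len)
min_duration_s = num_prompts / request_rate  # time just to submit all requests

print(f"{total_tokens} total tokens, >= {min_duration_s:.0f}s to submit")
```

So at `--request-rate 1` the run processes roughly 51k random tokens and takes at least a few minutes to complete.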