# DeepSeek-V3.2
## Introduction
DeepSeek-V3.2 is a sparse attention model. Its overall architecture is similar to DeepSeek-V3.1, but it adds a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.
This document walks through the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
Refer to the [feature guide](../../user_guide/feature_guide/index.md) for the configuration of each feature.
## Environment Preparation
### Model Weight
- `DeepSeek-V3.2-Exp-W8A8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8)
- `DeepSeek-V3.2-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)
It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.
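For example, you can download the quantized weight with the ModelScope CLI. This is a minimal sketch; the model ID and target directory here are just the ones used in the serving commands later in this tutorial, so adjust them to the variant you actually deploy.

```shell
pip install modelscope
# Download DeepSeek-V3.2-W8A8 into the shared cache directory
modelscope download --model vllm-ascend/DeepSeek-V3.2-W8A8 \
    --local_dir /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8
```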
### Verify Multi-node Communication (Optional)
If you want to deploy a multi-node environment, verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
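As a quick sanity check (a minimal sketch; the linked guide describes the full procedure), you can query each NPU NIC's IP and ping a peer node's NPU NIC with `hccn_tool`. The device range and peer address below are placeholders for your own setup.

```shell
# Print the IP configured on each NPU NIC of the current node (16 devices on A3)
for i in $(seq 0 15); do hccn_tool -i $i -ip -g; done

# Ping a peer node's NPU NIC IP from NPU 0 (replace with the peer's address)
hccn_tool -i 0 -ping -g address 141.61.39.113
```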
### Installation
You can use our official docker image to run `DeepSeek-V3.2` directly.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
::::{tab-item} A2 series
:sync: A2
Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
:::::
In addition, if you don't want to use the Docker image above, you can also build everything from source:
- Install `vllm-ascend` from source, refer to [installation](../../installation.md); a minimal sketch is shown below.
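The following sketch assumes CANN and its environment are already set up as described in the installation guide; exact steps and version pins are documented there.

```shell
# Install vLLM (device-agnostic build) and then vLLM Ascend from source
git clone https://github.com/vllm-project/vllm.git
cd vllm && VLLM_TARGET_DEVICE=empty pip install -v -e . && cd ..

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend && pip install -v -e . && cd ..
```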
If you want to deploy a multi-node environment, you need to set up the environment on each node.
## Deployment
:::{note}
In this tutorial, we assume you have downloaded the model weight to `/root/.cache/`. Feel free to change it to your own path.
:::
### Single-node Deployment
- Quantized model `DeepSeek-V3.2-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16) node.
Run the following script to start online serving.
```shell
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
In PD-disaggregated deployments, `layer_sharding` is supported only on prefill/P nodes with `kv_role="kv_producer"`. Do not enable it on decode/D nodes or `kv_role="kv_both"` nodes.
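After the server finishes loading, you can confirm it is ready by listing the served models. This is a quick check; the port and served model name follow the command above.

```shell
curl http://localhost:8000/v1/models
# The JSON response should list "deepseek_v3_2" in its "data" array
```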
### Multi-node Deployment
- `DeepSeek-V3.2-w8a8`: requires at least 2 Atlas 800 A2 (64G × 8) nodes.
Run the following scripts on two nodes respectively.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
**Node0**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
**Node1**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
::::
::::{tab-item} A2 series
:sync: A2
**Node0**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
**Node1**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
::::
:::::
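In the multi-node setup above, only node0 exposes the API server (node1 runs with `--headless`). Once both nodes are up, you can verify the deployment from any machine that can reach node0; this quick check reuses the port from the commands above.

```shell
# Replace <node0_ip> with the node0 address used in the launch script
curl http://<node0_ip>:8077/v1/models
```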
### Prefill-Decode Disaggregation
This section shows how to deploy `DeepSeek-V3.2` in a multi-node environment with 1P1D (prefill-decode disaggregation) for better performance.
Before you start, please:
1. Prepare the script `launch_online_dp.py` on each node:
```python
import argparse
import multiprocessing
import os
import subprocess
import sys
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--dp-size",
type=int,
required=True,
help="Data parallel size."
)
parser.add_argument(
"--tp-size",
type=int,
default=1,
help="Tensor parallel size."
)
parser.add_argument(
"--dp-size-local",
type=int,
default=-1,
help="Local data parallel size."
)
parser.add_argument(
"--dp-rank-start",
type=int,
default=0,
help="Starting rank for data parallel."
)
parser.add_argument(
"--dp-address",
type=str,
required=True,
help="IP address for data parallel master node."
)
parser.add_argument(
"--dp-rpc-port",
type=str,
        default="12345",
help="Port for data parallel master node."
)
parser.add_argument(
"--vllm-start-port",
type=int,
default=9000,
help="Starting port for the engine."
)
return parser.parse_args()
args = parse_args()
dp_size = args.dp_size
tp_size = args.tp_size
dp_size_local = args.dp_size_local
if dp_size_local == -1:
dp_size_local = dp_size
dp_rank_start = args.dp_rank_start
dp_address = args.dp_address
dp_rpc_port = args.dp_rpc_port
vllm_start_port = args.vllm_start_port
def run_command(visible_devices, dp_rank, vllm_engine_port):
command = [
"bash",
"./run_dp_template.sh",
visible_devices,
str(vllm_engine_port),
str(dp_size),
str(dp_rank),
dp_address,
dp_rpc_port,
str(tp_size),
]
subprocess.run(command, check=True)
if __name__ == "__main__":
template_path = "./run_dp_template.sh"
if not os.path.exists(template_path):
print(f"Template file {template_path} does not exist.")
sys.exit(1)
processes = []
num_cards = dp_size_local * tp_size
for i in range(dp_size_local):
dp_rank = dp_rank_start + i
vllm_engine_port = vllm_start_port + i
visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
process = multiprocessing.Process(target=run_command,
args=(visible_devices, dp_rank,
vllm_engine_port))
processes.append(process)
process.start()
for process in processes:
process.join()
```
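For reference, `launch_online_dp.py` invokes `run_dp_template.sh` once per local DP rank and passes the following positional arguments, which the per-node scripts below consume as `$1` to `$7`:

```shell
# $1  visible NPU devices for this DP rank (ASCEND_RT_VISIBLE_DEVICES)
# $2  engine port              (--port)
# $3  data parallel size       (--data-parallel-size)
# $4  data parallel rank       (--data-parallel-rank)
# $5  DP master address        (--data-parallel-address)
# $6  DP RPC port              (--data-parallel-rpc-port)
# $7  tensor parallel size     (--tensor-parallel-size)
```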
2. Prepare the script `run_dp_template.sh` on each node:
1. Prefill node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.105 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32560 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
2. Prefill node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.113 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32560 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
3. Decode node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.117 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
#Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 12 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--async-scheduling \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}' \
--additional-config '{"recompute_scheduler_enable" : true}'
```
4. Decode node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.181 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
#Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 12 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--async-scheduling \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}' \
--additional-config '{"recompute_scheduler_enable" : true}'
```
Once the preparation is done, you can start the servers with the following commands on each node:
Refer to [Distributed DP Server With Large-Scale Expert Parallelism](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/large_scale_ep.html) for the detailed startup procedure.
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
### Request Forwarding
To set up request forwarding, run the following script on any machine. You can get the proxy program in the repository's examples: [load_balance_proxy_layerwise_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)
```shell
unset http_proxy
unset https_proxy
python load_balance_proxy_layerwise_server_example.py \
--port 8000 \
--host 141.61.39.105 \
--prefiller-hosts \
141.61.39.105 \
141.61.39.113 \
--prefiller-ports \
9100 \
9100 \
--decoder-hosts \
141.61.39.117 \
141.61.39.117 \
141.61.39.117 \
141.61.39.117 \
141.61.39.181 \
141.61.39.181 \
141.61.39.181 \
141.61.39.181 \
--decoder-ports \
9100 9101 9102 9103 \
    9100 9101 9102 9103
```
## Functional Verification
Once your server is started, you can query the model with input prompts:
```shell
curl http://<node0_ip>:<port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
```
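Alternatively, you can query the server with the OpenAI Python client. This is a minimal sketch, assuming `pip install openai`; fill in the address, port, and served model name of your own deployment.

```python
from openai import OpenAI

# Point the client at the vLLM server (the API key is not validated by vLLM)
client = OpenAI(base_url="http://<node0_ip>:<port>/v1", api_key="EMPTY")

resp = client.completions.create(
    model="deepseek_v3_2",
    prompt="The future of AI is",
    max_tokens=50,
    temperature=0,
)
print(resp.choices[0].text)
```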
## Accuracy Evaluation
Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the accuracy results.
### Using Language Model Evaluation Harness
As an example, take the `gsm8k` dataset as the test dataset and run an accuracy evaluation of `DeepSeek-V3.2-W8A8` in online mode.
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the accuracy results.
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
The performance result is:

- **Hardware**: A3-752T, 4 nodes
- **Deployment**: 1P1D, prefill node: DP2 + TP16, decode node: DP8 + TP4
- **Input/Output**: 64k/3k
- **Performance**: 533 TPS, TPOT 32 ms
### Using vLLM Benchmark
Run a performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
- `latency` : Benchmark the latency of a single batch of requests.
- `serve` : Benchmark the online serving throughput.
- `throughput` : Benchmark offline inference throughput.
Take `serve` as an example and run the command as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
## Function Call
The function call feature is supported from v0.13.0rc1 onwards. Please use the latest version.
Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
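As a quick illustration (a minimal sketch; the exact serving flags required for tool calling are described in the guide linked above), a tool-calling request follows the standard OpenAI chat completions format. The tool definition below is a hypothetical example.

```shell
curl http://<node0_ip>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "messages": [{"role": "user", "content": "What is the weather like in Beijing?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }],
        "tool_choice": "auto"
    }'
```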