[Doc][Misc] Restructure tutorial documentation (#6501)

### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into
three categories: Models, Features, and Hardware. This improves the
organization and navigation of the tutorials, making it easier for users
to find relevant information.

- The single `tutorials/index.md` is split into three separate index
files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective
new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial
sections.

This change makes the documentation structure more logical and scalable
for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial
documentation pages. Users following old links to tutorials will
encounter broken links. It is recommended to set up redirects if the
documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built
and reviewed locally to ensure all links are correct and the pages
render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

@@ -0,0 +1,15 @@
# Feature Tutorials
This section provides tutorials for different features of vLLM Ascend.
:::{toctree}
:caption: Feature Tutorials
:maxdepth: 1
pd_colocated_mooncake_multi_instance
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
long_sequence_context_parallel_single_node
long_sequence_context_parallel_multi_node
suffix_speculative_decoding
ray
:::


@@ -0,0 +1,371 @@
# Long-Sequence Context Parallel (Deepseek)
## Getting Started
:::{note}
The context parallel feature is currently only supported on Atlas A3 devices; Atlas A2 support will be added in the future.
:::
vLLM Ascend now supports long sequences with context parallel options. This guide walks through the steps to verify these features with constrained resources.
Taking the DeepSeek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy the "1P1D" architecture. The P node is deployed across multiple machines, while the D node is deployed on a single machine. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder server is 192.0.0.3 (decoder 1). On each server, use 8 NPUs (16 chips) to deploy one service instance. In this example, we enable the context parallel feature on the P node to improve TTFT. Although enabling DCP on the D node can reduce memory usage, it would introduce additional communication and small-operator overhead, so we do not enable DCP on the D node.
## Environment Preparation
### Model Weight
- `DeepSeek-V3.1_w8a8mix_mtp` (quantized version with mix MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.
### Verify Multi-node Communication
Refer to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication) to verify multi-node communication.
### Installation
You can use our official docker image to run `DeepSeek-V3.1` directly.
Select an image based on your machine type and start the container on your node; refer to [using docker](../../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
# Update the vllm-ascend image according to your environment.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
You need to set up the environment on each node.
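Inside each container, you can quickly confirm that the mounted NPUs are visible before moving on (a simple sanity check; the number of listed devices depends on how many `davinci` devices you mapped in):
```shell
# All NPU devices mapped into the container should appear in this listing.
npu-smi info
```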
## Prefiller/Decoder Deployment
Run the following scripts to launch a server on the prefiller and decoder nodes, respectively. Please note that each P/D node will occupy ports ranging from `kv_port` to `kv_port + num_chips` to initialize socket listeners, so make sure these port ranges do not conflict. Additionally, ensure that each node's `engine_id` is unique to avoid conflicts.
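Before launching, you can optionally check that the planned `kv_port` range is free on each node (a simple pre-flight sketch; the base port 30000 and 16 chips per node match the prefiller scripts below):
```shell
# Scan the ports a node will claim: kv_port .. kv_port + num_chips.
kv_port=30000
num_chips=16
for port in $(seq "$kv_port" $((kv_port + num_chips))); do
  if ss -ltn | grep -q ":${port} "; then
    echo "port ${port} is already in use"
  fi
done
```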
1. Run the following script to execute online 128k inference on three nodes respectively.
:::::{tab-set}
:sync-group: nodes
::::{tab-item} Prefiller node 1
:sync: prefill node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.1"
master_addr="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--decode-context-parallel-size 8 \
--prefill-context-parallel-size 2 \
--cp-kv-cache-interleave-size 128 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--enforce-eager \
--served-model-name deepseek_v3 \
--seed 1024 \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--max-num-seqs 1 \
--max-model-len 136000 \
--max-num-batched-tokens 136000 \
--block-size 128 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--nnodes 2 \
--node-rank 0 \
--master-addr $master_addr \
--master-port 7001 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```
::::
::::{tab-item} Prefiller node 2
:sync: prefill node2
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.2"
master_addr="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--decode-context-parallel-size 8 \
--prefill-context-parallel-size 2 \
--cp-kv-cache-interleave-size 128 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--enforce-eager \
--served-model-name deepseek_v3 \
--seed 1024 \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--max-num-seqs 1 \
--max-model-len 136000 \
--max-num-batched-tokens 136000 \
--block-size 128 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--nnodes 2 \
--node-rank 1 \
--headless \
--master-addr $master_addr \
--master-port 7001 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```
::::
::::{tab-item} Decoder node 1
:sync: decoder node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.3"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5980 \
--decode-context-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--no-enable-prefix-caching \
--distributed-executor-backend mp \
--served-model-name deepseek_v3 \
--seed 1024 \
--max-model-len 136000 \
--max-num-batched-tokens 128 \
--enable-chunked-prefill \
--max-num-seqs 4 \
--trust-remote-code \
--gpu-memory-utilization 0.96 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "3",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```
::::
:::::
2. Create a `proxy.sh` script on the prefill master node with the following content:
```shell
python load_balance_proxy_server_example.py \
--port 8005 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
--prefiller-ports \
8004 \
--decoder-hosts \
192.0.0.3 \
--decoder-ports \
8004
```
3. Run the proxy
Run the proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```
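Once the proxy is up, you can verify the whole pipeline with a single request to its OpenAI-compatible endpoint (a minimal sketch; the host, port, and served model name follow the scripts above):
```shell
curl -s http://192.0.0.1:8005/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek_v3",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0
  }'
```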
**Notice:**
The parameters are explained as follows:
- `--tensor-parallel-size 16` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 8` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
- `--quantization` "ascend" indicates that quantization is used. To disable quantization, remove this option.
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings:
  - "cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
  - "cudagraph_capture_sizes": represents the different levels of the graph mode. The default value is [1, 2, 4, 8, 16, 24, 32, 40, ..., `--max-num-seqs`]. In graph mode, the input shape for each level is fixed, and inputs between levels are automatically padded up to the next level. Currently, the default setting is recommended; only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tensor-parallel-size > 1.
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the PD-colocated deployment scenario. It will be removed in the future.
**Notice:**
- `--tensor-parallel-size` needs to be divisible by `--decode-context-parallel-size`.
- `--decode-context-parallel-size` must be less than or equal to `--tensor-parallel-size`.
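To make the `--gpu-memory-utilization` formula above concrete, here is a back-of-the-envelope sketch; the HBM size and peak profile-run usage are illustrative assumptions rather than measured values:
```shell
# Illustrative numbers only: 64 GB HBM per chip, 45 GB peak usage during the profile run,
# and --gpu-memory-utilization 0.8 as in the prefiller scripts above.
# KV cache budget per chip = gpu-memory-utilization * HBM size - peak profile-run usage
python3 -c "print(f'KV cache budget per chip: {0.8 * 64 - 45:.1f} GB')"   # 6.2 GB
```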
## Accuracy Evaluation
The following shows how to evaluate accuracy using AISBench.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result. The result for `DeepSeek-V3.1-w8a8` is shown below for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----------| ----- | ----- | ----- |-----------------------|
| aime2024 | - | accuracy | gen | 86.67 |
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
### Using vLLM Benchmark
Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
Take `serve` as an example. Run the command as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompt 20 --request-rate 0 --save-result --result-dir ./
```
After several minutes, you can get the performance evaluation result.
| dataset | version | metric | mode | ttft |
|---------| ----- |-------------|------|--------|
| random | - | performance | perf | 20.7s |


@@ -0,0 +1,179 @@
# Long-Sequence Context Parallel (Qwen3-235B-A22B)
## Getting Started
vLLM Ascend now supports long-sequence context parallel. This guide walks through the steps to verify this feature with constrained resources.
Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single-node PD-colocated architecture.
## Environment Preparation
### Model Weight
- `Qwen3-235B-A22B-w8a8` (quantized version): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to a shared cache directory, such as `/root/.cache/`.
### Run with Docker
Start a Docker container on the node.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
## Deployment
### Single-node Deployment
`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16) server.
The quantized version needs to be started with the parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
```shell
#!/bin/sh
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--prefill-context-parallel-size 2 \
--decode-context-parallel-size 2 \
--seed 1024 \
--quantization ascend \
--served-model-name qwen3 \
--max-num-seqs 1 \
--max-model-len 133008 \
--max-num-batched-tokens 133008 \
--enable-expert-parallel \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8]}' \
--async-scheduling
```
**Notice:**
- For vLLM versions below `v0.12.0`, use the parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}'`
- For vLLM version `v0.12.0`, use the parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'`
The parameters are explained as follows:
- `--tensor-parallel-size 8` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 2` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
- (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
- (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
- Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
- `--quantization` "ascend" indicates that quantization is used. To disable quantization, remove this option.
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings:
  - "cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
  - "cudagraph_capture_sizes": represents the different levels of the graph mode. The default value is [1, 2, 4, 8, 16, 24, 32, 40, ..., `--max-num-seqs`]. In graph mode, the input shape for each level is fixed, and inputs between levels are automatically padded up to the next level. Currently, the default setting is recommended; only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.
**Notice:**
- `tp_size` needs to be divisible by `dcp_size`.
- The decode context parallel size must be less than or equal to `max_dcp_size`, where `max_dcp_size = tensor_parallel_size // total_num_kv_heads`.
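Once the server is up, you can send a quick request to verify it before running evaluations (a minimal sketch; the port and served model name follow the script above):
```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 32
  }'
```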
## Accuracy Evaluation
The following shows how to evaluate accuracy using AISBench.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result. The result for `Qwen3-235B-A22B-w8a8` is shown below for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----------| ----- | ----- | ----- |-----------------------|
| aime2024 | - | accuracy | gen | 83.33 |
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
### Using vLLM Benchmark
Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
Take `serve` as an example. Run the command as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompt 1 --request-rate 1 --save-result --result-dir ./
```
After several minutes, you can get the performance evaluation result.
| dataset | version | metric | mode | ttft |
|---------| ----- |-------------|------|--------|
| random | - | performance | perf | 17.36s |


@@ -0,0 +1,343 @@
# PD-Colocated with Mooncake Multi-Instance
## Getting Started
vLLM-Ascend now supports PD-colocated deployment with Mooncake features.
This guide provides step-by-step instructions to test these features with
constrained resources.
Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates
how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2
nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and
uses PD-colocated deployment.
## Verify Multi-Node Communication Environment
### Physical Layer Requirements
- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE
network. Without RoCE interconnection, cross-node KV Cache access
performance will be significantly degraded.
- All NPU cards must communicate properly. Intra-node communication uses HCCS,
while inter-node communication uses the RoCE network.
### Verification Process
The following process serves as a reference example. Please modify parameters
such as IP addresses according to your actual environment.
1. Single Node Verification:
Execute the following commands sequentially. The results must all be
`success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker,
mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses:
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g; done
```
4. Cross-Node PING Test:
```bash
# Execute the following command on each node, replacing x.x.x.x
# with the target node's NPU card address.
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
```
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
## Run with Docker
Start a Docker container on each node.
```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend
# Run the container using the defined variables
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
## (Optional) Install Mooncake
Mooncake is pre-installed and functional in the v0.11.0 image.
The following installation steps are optional.
Mooncake is the serving platform for Kimi, a leading LLM service provided by
Moonshot AI. Installation and compilation guide:
<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, obtain the Mooncake project using the following command:
```bash
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
```
Install MPI:
```bash
apt-get install mpich libmpich-dev -y
```
Install the relevant dependencies (Go installation is not required):
```bash
bash dependencies.sh -y
```
Compile and install:
```bash
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```
After installation, verify that Mooncake is installed correctly:
```bash
python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake/__init__.py
```
## Start Mooncake Master Service
To start the Mooncake master service in one of the node containers, use the
following command:
```bash
docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
--eviction_high_watermark_ratio 0.95 \
--eviction_ratio 0.05
```
| Parameter | Value | Explanation |
| ----------------------------- | ----- | ------------------------------------- |
| port | 50088 | Port for the master service |
| eviction_high_watermark_ratio | 0.95 | High watermark ratio (95% threshold) |
| eviction_ratio | 0.05 | Percentage to evict when full (5%) |
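You can confirm that the master service is listening before moving on (a simple check; the port matches the command above):
```bash
# A LISTEN entry on port 50088 indicates the master service started successfully.
ss -ltn | grep 50088
```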
## Create a Mooncake Configuration File Named mooncake.json
The template for the mooncake.json file is as follows:
```json
{
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": true,
"master_server_address": "<your_server_ip>:50088",
"global_segment_size": 107374182400
}
```
| Parameter | Value | Explanation |
| --------------| ------------------------| -----------------------------------|
| metadata_server | P2PHANDSHAKE | Point-to-point handshake mode |
| protocol | ascend | Ascend proprietary protocol |
| use_ascend_direct | true | Enable direct hardware access |
| master_server_address | 90.90.100.188:50088(for example) | Master server address|
| global_segment_size | 107374182400 | Size per segment (100 GB) |
## vLLM Instance Deployment
Create containers on both Node 1 and Node 2, and launch the
Qwen2.5-72B-Instruct model service in each to test the reusability and
performance of cross-node, cross-instance KV Cache. Instance 1 utilizes NPU
cards [0-3] on the first Atlas 800T A2 server, while Instance 2 utilizes
cards [0-3] on the second server.
### Deploy Instance 1
Replace file paths, host, and port parameters based on your actual environment
configuration.
```bash
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool: quantity:size(MB)
# Allocates 4 buffers of 8MB each for KV transfer
export ASCEND_BUFFER_POOL=4:8
vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
--served-model-name qwen \
--dtype bfloat16 \
--max-model-len 25600 \
--tensor-parallel-size 4 \
--host <your_server_ip> \
--port 8002 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
"kv_connector": "MooncakeConnectorStoreV1",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"use_layerwise": false,
"mooncake_rpc_port": "0",
"load_async": true,
"register_buffer": true
}
}'
```
### Deploy Instance 2
The deployment method for Instance 2 is identical to Instance 1. Simply
modify the `--host` and `--port` parameters according to your Instance 2
configuration.
### Configuration Parameters
| Parameter | Value | Explanation |
| ----------------- | ----------------------| -------------------------------- |
| kv_connector | MooncakeConnectorStoreV1 | Use StoreV1 version |
| kv_role | kv_both | Enable both produce and consume |
| use_layerwise | false | Transfer entire cache (see note) |
| mooncake_rpc_port | 0 | Automatic port assignment |
| load_async | true | Enable asynchronous loading |
| register_buffer | true | Required for PD-colocated mode |
**Note on use_layerwise:**
- `false`: Transfer entire KV Cache (suitable for cross-node with sufficient
bandwidth)
- `true`: Layer-by-layer transfer (suitable for single-node memory
constraints)
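Before benchmarking, it is worth confirming that both instances answer requests (a minimal sketch; replace the host with the IP of the instance under test, while the model name and port follow the deployment script above):
```bash
curl -s http://<your_server_ip>:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8
  }'
```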
## Benchmark
We recommend using the **aisbench** tool to assess performance. The test uses
**Dataset A**, consisting of fully random data, with the following
configuration:
- Input/output tokens: 1024/10
- Total requests: 100
- Concurrency: 25
The test procedure consists of three steps:
### Step 1: Baseline (No Cache)
Send Dataset A to Instance 1 on Node 1 and record the Time to First Token
(TTFT) as **TTFT1**.
### Preparation for Step 2
Before Step 2, send a fully random Dataset B to Instance 1. Due to the
unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
cache only in Node 1's DRAM.
### Step 2: Local DRAM Hit
Send Dataset A to Instance 1 again to measure the performance when hitting
the KV Cache in local DRAM. Record the TTFT as **TTFT2**.
### Step 3: Cross-Node DRAM Hit
Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results
in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as
**TTFT3**.
**Model Configuration**:
```python
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
models = [
dict(
attr="service",
type=VLLMCustomAPIChatStream,
abbr='vllm-api-stream-chat',
path="<path_to_your_model>/Qwen2.5-72B-Instruct",
model="qwen",
request_rate = 0,
retry = 2,
host_ip = "<your_server_ip>",
host_port = 8002,
max_out_len = 10,
batch_size= 25,
trust_remote_code=False,
generation_kwargs = dict(
temperature = 0,
ignore_eos = True,
),
)
]
```
**Performance Benchmarking Commands**
```shell
ais_bench --models vllm_api_stream_chat \
--datasets gsm8k_gen_0_shot_cot_str_perf \
--debug --summarizer default_perf --mode perf
```
### Test Results
| Requests | Concur | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
| -------- | ------ | ---------- | ---------- | ---------- |
| 100 | 25 | 2322 | 739 | 948 |


@@ -0,0 +1,951 @@
# Prefill-Decode Disaggregation (Deepseek)
## Getting Started
vLLM Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through the steps to verify these features with constrained resources.
Taking the DeepSeek-R1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
## Verify Multi-Node Communication Environment
### Physical Layer Requirements
- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs must be interconnected. Intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.
### Verification Process
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
:::::{tab-set}
::::{tab-item} A3
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses
```bash
# Get virtual npu ip
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
4. Get superpodid and SDID
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
5. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with virtual npu ip address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
6. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
for i in {0..15}; do hccn_tool -i $i -tls -g ; done | grep switch
```
::::
::::{tab-item} A2
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
::::
:::::
## Run with Docker
Start a Docker container on each node.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```
(Optional) Replace the Go install URL if the network is slow:
```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```
Install MPI:
```shell
apt-get install mpich libmpich-dev -y
```
Install the relevant dependencies (Go installation is not required):
```shell
bash dependencies.sh -y
```
Compile and install
```shell
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```
Set environment variables
**Note:**
- Adjust the Python path according to your specific Python installation
- Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`
```shell
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
```
## Prefiller/Decoder Deployment
Run the following scripts to launch a server on the prefiller and decoder nodes, respectively. Please note that each P/D node will occupy ports ranging from `kv_port` to `kv_port + num_chips` to initialize socket listeners, so make sure these port ranges do not conflict. Additionally, ensure that each node's `engine_id` is unique to avoid conflicts.
### kv_port Configuration Guide
On Ascend NPU, Mooncake uses AscendDirectTransport for RDMA data transfer, which randomly allocates ports within range `[20000, 20000 + npu_per_node × 1000)`. If `kv_port` overlaps with this range, intermittent port conflicts may occur. To avoid this, configure `kv_port` according to the table below:
| NPUs per Node | Reserved Port Range | Recommended kv_port |
|---------------|---------------------|---------------------|
| 8 | 20000 - 27999 | >= 28000 |
| 16 | 20000 - 35999 | >= 36000 |
```{warning}
If you occasionally see `zmq.error.ZMQError: Address already in use` during startup, it may be caused by kv_port conflicting with randomly allocated AscendDirectTransport ports. Increase your kv_port value to avoid the reserved range.
```
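As a quick rule of thumb derived from the reserved range above, the lowest safe `kv_port` for a node can be computed from its NPU count:
```shell
# AscendDirectTransport reserves [20000, 20000 + npu_per_node * 1000); pick kv_port above it.
npu_per_node=16
echo "recommended kv_port >= $((20000 + npu_per_node * 1000))"   # prints 36000
```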
### launch_online_dp.py
Use `launch_online_dp.py` to launch external DP vLLM servers.
[launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)
### run_dp_template.sh
Modify `run_dp_template.sh` on each node.
[run\_dp\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)
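For reference, the positional arguments that `launch_online_dp.py` passes into `run_dp_template.sh` map onto the serve flags as follows (inferred from the scripts below, not an exhaustive description of the template):
```shell
# Positional arguments consumed by run_dp_template.sh in the scripts below:
#   $1 -> ASCEND_RT_VISIBLE_DEVICES     $2 -> --port
#   $3 -> --data-parallel-size          $4 -> --data-parallel-rank
#   $5 -> --data-parallel-address       $6 -> --data-parallel-rpc-port
#   $7 -> --tensor-parallel-size
```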
#### Layerwise
:::::{tab-set}
:sync-group: nodes
::::{tab-item} Prefiller node 1
:sync: prefill node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"enable_shared_expert_dp": true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "36000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Prefiller node 2
:sync: prefill node2
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.2"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"enable_shared_expert_dp": true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "36100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 1
:sync: decoder node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.3"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 256 \
--max-num-seqs 40 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "36200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 2
:sync: decoder node2
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.4"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 256 \
--max-num-seqs 40 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "36200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
:::::
#### Non-layerwise
:::::{tab-set}
:sync-group: nodes
::::{tab-item} Prefiller node 1
:sync: prefill node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"enable_shared_expert_dp": true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "36000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Prefiller node 2
:sync: prefill node2
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.2"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"enable_shared_expert_dp": true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "36100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 1
:sync: decoder node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.3"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 256 \
--max-num-seqs 40 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "36200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder node 2
:sync: decoder node2
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.4"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 256 \
--max-num-seqs 40 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "36200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
::::
:::::
### Start the service
```bash
# on 192.0.0.1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.2
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.3
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.4
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
## Example Proxy for Deployment
Run a proxy server on the same node where your prefiller service instance is deployed. You can find the proxy implementation in the repository's examples directory.
We provide two different proxy implementations with distinct request routing behaviors:
- **`load_balance_proxy_layerwise_server_example.py`**: Requests are first routed to the D nodes, which then forward them to the P nodes as needed. This proxy is designed for use with the MooncakeLayerwiseConnector: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)
- **`load_balance_proxy_server_example.py`**: Requests are first routed to the P nodes, which then forward them to the D nodes for subsequent processing. This proxy is designed for use with the MooncakeConnector: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
:::::{tab-set}
::::{tab-item} Layerwise
```shell
python load_balance_proxy_layerwise_server_example.py \
--port 1999 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
192.0.0.1 \
192.0.0.2 \
192.0.0.2 \
--prefiller-ports \
7100 7101 7100 7101 \
--decoder-hosts \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
--decoder-ports \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```
::::
::::{tab-item} Non-layerwise
```shell
python load_balance_proxy_server_example.py \
--port 1999 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
192.0.0.1 \
192.0.0.2 \
192.0.0.2 \
--prefiller-ports \
7100 7101 7100 7101 \
--decoder-hosts \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
--decoder-ports \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```
::::
:::::
| Parameter | Meaning |
| --- | --- |
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of the prefiller instances |
| --prefiller-ports | Ports of the prefiller instances |
| --decoder-hosts | Hosts of the decoder instances |
| --decoder-ports | Ports of the decoder instances |
You can get the proxy program in the repository's examples, [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
## Benchmark
We recommend using the [aisbench](https://gitee.com/aisbench/benchmark) tool to assess performance. Execute the following commands to install aisbench:
```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./
```
You need to unset the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
unset http_proxy
unset https_proxy
```
- You can place your datasets in the dir: `benchmark/ais_bench/datasets`
- You can change the configuration in the dir `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:
```python
# Import the request backend used by this config (as in the AISBench examples)
from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="/root/.cache/ds_r1",
        model="ds_r1",            # must match --served-model-name
        request_rate = 14,
        retry = 2,
        host_ip = "192.0.0.1",    # proxy service host IP
        host_port = 1999,         # proxy service port (matches --port above)
        max_out_len = 10,
        batch_size=768,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0,
            seed = 1024,
            ignore_eos=False,
        )
    )
]
```
- Take the gsm8k dataset as an example; execute the following command to assess performance.
```shell
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
```
- For more details on aisbench commands and parameters, refer to [aisbench](https://gitee.com/aisbench/benchmark).
## FAQ
### 1. Prefiller nodes need to warm up
Some NPU operators require several rounds of warm-up before they reach their best performance, so we recommend preheating the service with a few requests before running performance tests to achieve the best end-to-end throughput.
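A minimal warm-up sketch is shown below. It assumes the proxy started earlier in this tutorial (host `192.0.0.1`, port `1999`) and the served model name `ds_r1`; adjust both to your own deployment.
```shell
# Send a few short requests through the proxy before benchmarking so that
# the NPU operators have gone through their warm-up rounds.
for i in $(seq 1 8); do
  curl -s http://192.0.0.1:1999/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ds_r1", "prompt": "Warm-up request, please ignore.", "max_tokens": 32, "temperature": 0}' \
    > /dev/null
done
```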
## Verification
Check service health using the proxy server endpoint.
```shell
curl http://192.0.0.1:1999/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ds_r1",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0
    }'
```
View File
@@ -0,0 +1,277 @@
# Prefill-Decode Disaggregation (Qwen2.5-VL)
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources.
Taking the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on one Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1.
## Verify Communication Environment
### Verification Process
1. Single Node Verification:
Execute the following commands in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
5. Check NPU TLS Configuration
```bash
# The tls settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
## Run with Docker
Start a Docker container.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```
## Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```
(Optional) Replace the Go download URL if your network connection is poor
```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```
Install MPI
```shell
apt-get install mpich libmpich-dev -y
```
Install the relevant dependencies. The installation of Go is not required.
```shell
bash dependencies.sh -y
```
Compile and install
```shell
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```
Set environment variables
**Note:**
- Adjust the Python path according to your specific Python installation
- Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`
```shell
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
```
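If `/usr/local/lib` and `/usr/local/lib64` are not already on your library path, you can prepend them as well (a minimal sketch; adjust the paths if Mooncake was installed elsewhere):
```shell
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:$LD_LIBRARY_PATH
```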
## Prefiller/Decoder Deployment
We can run the following scripts to launch a server on the prefiller/decoder NPU, respectively.
:::::{tab-set}
::::{tab-item} Prefiller
```shell
export ASCEND_RT_VISIBLE_DEVICES=0
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
vllm serve /model/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 13700 \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name qwen25vl \
--max-model-len 40000 \
--max-num-batched-tokens 40000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
}'
```
::::
::::{tab-item} Decoder
```shell
export ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
vllm serve /model/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 13701 \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name qwen25vl \
--max-model-len 40000 \
--max-num-batched-tokens 40000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
}'
```
::::
:::::
If you want to run "2P1D", set `ASCEND_RT_VISIBLE_DEVICES` and the service port to different values for each P process, as illustrated below.
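The sketch below shows one possible set of values for a second prefiller; the device number, port, and `kv_port` are assumptions for illustration, not part of the original scripts, so pick any free NPU and ports on your node.
```shell
# Second prefiller for a "2P1D" setup: reuse the prefiller script above with
# a different NPU, service port, and Mooncake kv_port (all example values).
export ASCEND_RT_VISIBLE_DEVICES=2   # the first prefiller uses NPU 0, the decoder NPU 1
PREFILLER2_PORT=13702                # pass this as --port to the second vllm serve instance
PREFILLER2_KV_PORT=30200             # "kv_port" must be unique per engine on the same node
# Remember to add the second host/port pair to --prefiller-hosts / --prefiller-ports
# in the proxy command below.
```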
## Example Proxy for Deployment
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
python load_balance_proxy_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 \
    --prefiller-ports 13700 \
--decoder-hosts 192.0.0.1 \
--decoder-ports 13701
```
| Parameter | Meaning |
| --- | --- |
| --port | Port of the proxy |
| --prefiller-ports | All ports of the prefiller instances |
| --decoder-ports | All ports of the decoder instances |
## Verification
Check service health using the proxy server endpoint.
```shell
curl http://192.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen25vl",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
],
"max_completion_tokens": 100,
"temperature": 0
}'
```
View File
@@ -0,0 +1,192 @@
# Ray Distributed (Qwen3-235B-A22B)
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on Multi-node**
## Verify Multi-Node Communication Environment
### Physical Layer Requirements
* The physical machines must be located on the same LAN, with network connectivity.
* All NPUs are connected with optical modules, and the connection status must be normal.
### Verification Process
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
### NPU Interconnect Verification
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
#### 2. Cross-Node PING Test
```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is advised to use Docker images.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.
Below is the example container setup command, which should be executed on **all nodes** :
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: if you are running a bridge network with Docker, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
# IMPORTANT: this must be a shared directory accessible by all nodes
-v /path/to/shared/cache:/root/.cache \
-it $IMAGE bash
```
### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.
Below are the commands for the primary and secondary nodes:
**Primary node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
Updating the environment variables requires restarting the Ray cluster.
:::
```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```
**Secondary node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::
```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
```
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
After Ray starts successfully, the output reports that a local Ray instance has been started and includes:
- Dashboard URL: the access address for the Ray Dashboard (default: <http://localhost:8265>)
- Node status: CPU/memory resources and the number of healthy nodes
- Cluster connection address: used for adding more nodes
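For example, after all nodes have joined you can confirm the cluster membership from the head node (for the 2-node deployment used in the next section you should see 2 nodes and 16 NPUs reported):
```shell
ray status       # summary of cluster nodes and their CPU/NPU resources
ray list nodes   # one entry per node with its state and IP address
```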
## Start the Online Inference Service on Multi-node
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
**You only need to run the vllm command on one node.**
To set up parallelism, the common practice is to set the `tensor-parallel-size` to the number of NPUs per node, and the `pipeline-parallel-size` to the number of nodes.
For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
```shell
vllm serve Qwen/Qwen3-235B-A22B \
--distributed-executor-backend ray \
--pipeline-parallel-size 2 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--max-model-len 8192 \
--max-num-seqs 25 \
--served-model-name qwen \
--trust-remote-code \
--gpu-memory-utilization 0.9
```
Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:
```shell
vllm serve Qwen/Qwen3-235B-A22B \
--distributed-executor-backend ray \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--max-model-len 8192 \
--max-num-seqs 25 \
--served-model-name qwen \
--trust-remote-code \
--gpu-memory-utilization 0.9
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"prompt": "tell me how to sleep well",
"max_completion_tokens": 100,
"temperature": 0
}'
```
View File
@@ -0,0 +1,179 @@
# Suffix Speculative Decoding
## **Introduction**
Suffix Decoding is an optimization technique for speculative decoding based on pattern matching. It simultaneously retrieves repetitive sequences from both the prompt and the generated content, using frequency statistics to predict the most likely token continuations. Unlike traditional speculative decoding methods, Suffix Decoding runs entirely on the CPU, eliminating the need for additional GPU resources or draft models, which results in superior acceleration for repetitive tasks such as AI agents and code generation.
This document provides step-by-step guidance on how to deploy and benchmark the Suffix Decoding speculative inference technology supported by `vllm-ascend` on Atlas A2 hardware. The setup utilizes a single Atlas 800T A2 node with a 4-card deployment of the Qwen3-32B model instance. Benchmarking is conducted using authentic open-source datasets covering the following categories:
| **Dataset Category** | **Dataset Name** |
| ------------------------------ | ---------------- |
| Code Generation | HumanEval |
| Common Sense Reasoning | ARC |
| Mathematical Reasoning | gsm8k |
| Natural Language Understanding | SuperGLUE_BoolQ |
| Comprehensive Examination | agieval |
| Multi-turn Dialogue | sharegpt |
The benchmarking tool used in this tutorial is AISbench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
## **Download vllm-ascend Image**
This tutorial uses the official image, version v0.13.0rc1. Use the following command to download:
```bash
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
```
## **Run with Docker**
Container startup command:
```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
export NAME=vllm-ascend
# Run the container using the defined variables
# This test uses four Atlas A2 NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
## **Install arctic-inference**
Before enabling Suffix Decoding speculative inference on Ascend, the Arctic Inference plugin must be installed. Arctic Inference is an open-source plugin launched by Snowflake specifically to optimize LLM inference speed. For detailed technical principles, please refer to the following article: [Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training](https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/). Install it within the container using the following command:
```bash
pip install arctic-inference
```
## **vLLM Instance Deployment**
Use the following command to start the service instance in the container. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to `suffix`. For this test, `num_speculative_tokens` is uniformly set to `3`.
```bash
# set the NPU device number
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AIVector core to directly schedule ROCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable MLP prefetch for better performance.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /data/Qwen3-32B \
--served-model-name qwen3 \
--trust-remote-code \
--distributed-executor-backend mp \
--tensor-parallel-size 4 \
--max-model-len 5500 \
--max-num-batched-tokens 40960 \
--speculative-config '{"method": "suffix", "num_speculative_tokens": 3}' \
--gpu-memory-utilization 0.9 \
--additional-config '{"pa_shape_list":[48,64,72,80]}' \
--port 8011
```
## **AISbench Benchmark Testing**
Performance for all open-source datasets is tested using AISbench. For specific instructions, refer to [Using AISBench for performance evaluation](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation).
**Model Configuration**:
```bash
# "ignore_eos" must be set to "False", and "max_out_len" should be set to a large value to allow the model to output completely and naturally.
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
models = [
dict(
attr="service",
type=VLLMCustomAPIChatStream,
abbr='vllm-api-stream-chat',
path="<path_to_your_model>/Qwen3-32B",
model="qwen3",
request_rate = 0,
retry = 2,
host_ip = "<your_server_ip>",
host_port = 8011,
max_out_len = 4000,
batch_size= 16,
trust_remote_code=False,
generation_kwargs = dict(
temperature = 0,
ignore_eos = False
)
)
]
```
**Performance Benchmarking Commands**:
```bash
# Example command to test gsm8k dataset performance using the first 100 prompts. Commands for other datasets are similar.
ais_bench --models vllm_api_stream_chat \
--datasets gsm8k_gen_0_shot_cot_str_perf \
--debug --summarizer default_perf --mode perf --num-prompts 100
```
## **Test Results**
Below are the detailed test results for the six open-source datasets in this evaluation. Compared to the baseline, the improvement in TPOT and throughput at different concurrency levels after enabling Suffix Decoding varies across datasets. Below is a summary of the results:
| **Dataset Category** | **Typical Representative** | **Throughput Improvement (BS=1-10)** | **SLO TPOT** |
| -------------------- | -------------------------- | ------------------------------------ | ------------ |
| **High Gain** | AGIEval, GSM8K | **> 50%** | < 50ms |
| **Medium-Low Gain** | ARC, ShareGPT | **20% ~ 30%** | < 50ms |
Below are the raw detailed test results:
| Concurrency | Avg Input | Avg Output | Requests | Base TPOT(ms) | Base Throughput(TPS) | Suffix TPOT(ms) | Suffix Throughput(TPS) | Accept Rate | TPOT Gain | TPS Gain |
| ------------------- | --------- | ---------- | -------- | ------------- | -------------------- | --------------- | ---------------------- | ----------- | --------- | -------- |
| **Humaneval** | | | | | | | | | | |
| 1 | 150 | 2700 | 100 | 55.1 | 18.1 | 37.9 | 26.3 | 27.0% | 45.2% | 45.1% |
| 15 | 150 | 2700 | 100 | 61.6 | 233.8 | 45.8 | 318.2 | 27.0% | 34.6% | 36.1% |
| 26 | 150 | 2700 | 100 | 64.7 | 403.8 | 50.9 | 519.2 | 27.0% | 27.2% | 28.6% |
| **ARC** | | | | | | | | | | |
| 1 | 76 | 960 | 100 | 52.8 | 18.9 | 39.5 | 25.4 | 23.9% | 33.7% | 34.6% |
| 8 | 76 | 960 | 100 | 59.1 | 125.4 | 47.0 | 163.1 | 23.9% | 25.7% | 30.0% |
| 15 | 76 | 960 | 100 | 59.8 | 245.8 | 48.9 | 311.7 | 23.9% | 22.3% | 26.8% |
| **GSM8K** | | | | | | | | | | |
| 1 | 67 | 1570 | 100 | 55.5 | 18.0 | 35.7 | 28.5 | 31.1% | 55.6% | 58.4% |
| 17 | 67 | 1570 | 100 | 61.5 | 279.8 | 45.4 | 403.0 | 31.1% | 35.6% | 44.0% |
| 26 | 67 | 1570 | 100 | 63.9 | 396.4 | 50.0 | 527.6 | 31.1% | 27.8% | 33.1% |
| **ShareGPT** | | | | | | | | | | |
| 1 | 666 | 231 | 327 | 54.1 | 18.3 | 39.2 | 24.1 | 23.9% | 37.9% | 31.5% |
| 8 | 666 | 231 | 327 | 58.8 | 125.0 | 46.2 | 153.2 | 23.9% | 27.1% | 22.5% |
| 14 | 666 | 231 | 327 | 61.8 | 227.0 | 49.9 | 273.9 | 23.9% | 23.8% | 20.7% |
| **SuperGLUE_BoolQ** | | | | | | | | | | |
| 1 | 207 | 314 | 100 | 54.1 | 18.4 | 36.1 | 26.8 | 33.4% | 49.8% | 45.6% |
| 16 | 207 | 314 | 100 | 60.0 | 229.7 | 43.5 | 303.9 | 33.4% | 38.0% | 32.3% |
| 32 | 207 | 314 | 100 | 62.7 | 396.4 | 47.8 | 507.5 | 33.4% | 31.3% | 28.0% |
| **Agieval** | | | | | | | | | | |
| 1 | 735 | 1880 | 100 | 53.1 | 18.7 | 31.8 | 34.1 | 50.3% | 66.8% | 81.9% |
| 24 | 735 | 1880 | 100 | 64.0 | 381.2 | 43.3 | 629.0 | 50.3% | 47.8% | 65.0% |
| 34 | 735 | 1880 | 100 | 70.0 | 494.6 | 50.2 | 768.4 | 50.3% | 39.4% | 55.3% |