[v0.11.0][Doc] Update doc (#3852)

### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
This commit is contained in:
zhangxinyuehfad
2025-10-29 11:32:12 +08:00
committed by GitHub
parent 6188450269
commit 75de3fa172
49 changed files with 724 additions and 701 deletions

View File

@@ -1,7 +1,7 @@
# Multi-Node (DeepSeek V3.2)
:::{note}
Only machines with aarch64 is supported currently, x86 is coming soon. This guide take A3 as the example.
Only machines with AArch64 are supported currently. x86 will be supported soon. This guide takes A3 as the example.
:::
## Verify Multi-Node Communication Environment
@@ -80,14 +80,14 @@ for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Deploy DeepSeek-V3.2-Exp with vLLM-Ascend:
## Deploy DeepSeek-V3.2-Exp with vLLM-Ascend
Currently, we provide a all-in-one image (include CANN 8.2RC1 + [SparseFlashAttention/LightningIndexer](https://gitcode.com/cann/cann-recipes-infer/tree/master/ops/ascendc) + [MLAPO](https://github.com/vllm-project/vllm-ascend/pull/3226)). You can also build your own image refer to [link](https://github.com/vllm-project/vllm-ascend/issues/3278).
Currently, we provide an all-in-one image (including CANN 8.2RC1 + [SparseFlashAttention/LightningIndexer](https://gitcode.com/cann/cann-recipes-infer/tree/master/ops/ascendc) + [MLAPO](https://github.com/vllm-project/vllm-ascend/pull/3226)). You can also build your own image by referring to [link](https://github.com/vllm-project/vllm-ascend/issues/3278).
- `DeepSeek-V3.2-Exp`: requreid 2 Atlas 800 A3(64G*16) nodes or 4 Atlas 800 A2(64G*8). [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`: requreid 1 Atlas 800 A3(64G*16) node or 2 Atlas 800 A2(64G*8). [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2-Exp`: requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
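The node counts above follow from simple weight-memory arithmetic. A rough sketch, where the ~671B parameter count and the 85% usable-memory fraction are illustrative assumptions, not official sizing guidance:

```python
import math

def nodes_needed(weight_gb: float, mem_per_node_gb: float, usable_fraction: float = 0.85) -> int:
    """Minimum number of nodes whose usable HBM can hold the model weights."""
    return math.ceil(weight_gb / (mem_per_node_gb * usable_fraction))

A3_NODE_GB = 64 * 16   # Atlas 800 A3: 16 x 64 GB
A2_NODE_GB = 64 * 8    # Atlas 800 A2: 8 x 64 GB

BF16_GB = 671 * 2      # assumed ~671B parameters x 2 bytes (BF16)
W8A8_GB = 671 * 1      # ~1 byte per parameter after W8A8 quantization

print(nodes_needed(BF16_GB, A3_NODE_GB))   # 2 A3 nodes for BF16
print(nodes_needed(BF16_GB, A2_NODE_GB))   # 4 A2 nodes for BF16
print(nodes_needed(W8A8_GB, A3_NODE_GB))   # 1 A3 node for W8A8
print(nodes_needed(W8A8_GB, A2_NODE_GB))   # 2 A2 nodes for W8A8
```

The remaining headroom per node is what the KV cache and activations use at runtime.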
Run the following command to start the container in each node(This guide suppose you have download the weight to /root/.cache already):
Run the following command to start the container on each node (you should download the weights to /root/.cache in advance):
:::::{tab-set}
::::{tab-item} A2 series
@@ -180,13 +180,13 @@ docker run --rm \
:::::{tab-set}
::::{tab-item} DeepSeek-V3.2-Exp A3 series
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -225,7 +225,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -297,9 +297,9 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
::::
::::{tab-item} DeepSeek-V3.2-Exp-W8A8 A2 series
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -341,7 +341,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh

View File

@@ -3,18 +3,18 @@
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate core engine process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division enables attention layers to be replicated across Data Parallel (DP) ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism(DP*TP), maximizing hardware utilization and efficiency.
This division allows attention layers to be replicated across DP ranks so that they process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using EP/TP, maximizing hardware utilization and efficiency.
In these cases the data parallel ranks are not completely independent, forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty dummy forward passes are performed in all ranks which dont currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks; a collective operation performed every N steps determines when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
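The alignment requirement can be illustrated with a toy scheduler sketch. This is pure Python for illustration, not the actual vLLM coordinator logic:

```python
def step(dp_ranks_batches):
    """One synchronized forward step across DP ranks.

    Every rank must enter the forward pass together: ranks with no
    scheduled requests run an empty "dummy" batch so that the expert
    layers (sharded across all ranks) can still synchronize.
    """
    if not any(dp_ranks_batches):   # all ranks idle -> the cluster can pause
        return None
    return ["real" if batch else "dummy" for batch in dp_ranks_batches]

# DP=4: only ranks 0 and 2 have requests; ranks 1 and 3 run dummy passes.
print(step([["req-a"], [], ["req-b"], []]))  # ['real', 'dummy', 'real', 'dummy']
print(step([[], [], [], []]))                # None: all idle, no forward needed
```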
## Verify Multi-Node Communication Environment
@@ -56,8 +56,8 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Run with docker
Assume you have two Atlas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantitative model across multi-node.
## Run with Docker
Assume you have two Atlas 800 A2 (64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -66,7 +66,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -91,13 +91,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -116,8 +116,8 @@ export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -139,7 +139,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -185,7 +185,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
The deployment view looks like:
![alt text](../assets/multi_node_dp_deepseek.png)
Once your server is started, you can query the model with input prompts:
@@ -201,8 +201,8 @@ curl http://{ node0 ip:8004 }/v1/completions \
}'
```
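The same completion request can also be issued from Python. A minimal stdlib-only sketch; node0's address and the port 8004 from the serve command above are assumed:

```python
import json
import urllib.request

def build_completion_request(host: str, model: str, prompt: str, max_tokens: int = 64):
    """Build an OpenAI-compatible /v1/completions request for the server above."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"http://{host}:8004/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("127.0.0.1", "vllm-ascend/DeepSeek-V3.1-W8A8", "Hello")
# urllib.request.urlopen(req) would send it once the server is up.
print(req.full_url)  # http://127.0.0.1:8004/v1/completions
```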
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
## Run Benchmarks
For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
export VLLM_USE_MODELSCOPE=true

View File

@@ -2,10 +2,10 @@
## Verify Multi-Node Communication Environment
referring to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process)
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with docker
Assume you have two Atlas 800 A3(64G*16) nodes(or 4 * A2), and want to deploy the `Kimi-K2-Instruct-W8A8` quantitative model across multi-node.
## Run with Docker
Assume you have two Atlas 800 A3 (64G*16) nodes or four A2 nodes, and want to deploy the `Kimi-K2-Instruct-W8A8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -14,7 +14,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: If you are running a bridge network with Docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -47,13 +47,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -72,8 +72,8 @@ export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -96,7 +96,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -141,7 +141,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
The deployment view looks like:
![alt text](../assets/multi_node_dp_kimi.png)
Once your server is started, you can query the model with input prompts:

View File

@@ -2,9 +2,9 @@
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with Expert Parallel (EP) options. This guide takes one-by-one steps to verify these features with constrained resources.
Take the Qwen3-30B-A3B model as an example, use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the ip of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, use 2 NPUs to deploy one service instance.
Using the Qwen3-30B-A3B model as an example, deploy the "1P2D" architecture with vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers. Assume the IP address of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, use 2 NPUs to deploy one service instance.
## Verify Multi-Node Communication Environment
@@ -49,7 +49,7 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
## Generate Ranktable
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details, please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.
```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
@@ -58,7 +58,7 @@ bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip
[--local-device-ids <id_1>,<id_2>,<id_3>...]
```
Assume that we use device 0,1 on the prefiller server node and device 6,7 on both of the decoder server nodes. Take the following commands as an example. (`--local-device-ids` is necessary if you specify certain NPU devices on the local server.)
Assume that we use devices 0 and 1 on the prefiller server node and devices 6 and 7 on both of the decoder server nodes. The following commands are for reference. (`--local-device-ids` is necessary if you specify certain NPU devices on the local server.)
```shell
# On the prefiller node
@@ -77,20 +77,20 @@ bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
--npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```
Rank table will generated at /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
The rank table will be generated at /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
|Parameter | meaning |
|Parameter | Meaning |
| --- | --- |
| --ips | Each node's local ip (prefiller nodes should be front of decoder nodes) |
| --npus-per-node | Each node's npu clips |
| --ips | Each node's local IP address (prefiller nodes should be in front of decoder nodes) |
| --npus-per-node | The number of NPU chips on each node |
| --network-card-name | The physical machines' NIC |
|--prefill-device-cnt | Npu clips used for prefill |
|--decode-device-cnt |Npu clips used for decode |
|--prefill-device-cnt | NPU chips used for prefill |
|--decode-device-cnt | NPU chips used for decode |
|--local-device-ids |Optional. No need if using all devices on the local node. |
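Before launching the servers, it can be worth sanity-checking that the generated file is valid JSON. A minimal sketch; the path is the output location mentioned above, and no particular schema is assumed:

```python
import json

RANKTABLE = "/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json"

def load_ranktable(path: str) -> dict:
    """Load the generated rank table and fail early if it is empty or not valid JSON."""
    with open(path) as f:
        table = json.load(f)
    if not table:
        raise ValueError(f"empty rank table: {path}")
    return table

# table = load_ranktable(RANKTABLE)
# print(json.dumps(table, indent=2))  # inspect the rank-to-node mapping
```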
## Prefiller / Decoder Deployment
## Prefiller/Decoder Deployment
We can run the following scripts to launch a server on the prefiller/decoder node respectively.
We can run the following scripts to launch a server on the prefiller/decoder node, respectively.
:::::{tab-set}
@@ -214,9 +214,9 @@ vllm serve /model/Qwen3-30B-A3B \
:::::
## Example proxy for Deployment
## Example Proxy for Deployment
Run a proxy server on the same node with prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
python load_balance_proxy_server_example.py \

View File

@@ -2,7 +2,7 @@
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide takes one-by-one steps to verify these features with constrained resources.
Take the Qwen3-235B model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP addresses of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
@@ -60,7 +60,7 @@ Mooncake is the serving platform for Kimi, a leading LLM service provided by Moo
git clone -b pooling_async_memecpy_v1 https://github.com/AscendTransport/Mooncake
```
Update and install Python
Update and install Python.
```shell
apt-get update
@@ -97,11 +97,11 @@ cp mooncake-transfer-engine/src/transport/ascend_transport/hccl_transport/ascend
cp mooncake-transfer-engine/src/libtransfer_engine.so /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/
```
## Prefiller / Decoder Deployment
## Prefiller/Decoder Deployment
We can run the following scripts to launch a server on the prefiller/decoder node respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
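The port bookkeeping described above can be sketched as follows. The concrete `kv_port` and `num_chips` values are illustrative, and the range is assumed to be inclusive of both endpoints:

```python
def kv_port_range(kv_port: int, num_chips: int) -> range:
    """Ports a P/D node occupies for its socket listeners: kv_port .. kv_port + num_chips."""
    return range(kv_port, kv_port + num_chips + 1)

def conflicts(a: range, b: range) -> bool:
    """True if two nodes' listener port ranges overlap."""
    return max(a.start, b.start) <= min(a.stop, b.stop) - 1

prefiller = kv_port_range(30000, 16)   # e.g. 16 chips on an A3 node
decoder = kv_port_range(30020, 16)     # spaced far enough apart
print(conflicts(prefiller, decoder))                    # False: no overlap
print(conflicts(prefiller, kv_port_range(30010, 16)))   # True: ranges overlap
```

Choosing base ports at least `num_chips + 1` apart on each node avoids the conflicts mentioned above.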
### layerwise
### Layerwise
:::::{tab-set}
@@ -223,7 +223,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
::::
::::{tab-item} Decoder node 1 (master Node)
::::{tab-item} Decoder node 1 (master node)
```shell
unset ftp_proxy
@@ -284,7 +284,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
::::
::::{tab-item} Decoder node 2 (primary Node)
::::{tab-item} Decoder node 2 (primary node)
```shell
unset ftp_proxy
@@ -348,7 +348,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
:::::
### non-layerwise
### Non-layerwise
:::::{tab-set}
@@ -595,13 +595,13 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
:::::
## Example proxy for Deployment
## Example Proxy for Deployment
Run a proxy server on the same node with prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
:::::{tab-set}
::::{tab-item} layerwise
::::{tab-item} Layerwise
```shell
python load_balance_proxy_layerwise_server_example.py \
@@ -615,7 +615,7 @@ python load_balance_proxy_layerwise_server_example.py \
::::
::::{tab-item} non-layerwise
::::{tab-item} Non-layerwise
```shell
python load_balance_proxy_server_example.py \

View File

@@ -1,15 +1,15 @@
# Multi-Node-DP (Qwen3-VL-235B-A22B)
:::{note}
Qwen3 VL rely on the newest version of `transformers`(>4.56.2). Please install it from source until it's released.
Qwen3 VL relies on the newest version of `transformers` (>4.56.2). Please install it from source.
:::
## Verify Multi-Node Communication Environment
referring to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process)
Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).
## Run with docker
Assume you have an Atlas 800 A3(64G*16) nodes(or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multi-node.
## Run with Docker
Assume you have one Atlas 800 A3 (64G*16) node (or two A2 nodes), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -49,13 +49,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
node0
Node 0
```shell
#!/bin/sh
@@ -93,7 +93,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--gpu-memory-utilization 0.8 \
```
node1
Node 1
```shell
#!/bin/sh
@@ -136,7 +136,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--gpu-memory-utilization 0.8 \
```
If the service starts successfully, the following information will be displayed on node0:
If the service starts successfully, the following information will be displayed on node 0:
```shell
INFO: Started server process [44610]

View File

@@ -1,10 +1,10 @@
# Multi-Node-Ray (Qwen/Qwen3-235B-A22B)
Multi-node inference is suitable for the scenarios that the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:
* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on multinode**
* **Start the Online Inference Service on Multi-node**
## Verify Multi-Node Communication Environment
@@ -48,9 +48,9 @@ hccn_tool -i 0 -ping -g address 10.20.0.20
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is advised to use Docker images.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.
Below is the example container setup command, which should be executed on **all nodes** :
@@ -89,13 +89,13 @@ docker run --rm \
### Start Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.
Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).
Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.
Below are the commands for the head and worker nodes:
Below are the commands for the primary and secondary nodes:
**Head node**:
**Primary node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
@@ -112,7 +112,7 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```
**Worker node**:
**Secondary node**:
:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
@@ -130,7 +130,7 @@ ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.
## Start the Online Inference Service on multinode scenario
## Start the Online Inference Service in a Multi-Node Scenario
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
**You only need to run the vllm command on one node.**

View File

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -39,13 +39,13 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:
```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \

View File

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
@@ -43,7 +43,7 @@ git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:
```bash
vllm serve /path/to/pangu-pro-moe-model \

View File

@@ -1,8 +1,8 @@
# Multi-NPU (QwQ 32B W8A8)
## Run docker container
## Run Docker Container
:::{note}
w8a8 quantization feature is supported by v0.8.4rc2 or higher
The w8a8 quantization feature is supported in v0.8.4rc2 and later.
:::
```{code-block} bash
@@ -28,7 +28,7 @@ docker run --rm \
-it $IMAGE bash
```
## Install modelslim and convert model
## Install modelslim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
@@ -53,8 +53,8 @@ SAVE_PATH=/home/models/QwQ-32B-w8a8
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```
## Verify the quantized model
The converted model files looks like:
## Verify the Quantized Model
The converted model files look like:
```bash
.
@@ -68,10 +68,10 @@ The converted model files looks like:
`-- tokenizer_config.json
```
Run the following script to start the vLLM server with quantized model:
Run the following script to start the vLLM server with the quantized model:
:::{note}
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. You can cherry-pick this commit for now.
:::
```bash
@@ -93,10 +93,10 @@ curl http://localhost:8000/v1/completions \
}'
```
Run the following script to execute offline inference on multi-NPU with quantized model:
Run the following script to execute offline inference on multi-NPU with the quantized model:
:::{note}
To enable quantization for ascend, quantization method must be "ascend"
To enable quantization on Ascend, the quantization method must be "ascend".
:::
```python

View File

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```
Setup environment variables:
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -41,13 +41,13 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32 GB of memory, tensor-parallel-size should be at least 4.
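This sizing rule follows from weight-memory arithmetic. A rough sketch, assuming ~60 GB of BF16 weights for a 30B-parameter model and ~80% of card memory usable for weights; both numbers are illustrative assumptions:

```python
def min_tensor_parallel(weight_gb: float, card_mem_gb: float, usable: float = 0.8) -> int:
    """Smallest power-of-two TP size whose per-card weight shard fits in usable memory."""
    tp = 1
    while weight_gb / tp > card_mem_gb * usable:
        tp *= 2
    return tp

WEIGHT_GB = 30 * 2  # assumed ~30B params x 2 bytes (BF16)
print(min_tensor_parallel(WEIGHT_GB, 64))  # 2 -> matches the 64 GB guidance
print(min_tensor_parallel(WEIGHT_GB, 32))  # 4 -> matches the 32 GB guidance
```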
```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
```
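The memory-to-parallelism rule above can be captured in a tiny helper; a sketch for this model only (the thresholds are the ones stated in the text, not a vLLM API):

```python
def min_tensor_parallel_size(npu_memory_gb: int) -> int:
    """Minimum --tensor-parallel-size for Qwen3-30B-A3B on Atlas A2,
    per the rule above: 64 GB cards need TP >= 2, 32 GB cards TP >= 4."""
    if npu_memory_gb >= 64:
        return 2
    return 4

print(min_tensor_parallel_size(64))  # -> 2
print(min_tensor_parallel_size(32))  # -> 4
```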
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

View File

@@ -1,7 +1,7 @@
# Multi-NPU (Qwen3-Next)
```{note}
Qwen3 Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance.
```
## Run vllm-ascend on Multi-NPU with Qwen3 Next
@@ -31,7 +31,7 @@ docker run --rm \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -41,7 +41,7 @@ export VLLM_USE_MODELSCOPE=True
### Install Triton Ascend
:::::{tab-set}
::::{tab-item} Linux (AArch64)
[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required to run Qwen3 Next. Please follow the instructions below to install it and its dependencies.
@@ -72,7 +72,7 @@ Coming soon ...
### Inference on Multi-NPU
Please make sure you have already executed the command:
```bash
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
@@ -81,15 +81,15 @@ source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
:::::{tab-set}
::::{tab-item} Online Inference
Run the following script to start the vLLM server on multi-NPU:
For an Atlas A2 with 64 GB of NPU card memory, `tensor-parallel-size` should be at least 4, and for 32 GB of memory, it should be at least 8.
```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

View File

@@ -1,13 +1,13 @@
# Single Node (Atlas 300I Series)
```{note}
1. Support for the Atlas 300I series is currently experimental. In future versions, there may be behavioral changes related to model coverage and performance.
2. Currently, the 310I series only supports eager mode and the float16 data type.
3. There are some known issues when running vLLM on the 310P series; refer to vllm-ascend [<u>#3316</u>](https://github.com/vllm-project/vllm-ascend/issues/3316) and
[<u>#2795</u>](https://github.com/vllm-project/vllm-ascend/issues/2795). You can use the v0.10.0rc1 version first.
```
## Run vLLM on Atlas 300I Series
Run docker container:
@@ -38,7 +38,7 @@ docker run --rm \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -50,7 +50,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
### Online Inference on NPU
Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):
:::::{tab-set}
:sync-group: inference
@@ -170,7 +170,7 @@ vllm serve /home/pangu-pro-moe-mode/ \
```
Once your server is started, you can query the model with input prompts.
```bash
export question="你是谁?"

View File

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -164,7 +164,7 @@ vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
:::::
:::{note}
Add the `--max_model_len` option to avoid a ValueError when the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in the KV cache (26240). The capacity differs across NPU series based on the HBM size, so adjust the value to one suitable for your NPU series.
:::
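The constraint behind that ValueError is simply that the model's maximum sequence length must not exceed the number of tokens the KV cache can hold; a small checker (the 26240 figure is the example from the note):

```python
def fits_kv_cache(max_model_len: int, kv_cache_tokens: int) -> bool:
    """True when the requested context length fits the KV cache capacity."""
    return max_model_len <= kv_cache_tokens

# Qwen2.5-7B's default max seq len vs. the capacity from the note:
print(fits_kv_cache(32768, 26240))  # -> False
print(fits_kv_cache(26240, 26240))  # -> True
```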
If your service starts successfully, you can see the info shown below:

View File

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -119,4 +119,4 @@ The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little La
### Online Serving on Single NPU
Currently, vLLM's OpenAI-compatible server doesn't support audio inputs. Find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

View File

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -149,10 +149,10 @@ vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
```
:::{note}
Add the `--max_model_len` option to avoid a ValueError when the Qwen2.5-VL-7B-Instruct model's max seq len (128000) is larger than the maximum number of tokens that can be stored in the KV cache. The capacity differs across NPU series based on the HBM size, so adjust the value to one suitable for your NPU series.
:::
If your service starts successfully, you can see the info shown below:
```bash
INFO: Started server process [2736]

View File

@@ -2,9 +2,9 @@
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.
## Run Docker Container
Using the Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
```{code-block} bash
:substitutions:
@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```
Set up environment variables:
```bash
# Load model from ModelScope to speed up download
@@ -42,7 +42,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```
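Embedding vectors returned by the server are usually compared with cosine similarity; a self-contained stdlib helper (independent of the server, shown for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
```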
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{

View File

@@ -1,8 +1,8 @@
# Single-NPU (Qwen3 8B W4A8)
## Run Docker Container
:::{note}
The w4a8 quantization feature is supported by v0.9.1rc2 and later.
:::
```{code-block} bash
@@ -25,7 +25,7 @@ docker run --rm \
-it $IMAGE bash
```
## Install modelslim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
@@ -65,8 +65,8 @@ python quant_qwen.py \
--w_method HQQ
```
## Verify the Quantized Model
The converted model files look like:
```bash
.
@@ -84,13 +84,13 @@ The converted model files looks like:
`-- tokenizer_config.json
```
Run the following script to start the vLLM server with the quantized model:
```bash
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
@@ -105,10 +105,10 @@ curl http://localhost:8000/v1/completions \
}'
```
Run the following script to execute offline inference on single-NPU with the quantized model:
:::{note}
To enable quantization on Ascend, the quantization method must be "ascend".
:::
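As a hedged sketch of what such a script passes to the engine (the model path is the one used in the serve command above; `max_model_len` is an assumption, and the actual vLLM calls are shown as comments since they need an NPU host):

```python
def build_engine_args(model_path: str) -> dict:
    """Engine arguments for an Ascend-quantized checkpoint; the
    quantization value must be the string "ascend" (see the note above)."""
    return {
        "model": model_path,
        "max_model_len": 4096,  # assumption, adjust for your workload
        "quantization": "ascend",
    }

args = build_engine_args("/home/models/Qwen3-8B-w4a8")
print(args["quantization"])  # -> ascend
# On an NPU host with vllm-ascend installed you would then run, e.g.:
#   from vllm import LLM, SamplingParams
#   llm = LLM(**args)
#   outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```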
```python