[v0.11.0][Doc] Update doc (#3852)
### What this PR does / why we need it?
Update doc

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
@@ -1,7 +1,7 @@
# Multi-Node (DeepSeek V3.2)

:::{note}
Only machines with AArch64 are supported currently; x86 will be supported soon. This guide takes A3 as the example.
:::

## Verify Multi-Node Communication Environment
@@ -80,14 +80,14 @@ for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```

## Deploy DeepSeek-V3.2-Exp with vLLM-Ascend

Currently, we provide an all-in-one image (including CANN 8.2RC1 + [SparseFlashAttention/LightningIndexer](https://gitcode.com/cann/cann-recipes-infer/tree/master/ops/ascendc) + [MLAPO](https://github.com/vllm-project/vllm-ascend/pull/3226)). You can also build your own image by referring to [this guide](https://github.com/vllm-project/vllm-ascend/issues/3278).
- `DeepSeek-V3.2-Exp`: requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)

Run the following command to start the container on each node (download the weights to /root/.cache in advance):

:::::{tab-set}
::::{tab-item} A2 series
@@ -180,13 +180,13 @@ docker run --rm \
:::::{tab-set}
::::{tab-item} DeepSeek-V3.2-Exp A3 series

Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**Node 0**

```shell
#!/bin/sh
@@ -225,7 +225,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```

**Node 1**

```shell
#!/bin/sh
@@ -297,9 +297,9 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
::::
::::{tab-item} DeepSeek-V3.2-Exp-W8A8 A2 series

Run the following scripts on the two nodes respectively.

**Node 0**

```shell
#!/bin/sh
@@ -341,7 +341,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```

**Node 1**

```shell
#!/bin/sh
@@ -3,18 +3,18 @@
## Getting Started

vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.

Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.

For Mixture-of-Experts (MoE) models, especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA), a hybrid parallelism approach is recommended:

- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.

This division enables attention layers to be replicated across DP ranks so that they process different batches independently, while expert layers are partitioned (sharded) across devices using EP/TP, maximizing hardware utilization and efficiency.

In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.

For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks, and a collective operation is performed every N steps to determine when all ranks have become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
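As a quick sanity check of the sizing described above, here is a sketch; the DP and TP values are illustrative, and the commented serve flags follow vLLM's generic parallelism options rather than a command from this guide:

```shell
# Illustrative values: 2 DP engines, each owning 4 TP worker processes.
DP=2
TP=4
# Attention is replicated per DP engine; expert layers form one EP/TP group of size DP x TP.
echo "NPUs per instance: $((DP * TP))"
# A corresponding launch would pass flags like:
#   vllm serve <model> --data-parallel-size $DP --tensor-parallel-size $TP
```

The product is the total NPU count one instance occupies, which is why all DP x TP devices must participate in every forward pass.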
## Verify Multi-Node Communication Environment
@@ -56,8 +56,8 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```

## Run with Docker

Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3.1-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -66,7 +66,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running a bridge network with Docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -91,13 +91,13 @@ docker run --rm \
-it $IMAGE bash
```

Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::

**Node 0**

```shell
#!/bin/sh
@@ -116,8 +116,8 @@ export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -139,7 +139,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```

**Node 1**

```shell
#!/bin/sh
@@ -185,7 +185,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```

The deployment view looks like:


Once your server is started, you can query the model with input prompts:
@@ -201,8 +201,8 @@ curl http://{ node0 ip:8004 }/v1/completions \
}'
```

## Run Benchmarks

For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

```shell
export VLLM_USE_MODELSCOPE=true
@@ -2,10 +2,10 @@
## Verify Multi-Node Communication Environment

Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).

## Run with Docker

Assume you have two Atlas 800 A3 (64G*16) nodes (or four A2 nodes) and want to deploy the `Kimi-K2-Instruct-W8A8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -14,7 +14,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running a bridge network with Docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -47,13 +47,13 @@ docker run --rm \
-it $IMAGE bash
```

Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::

**Node 0**

```shell
#!/bin/sh
@@ -72,8 +72,8 @@ export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

# The w8a8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -96,7 +96,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```

**Node 1**

```shell
#!/bin/sh
@@ -141,7 +141,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```

The deployment view looks like:


Once your server is started, you can query the model with input prompts:
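For reference, a completion request to the server above might carry a body like the following, a sketch assuming the OpenAI-compatible `/v1/completions` schema used by the other guides in this set; the prompt text is illustrative:

```json
{
  "model": "/home/cache/weights/Kimi-K2-Instruct-W8A8",
  "prompt": "The future of AI is",
  "max_tokens": 64
}
```

Send it to `http://<node0_ip>:8004/v1/completions` with the `Content-Type: application/json` header, substituting node 0's actual IP address.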
@@ -2,9 +2,9 @@
## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with Expert Parallel (EP) options. This guide walks through the steps to verify these features with constrained resources.

Using the Qwen3-30B-A3B model as an example, use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the IP address of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, use 2 NPUs to deploy one service instance.

## Verify Multi-Node Communication Environment
@@ -49,7 +49,7 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
## Generate Ranktable

The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details, please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.

```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
@@ -58,7 +58,7 @@ bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip
[--local-device-ids <id_1>,<id_2>,<id_3>...]
```

Assume that we use devices 0 and 1 on the prefiller server node and devices 6 and 7 on both of the decoder server nodes. The following commands are for reference. (`--local-device-ids` is necessary if you specify certain NPU devices on the local server.)
```shell
# On the prefiller node
@@ -77,20 +77,20 @@ bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
--npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```

The rank table will be generated at /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json.

| Parameter | Meaning |
| --- | --- |
| --ips | Each node's local IP address (prefiller nodes must be listed before decoder nodes) |
| --npus-per-node | The number of NPU chips on each node |
| --network-card-name | The name of the physical machine's NIC |
| --prefill-device-cnt | The number of NPU chips used for prefill |
| --decode-device-cnt | The number of NPU chips used for decode |
| --local-device-ids | Optional. Not needed if all devices on the local node are used. |
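Before moving on, it can be worth confirming that the generated file parses as valid JSON. This is a small optional check, not part of the original workflow; the default path matches the location given above:

```shell
# Optional sanity check: make sure the generated rank table is valid JSON.
RANKTABLE=${RANKTABLE:-/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json}
if [ -f "$RANKTABLE" ]; then
  python3 -c "import json, sys; json.load(open(sys.argv[1])); print('rank table OK')" "$RANKTABLE"
else
  echo "rank table not found at $RANKTABLE"
fi
```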
## Prefiller/Decoder Deployment

We can run the following scripts to launch a server on the prefiller and decoder nodes respectively.

:::::{tab-set}
@@ -214,9 +214,9 @@ vllm serve /model/Qwen3-30B-A3B \
:::::

## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
@@ -2,7 +2,7 @@
## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with Expert Parallel (EP) options. This guide walks through the steps to verify these features with constrained resources.

Take the Qwen3-235B model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP addresses of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
@@ -60,7 +60,7 @@ Mooncake is the serving platform for Kimi, a leading LLM service provided by Moo
git clone -b pooling_async_memecpy_v1 https://github.com/AscendTransport/Mooncake
```

Update and install Python.

```shell
apt-get update
@@ -97,11 +97,11 @@ cp mooncake-transfer-engine/src/transport/ascend_transport/hccl_transport/ascend
cp mooncake-transfer-engine/src/libtransfer_engine.so /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/
```

## Prefiller/Decoder Deployment

We can run the following scripts to launch a server on the prefiller and decoder nodes respectively. Please note that each P/D node occupies ports ranging from kv_port to kv_port + num_chips to initialize socket listeners, so port conflicts must be avoided. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.
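The port-range rule above can be made concrete with a small calculation; this is a sketch, and the base port and chip count are illustrative rather than values mandated by this guide:

```shell
# Illustrative: with kv_port=5500 and 16 chips, the node listens on 5500..5516,
# so all of these ports must be free before launching.
KV_PORT=5500
NUM_CHIPS=16
echo "reserve ports $KV_PORT through $((KV_PORT + NUM_CHIPS))"
```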
### Layerwise

:::::{tab-set}
@@ -223,7 +223,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
::::

::::{tab-item} Decoder node 1 (master node)

```shell
unset ftp_proxy
@@ -284,7 +284,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
::::

::::{tab-item} Decoder node 2 (primary node)

```shell
unset ftp_proxy
@@ -348,7 +348,7 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
:::::

### Non-layerwise

:::::{tab-set}
@@ -595,13 +595,13 @@ vllm serve /model/Qwen3-235B-A22B-W8A8 \
:::::

## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

:::::{tab-set}

::::{tab-item} Layerwise

```shell
python load_balance_proxy_layerwise_server_example.py \
@@ -615,7 +615,7 @@ python load_balance_proxy_layerwise_server_example.py \
::::

::::{tab-item} Non-layerwise

```shell
python load_balance_proxy_server_example.py \
@@ -1,15 +1,15 @@
# Multi-Node-DP (Qwen3-VL-235B-A22B)

:::{note}
Qwen3 VL relies on the newest version of `transformers` (>4.56.2). Please install it from source.
:::

## Verify Multi-Node Communication Environment

Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).

## Run with Docker

Assume you have one Atlas 800 A3 (64G*16) node (or two A2 nodes) and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.

```{code-block} bash
:substitutions:
@@ -49,13 +49,13 @@ docker run --rm \
-it $IMAGE bash
```

Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::

Node 0

```shell
#!/bin/sh
@@ -93,7 +93,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--gpu-memory-utilization 0.8 \
```

Node 1

```shell
#!/bin/sh
@@ -136,7 +136,7 @@ vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
--gpu-memory-utilization 0.8 \
```

If the service starts successfully, the following information will be displayed on node 0:

```shell
INFO: Started server process [44610]
@@ -1,10 +1,10 @@
# Multi-Node-Ray (Qwen/Qwen3-235B-A22B)

Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:

* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on Multiple Nodes**

## Verify Multi-Node Communication Environment
@@ -48,9 +48,9 @@ hccn_tool -i 0 -ping -g address 10.20.0.20
## Set Up and Start the Ray Cluster
### Setting Up the Basic Container
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.

For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.

Below is the example container setup command, which should be executed on **all nodes**:
@@ -89,13 +89,13 @@ docker run --rm \
### Start the Ray Cluster
After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.

Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).

Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.

Below are the commands for the primary and secondary nodes:

**Primary node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
@@ -112,7 +112,7 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```

**Secondary node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
@@ -130,7 +130,7 @@ ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.

## Start the Online Inference Service on Multiple Nodes
In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.

**You only need to run the vllm command on one node.**
@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```

Set up environment variables:

```bash
# Load the model from ModelScope to speed up the download
@@ -39,13 +39,13 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
### Online Inference on Multi-NPU

Run the following script to start the vLLM server on multiple NPUs:

```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
@@ -43,7 +43,7 @@ git clone https://gitcode.com/ascend-tribe/pangu-pro-moe-model.git

### Online Inference on Multi-NPU

Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:

```bash
vllm serve /path/to/pangu-pro-moe-model \

@@ -1,8 +1,8 @@
# Multi-NPU (QwQ 32B W8A8)

## Run docker container
## Run Docker Container
:::{note}
w8a8 quantization feature is supported by v0.8.4rc2 or higher
The w8a8 quantization feature is supported by v0.8.4rc2 and later.
:::

```{code-block} bash
@@ -28,7 +28,7 @@ docker run --rm \
-it $IMAGE bash
```

## Install modelslim and convert model
## Install modelslim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
@@ -53,8 +53,8 @@ SAVE_PATH=/home/models/QwQ-32B-w8a8
python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True
```

## Verify the quantized model
The converted model files looks like:
## Verify the Quantized Model
The converted model files look like:

```bash
.
@@ -68,10 +68,10 @@ The converted model files looks like:
`-- tokenizer_config.json
```

Run the following script to start the vLLM server with quantized model:
Run the following script to start the vLLM server with the quantized model:

:::{note}
The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. You can cherry-pick this commit for now.
:::

```bash
@@ -93,10 +93,10 @@ curl http://localhost:8000/v1/completions \
}'
```

Run the following script to execute offline inference on multi-NPU with quantized model:
Run the following script to execute offline inference on multi-NPU with the quantized model:

:::{note}
To enable quantization for ascend, quantization method must be "ascend"
To enable quantization for Ascend, the quantization method must be "ascend".
:::

```python

@@ -27,7 +27,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -41,13 +41,13 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

Run the following script to start the vLLM server on Multi-NPU:

For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32 GB of memory, tensor-parallel-size should be at least 4.

```bash
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
```
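
The sizing rule above can be sketched as a quick back-of-envelope check. This is an illustration only: `min_tensor_parallel_size` is a hypothetical helper, and the ~60 GB figure assumes BF16 weights for a 30B-parameter model (2 bytes per parameter), not vLLM's exact memory accounting.

```python
def min_tensor_parallel_size(card_memory_gib: float,
                             model_memory_gib: float = 60.0,
                             utilization: float = 0.9) -> int:
    """Smallest power-of-two card count whose combined usable memory
    holds the model weights (KV-cache headroom not counted)."""
    tp = 1
    while tp * card_memory_gib * utilization < model_memory_gib:
        tp *= 2
    return tp

# Matches the guidance above for Qwen3-30B-A3B:
print(min_tensor_parallel_size(64))  # 2 (64 GB cards -> TP >= 2)
print(min_tensor_parallel_size(32))  # 4 (32 GB cards -> TP >= 4)
```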

Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

@@ -1,7 +1,7 @@
# Multi-NPU (Qwen3-Next)

```{note}
The Qwen3 Next are using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes around stability, accuracy and performance improvement.
Qwen3 Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.
```

## Run vllm-ascend on Multi-NPU with Qwen3 Next
@@ -31,7 +31,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -41,7 +41,7 @@ export VLLM_USE_MODELSCOPE=True
### Install Triton Ascend

:::::{tab-set}
::::{tab-item} Linux (aarch64)
::::{tab-item} Linux (AArch64)

[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required to run Qwen3 Next; please follow the instructions below to install it and its dependencies.

@@ -72,7 +72,7 @@ Coming soon ...

### Inference on Multi-NPU

Please make sure you already executed the command:
Please make sure you have already executed the command:

```bash
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
@@ -81,15 +81,15 @@ source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
:::::{tab-set}
::::{tab-item} Online Inference

Run the following script to start the vLLM server on Multi-NPU:
Run the following script to start the vLLM server on multi-NPU:

For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 4, and for 32GB of memory, tensor-parallel-size should be at least 8.
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 4, and for 32 GB of memory, tensor-parallel-size should be at least 8.

```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```

Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

@@ -1,13 +1,13 @@
# Single Node (Atlas 300I series)
# Single Node (Atlas 300I Series)

```{note}
1. This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage, performance improvement.
2. Currently, the 310I series only supports eager mode and the data type is float16.
3. There are some known issues for running vLLM on 310p series, you can refer to vllm-ascend [<u>#3316</u>](https://github.com/vllm-project/vllm-ascend/issues/3316),
[<u>#2795</u>](https://github.com/vllm-project/vllm-ascend/issues/2795), you can use v0.10.0rc1 version first.
1. This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes related to model coverage and performance improvement.
2. Currently, the 310I series only supports eager mode and the float16 data type.
3. There are some known issues when running vLLM on the 310P series; you can refer to vllm-ascend [<u>#3316</u>](https://github.com/vllm-project/vllm-ascend/issues/3316) and
[<u>#2795</u>](https://github.com/vllm-project/vllm-ascend/issues/2795). You can use v0.10.0rc1 first.
```

## Run vLLM on Altlas 300I series
## Run vLLM on Atlas 300I Series

Run docker container:

@@ -38,7 +38,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -50,7 +50,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

### Online Inference on NPU

Run the following script to start the vLLM server on NPU(Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards):
Run the following script to start the vLLM server on NPU (Qwen3-0.6B: 1 card, Qwen2.5-7B-Instruct: 2 cards, Pangu-Pro-MoE-72B: 8 cards):

:::::{tab-set}
:sync-group: inference
@@ -170,7 +170,7 @@ vllm serve /home/pangu-pro-moe-mode/ \

```

Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.

```bash
export question="你是谁?"

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -164,7 +164,7 @@ vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
:::::

:::{note}
Add `--max_model_len` option to avoid ValueError that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26240). This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
Add the `--max_model_len` option to avoid a ValueError caused by the Qwen2.5-7B model's max seq len (32768) being larger than the maximum number of tokens that can be stored in the KV cache (26240). This will differ across NPU series based on the HBM size. Please modify it to a value suitable for your NPU series.
:::
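
The relationship in the note can be sketched numerically with the standard KV-cache size formula (2 tensors × layers × KV heads × head dim × bytes per element, per token). The Qwen2.5-7B shape constants below (28 layers, 4 KV heads, head dim 128, BF16) come from the public model config, and the free-memory figure is a hypothetical example, so treat the result as a rough estimate rather than vLLM's exact accounting.

```python
def kv_cache_tokens(free_hbm_gib: float, num_layers: int, num_kv_heads: int,
                    head_dim: int, dtype_bytes: int = 2) -> int:
    """Rough number of tokens the KV cache can hold in the given memory."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(free_hbm_gib * 1024**3 // bytes_per_token)

# Qwen2.5-7B: 28 layers, 4 KV heads (GQA), head_dim 128, BF16.
# With ~1.4 GiB left over for KV cache, the capacity lands near the 26240
# tokens mentioned in the note; substitute your own free-memory figure.
print(kv_cache_tokens(1.4, num_layers=28, num_kv_heads=4, head_dim=128))
```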

If your service starts successfully, you can see the info shown below:

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -119,4 +119,4 @@ The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little La

### Online Serving on Single NPU

Currently, vllm's OpenAI-compatible server doesn't support audio inputs, find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).
Currently, vLLM's OpenAI-compatible server doesn't support audio inputs. Find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -149,10 +149,10 @@ vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
```

:::{note}
Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-7B-Instruct model's max seq len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
Add the `--max_model_len` option to avoid a ValueError caused by the Qwen2.5-VL-7B-Instruct model's max seq len (128000) being larger than the maximum number of tokens that can be stored in the KV cache. This will differ across NPU series based on the HBM size. Please modify it to a value suitable for your NPU series.
:::

If your service start successfully, you can see the info shown below:
If your service starts successfully, you can see the info shown below:

```bash
INFO:     Started server process [2736]

@@ -2,9 +2,9 @@

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of vLLM Ascend support the model.

## Run docker container
## Run Docker Container

Take Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
Using the Qwen3-Embedding-8B model as an example, first run the docker container with the following command:

```{code-block} bash
:substitutions:
@@ -26,7 +26,7 @@ docker run --rm \
-it $IMAGE bash
```

Setup environment variables:
Set up environment variables:

```bash
# Load model from ModelScope to speed up download
@@ -42,7 +42,7 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
vllm serve Qwen/Qwen3-Embedding-8B --task embed
```

Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{

@@ -1,8 +1,8 @@
# Single-NPU (Qwen3 8B W4A8)

## Run docker container
## Run Docker Container
:::{note}
w4a8 quantization feature is supported by v0.9.1rc2 or higher
The w4a8 quantization feature is supported by v0.9.1rc2 and later.
:::

```{code-block} bash
@@ -25,7 +25,7 @@ docker run --rm \
-it $IMAGE bash
```

## Install modelslim and convert model
## Install modelslim and Convert Model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
@@ -65,8 +65,8 @@ python quant_qwen.py \
--w_method HQQ
```

## Verify the quantized model
The converted model files looks like:
## Verify the Quantized Model
The converted model files look like:

```bash
.
@@ -84,13 +84,13 @@ The converted model files looks like:
`-- tokenizer_config.json
```

Run the following script to start the vLLM server with quantized model:
Run the following script to start the vLLM server with the quantized model:

```bash
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```

Once your server is started, you can query the model with input prompts
Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/completions \
@@ -105,10 +105,10 @@ curl http://localhost:8000/v1/completions \
}'
```
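
The curl request above can also be built and sent from Python's standard library. This is a sketch: the prompt and sampling values below are illustrative, and `qwen3-8b-w4a8` assumes the `--served-model-name` used when the server was started.

```python
import json

# Build the /v1/completions request body; field values are examples only.
payload = {
    "model": "qwen3-8b-w4a8",
    "prompt": "The capital of France is",
    "max_tokens": 32,
    "temperature": 0,
}
body = json.dumps(payload)
print(body)

# With the server running, send it via the standard library:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```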

Run the following script to execute offline inference on Single-NPU with quantized model:
Run the following script to execute offline inference on single-NPU with the quantized model:

:::{note}
To enable quantization for ascend, quantization method must be "ascend"
To enable quantization for Ascend, the quantization method must be "ascend".
:::

```python