init v0.11.0rc0
@@ -22,6 +22,8 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:

| vLLM Ascend | vLLM              | Python         | Stable CANN | PyTorch/torch_npu               | MindIE Turbo |
|-------------|-------------------|----------------|-------------|---------------------------------|--------------|
| v0.11.0rc0  | v0.11.0rc3        | >= 3.9, < 3.12 | 8.2.RC1     | 2.7.1 / 2.7.1.dev20250724       |              |
| v0.10.2rc1  | v0.10.2           | >= 3.9, < 3.12 | 8.2.RC1     | 2.7.1 / 2.7.1.dev20250724       |              |
| v0.10.1rc1  | v0.10.1/v0.10.1.1 | >= 3.9, < 3.12 | 8.2.RC1     | 2.7.1 / 2.7.1.dev20250724       |              |
| v0.10.0rc1  | v0.10.0           | >= 3.9, < 3.12 | 8.2.RC1     | 2.7.1 / 2.7.1.dev20250724       |              |
| v0.9.2rc1   | v0.9.2            | >= 3.9, < 3.12 | 8.1.RC1     | 2.5.1 / 2.5.1.post1.dev20250619 |              |
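Each row of the matrix is a matched pair: the vLLM build listed is the one the vllm-ascend release was validated against. A minimal sketch for pinning the v0.11.0rc0 row (the commented `pip install` line is illustrative only; follow the installation guide for the supported procedure):

```shell
# Matched pair taken from the v0.11.0rc0 row of the compatibility matrix.
VLLM_VERSION="0.11.0rc3"
VLLM_ASCEND_VERSION="0.11.0rc0"
# Illustrative only -- see the installation guide for the full procedure:
# pip install "vllm==${VLLM_VERSION}" "vllm-ascend==${VLLM_ASCEND_VERSION}"
echo "vllm==${VLLM_VERSION} vllm-ascend==${VLLM_ASCEND_VERSION}"
```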
@@ -42,6 +44,8 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:

| Date       | Event                          |
|------------|--------------------------------|
| 2025.09.30 | Release candidates, v0.11.0rc0 |
| 2025.09.16 | Release candidates, v0.10.2rc1 |
| 2025.09.04 | Release candidates, v0.10.1rc1 |
| 2025.09.03 | v0.9.1 Final release           |
| 2025.08.22 | Release candidates, v0.9.1rc3  |
@@ -65,19 +65,19 @@ myst_substitutions = {
     # the branch of vllm, used in vllm clone
     # - main branch: 'main'
     # - vX.Y.Z branch: 'vX.Y.Z'
-    'vllm_version': 'v0.10.1.1',
+    'vllm_version': 'v0.11.0rc3',
     # the branch of vllm-ascend, used in vllm-ascend clone and image tag
     # - main branch: 'main'
     # - vX.Y.Z branch: latest vllm-ascend release tag
-    'vllm_ascend_version': 'v0.10.1rc1',
+    'vllm_ascend_version': 'v0.11.0rc0',
     # the newest release version of vllm-ascend and matched vLLM, used in pip install.
     # This value should be updated when a release is cut.
-    'pip_vllm_ascend_version': "0.10.1rc1",
-    'pip_vllm_version': "0.10.1.1",
+    'pip_vllm_ascend_version': "0.11.0rc0",
+    'pip_vllm_version': "0.11.0",
     # CANN image tag
     'cann_image_tag': "8.2.rc1-910b-ubuntu22.04-py3.11",
     # vllm version in ci
-    'ci_vllm_version': 'v0.10.1.1',
+    'ci_vllm_version': 'v0.11.0rc3',
 }

 # Add any paths that contain templates here, relative to this directory.
@@ -0,0 +1,20 @@
# deepseek-ai/DeepSeek-V2-Lite

- **vLLM Version**: vLLM: 0.10.1.1 ([1da94e6](https://github.com/vllm-project/vllm/commit/1da94e6)), **vLLM Ascend Version**: v0.10.1rc1 ([7e16b4a](https://github.com/vllm-project/vllm-ascend/commit/7e16b4a))
- **Software Environment**: **CANN**: 8.2.RC1, **PyTorch**: 2.7.1, **torch-npu**: 2.7.1.dev20250724
- **Hardware Environment**: Atlas A2 Series
- **Parallel mode**: TP2
- **Execution mode**: ACLGraph

**Command**:

```bash
export MODEL_ARGS='pretrained=deepseek-ai/DeepSeek-V2-Lite,tensor_parallel_size=2,dtype=auto,trust_remote_code=True,max_model_len=4096,enforce_eager=True'
lm_eval --model vllm --model_args $MODEL_ARGS --tasks gsm8k \
  --batch_size auto
```

| Task  | Metric                       | Value    | Stderr   |
|-------|------------------------------|---------:|---------:|
| gsm8k | exact_match,strict-match     | ✅0.3813 | ± 0.0134 |
| gsm8k | exact_match,flexible-extract | ✅0.3836 | ± 0.0134 |
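Small run-to-run variation is expected when reproducing these numbers. A simple sketch (our own convention, not part of lm_eval) for checking a re-run against the reported value within two standard errors; the `MEASURED` value is a hypothetical re-run result:

```shell
# Reported gsm8k strict-match value and stderr from the table above.
EXPECTED=0.3813
STDERR=0.0134
MEASURED=0.3790   # hypothetical re-run result
# Accept the re-run if |measured - expected| <= 2 * stderr.
RESULT=$(awk -v m="$MEASURED" -v e="$EXPECTED" -v s="$STDERR" \
  'BEGIN { d = m - e; if (d < 0) d = -d;
           if (d <= 2 * s) print "within tolerance"; else print "outside tolerance" }')
echo "$RESULT"
```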
@@ -0,0 +1,19 @@
# Qwen/Qwen2.5-VL-7B-Instruct

- **vLLM Version**: vLLM: 0.10.1.1 ([1da94e6](https://github.com/vllm-project/vllm/commit/1da94e6)), **vLLM Ascend Version**: v0.10.1rc1 ([7e16b4a](https://github.com/vllm-project/vllm-ascend/commit/7e16b4a))
- **Software Environment**: **CANN**: 8.2.RC1, **PyTorch**: 2.7.1, **torch-npu**: 2.7.1.dev20250724
- **Hardware Environment**: Atlas A2 Series
- **Parallel mode**: TP1
- **Execution mode**: ACLGraph

**Command**:

```bash
export MODEL_ARGS='pretrained=Qwen/Qwen2.5-VL-7B-Instruct,tensor_parallel_size=1,dtype=auto,trust_remote_code=False,max_model_len=8192'
lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks mmmu_val \
  --apply_chat_template True --fewshot_as_multiturn True --batch_size auto
```

| Task     | Metric   | Value  | Stderr   |
|----------|----------|-------:|---------:|
| mmmu_val | acc,none | ✅0.52 | ± 0.0162 |
@@ -0,0 +1,21 @@
# Qwen/Qwen3-30B-A3B

- **vLLM Version**: vLLM: 0.10.1.1 ([1da94e6](https://github.com/vllm-project/vllm/commit/1da94e6)), **vLLM Ascend Version**: v0.10.1rc1 ([7e16b4a](https://github.com/vllm-project/vllm-ascend/commit/7e16b4a))
- **Software Environment**: **CANN**: 8.2.RC1, **PyTorch**: 2.7.1, **torch-npu**: 2.7.1.dev20250724
- **Hardware Environment**: Atlas A2 Series
- **Parallel mode**: TP2 + EP
- **Execution mode**: ACLGraph

**Command**:

```bash
export MODEL_ARGS='pretrained=Qwen/Qwen3-30B-A3B,tensor_parallel_size=2,dtype=auto,trust_remote_code=False,max_model_len=4096,gpu_memory_utilization=0.6,enable_expert_parallel=True'
lm_eval --model vllm --model_args $MODEL_ARGS --tasks gsm8k,ceval-valid \
  --num_fewshot 5 --batch_size auto
```

| Task        | Metric                       | Value    | Stderr   |
|-------------|------------------------------|---------:|---------:|
| gsm8k       | exact_match,strict-match     | ✅0.8923 | ± 0.0085 |
| gsm8k       | exact_match,flexible-extract | ✅0.8506 | ± 0.0098 |
| ceval-valid | acc,none                     | ✅0.8358 | ± 0.0099 |
@@ -0,0 +1,21 @@
# Qwen/Qwen3-8B-Base

- **vLLM Version**: vLLM: 0.10.1.1 ([1da94e6](https://github.com/vllm-project/vllm/commit/1da94e6)), **vLLM Ascend Version**: v0.10.1rc1 ([7e16b4a](https://github.com/vllm-project/vllm-ascend/commit/7e16b4a))
- **Software Environment**: **CANN**: 8.2.RC1, **PyTorch**: 2.7.1, **torch-npu**: 2.7.1.dev20250724
- **Hardware Environment**: Atlas A2 Series
- **Parallel mode**: TP1
- **Execution mode**: ACLGraph

**Command**:

```bash
export MODEL_ARGS='pretrained=Qwen/Qwen3-8B-Base,tensor_parallel_size=1,dtype=auto,trust_remote_code=False,max_model_len=4096'
lm_eval --model vllm --model_args $MODEL_ARGS --tasks gsm8k,ceval-valid \
  --apply_chat_template True --fewshot_as_multiturn True --num_fewshot 5 --batch_size auto
```

| Task        | Metric                       | Value    | Stderr   |
|-------------|------------------------------|---------:|---------:|
| gsm8k       | exact_match,strict-match     | ✅0.8271 | ± 0.0104 |
| gsm8k       | exact_match,flexible-extract | ✅0.8294 | ± 0.0104 |
| ceval-valid | acc,none                     | ✅0.815  | ± 0.0103 |
@@ -3,4 +3,8 @@
:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
DeepSeek-V2-Lite
Qwen2.5-VL-7B-Instruct
Qwen3-30B-A3B
Qwen3-8B-Base
:::
@@ -61,7 +61,6 @@ from torch import nn
 from vllm.attention import Attention
 from vllm.config import VllmConfig
 from vllm.sequence import IntermediateTensors
-from vllm.model_executor.sampling_metadata import SamplingMetadata

 class CustomAttention(nn.Module):
     def __init__(self, vllm_config: VllmConfig, prefix: str):
@@ -3,7 +3,7 @@
## Version Specific FAQs

- [[v0.9.1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2643)
-- [[v0.10.1rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2630)
+- [[v0.11.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/3222)

## General FAQs
@@ -196,3 +196,21 @@ export ATB_LLM_LCOC_ENABLE=0
### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package (`pip install qwen-omni-utils`); it pulls in `librosa` and its related dependencies, which resolves the `ImportError: No module named 'librosa'` error and ensures audio processing works correctly.

### 20. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?

An example of the error in detail:

```
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph sizes capture fail: RuntimeError:
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient available streams to capture the configured number of sizes.Please verify both the availability of adequate streams and the appropriateness of the configured size count.
```

Recommended mitigation strategies:
1. Manually configure the `compilation_config` parameter with a reduced size set: `'{"cudagraph_capture_sizes": [size1, size2, size3, ...]}'`.
2. Use ACLgraph's full graph mode as an alternative to the piece-wise approach.

Root cause analysis:
The current stream requirement calculation for size capture only accounts for measurable factors: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). Numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside this calculation, which leads to stream resource exhaustion during size capture.
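As a sketch of mitigation 1, the reduced size set is passed as JSON through vLLM's compilation config (the model path and exact flag usage below are illustrative; the JSON key is the one named in the mitigation above):

```shell
# A smaller capture-size set requires fewer streams during ACLgraph capture.
CAPTURE_CONFIG='{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
# Illustrative launch (do not run as-is; substitute your model):
# vllm serve <model> --compilation-config "$CAPTURE_CONFIG"
echo "$CAPTURE_CONFIG"
```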
### 21. Does installing vllm-ascend overwrite the existing torch-npu package?
Yes. Installing vllm-ascend overwrites any torch-npu already present. If you need a specific version of torch-npu, manually install that version after installing vllm-ascend.
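For example, to pin torch-npu back to the version from the compatibility matrix after vllm-ascend has replaced it (the version string here is an example; pick the one you need):

```shell
# Re-pin torch-npu after vllm-ascend has overwritten it.
TORCH_NPU_VERSION="2.7.1.dev20250724"   # example; choose your required version
# pip install "torch-npu==${TORCH_NPU_VERSION}"
echo "torch-npu==${TORCH_NPU_VERSION}"
```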
@@ -11,6 +11,7 @@ This document describes how to install vllm-ascend manually.

| Software   | Supported version    | Note                                      |
|------------|----------------------|-------------------------------------------|
| Ascend HDK | Refer to [here](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/releasenote/releasenote_0000.html) | Required for CANN |
| CANN       | >= 8.2.RC1           | Required for vllm-ascend and torch-npu    |
| torch-npu  | >= 2.7.1.dev20250724 | Required for vllm-ascend; no need to install manually, it is installed automatically in the steps below |
| torch      | >= 2.7.1             | Required for torch-npu and vllm           |
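A quick way to check an installed version against the minimums above is a `sort -V` comparison (the minimum strings come from the table; the installed value below is a placeholder you would normally read from `pip show torch` or `torch.__version__`):

```shell
# Succeeds when INSTALLED >= MINIMUM, using version-aware ordering (sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

MIN_TORCH="2.7.1"
INSTALLED_TORCH="2.7.1"   # placeholder; e.g. python -c "import torch; print(torch.__version__)"
if version_ge "$INSTALLED_TORCH" "$MIN_TORCH"; then
  echo "torch ${INSTALLED_TORCH} satisfies >= ${MIN_TORCH}"
fi
```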
@@ -148,10 +148,6 @@ msgid ""
 " to be passed in."
 msgstr "在为MOE模型使用专家负载均衡时,需要传入专家映射路径。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`chunked_prefill_for_mla`"
-msgstr "`chunked_prefill_for_mla`"
-
 #: ../../user_guide/configuration/additional_config.md
 msgid "`False`"
 msgstr "`False`"

@@ -199,8 +195,8 @@ msgid ""
 msgstr "是否将MLA的向量操作放到另一个流中。此选项仅对使用MLA的模型(例如,DeepSeek)有效。"

 #: ../../user_guide/configuration/additional_config.md
-msgid "`enable_multistream_moe`"
-msgstr "`enable_multistream_moe`"
+msgid "`multistream_overlap_shared_expert`"
+msgstr "`multistream_overlap_shared_expert`"

 #: ../../user_guide/configuration/additional_config.md
 msgid ""
@@ -8,6 +8,7 @@ single_npu_multimodal
 single_npu_audio
 single_npu_qwen3_embedding
 single_npu_qwen3_quantization
+multi_npu_qwen3_next
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe

@@ -15,4 +16,7 @@ multi_npu_quantization
 single_node_300i
 multi_node
 multi_node_kimi
+multi_node_qwen3vl
+multi_node_pd_disaggregation
+multi_node_ray
 :::
docs/source/tutorials/multi_node_pd_disaggregation.md (new file, 244 lines)
@@ -0,0 +1,244 @@
# Prefill-Decode Disaggregation Verification (Qwen)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks through the steps to verify these features with constrained resources.

We take the Qwen3-30B-A3B model as an example and use vllm-ascend v0.10.1rc1 (with vLLM v0.10.1.1) on 3 Atlas 800T A2 servers to deploy the "1P2D" architecture. Assume the IP of the prefiller server is 192.0.0.1, and the decoder servers are 192.0.0.2 (decoder 1) and 192.0.0.3 (decoder 2). On each server, 2 NPUs are used to deploy one service instance.
## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The physical machines must be located on the same LAN, with network connectivity.
- All NPUs must be interconnected. Intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.

### Verification Process

1. Single Node Verification:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g; done
# View NPU network configuration
cat /etc/hccn.conf
```

2. Get NPU IP Addresses:

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g; done
```

3. Cross-Node PING Test:

```bash
# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
```
## Generate Ranktable

The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands for reference.

```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip> <decoder_node1_local_ip> <decoder_node2_local_ip> \
  --npus-per-node <npu_clips> --network-card-name <nic_name> --prefill-device-cnt <prefiller_npu_clips> --decode-device-cnt <decode_npu_clips> \
  [--local-device-ids <id_1>,<id_2>,<id_3>...]
```

Assume that we use devices 0 and 1 on the prefiller server node and devices 6 and 7 on both decoder server nodes. Take the following commands as an example. (`--local-device-ids` is necessary if you specify certain NPU devices on the local server.)

```shell
# On the prefiller node
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 0,1

# On decoder 1
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7

# On decoder 2
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 192.0.0.2 192.0.0.3 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 2 --decode-device-cnt 4 --local-device-ids 6,7
```
The rank table will be generated at /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json.

| Parameter | Meaning |
| --- | --- |
| --ips | Each node's local IP (prefiller nodes must come before decoder nodes) |
| --npus-per-node | Number of NPU chips on each node |
| --network-card-name | The physical machine's NIC |
| --prefill-device-cnt | Number of NPU chips used for prefill |
| --decode-device-cnt | Number of NPU chips used for decode |
| --local-device-ids | Optional. Not needed if all devices on the local node are used. |
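A useful sanity check before generating the rank table: the total device count contributed by the `--ips` nodes must equal `--prefill-device-cnt` plus `--decode-device-cnt`. A sketch using the 1P2D numbers from this guide:

```shell
# 1P2D example from this guide: 3 nodes, 2 NPUs used per node.
NODES=3
NPUS_PER_NODE=2
PREFILL_DEVICE_CNT=2
DECODE_DEVICE_CNT=4

TOTAL=$((NODES * NPUS_PER_NODE))
if [ "$TOTAL" -eq $((PREFILL_DEVICE_CNT + DECODE_DEVICE_CNT)) ]; then
  echo "device counts consistent: ${TOTAL}"
else
  echo "mismatch: ${TOTAL} devices vs $((PREFILL_DEVICE_CNT + DECODE_DEVICE_CNT)) requested"
fi
```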
|
||||
|
||||
## Prefiller / Decoder Deployment

We can run the following scripts to launch a server on the prefiller and decoder nodes respectively.

:::::{tab-set}

::::{tab-item} Prefiller node

```shell
export HCCL_IF_IP=192.0.0.1  # node IP
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}' \
  --enforce-eager
```

::::
::::{tab-item} Decoder node 1

```shell
export HCCL_IF_IP=192.0.0.2  # node IP
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::
::::{tab-item} Decoder node 2

```shell
export HCCL_IF_IP=192.0.0.3  # node IP
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1

vllm serve /model/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 13700 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3-moe \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enabled":false, "enable_multistream_shared_expert":false}, "ascend_scheduler_config":{"enabled":true, "enable_chunked_prefill":false}}'
```

::::

:::::
## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
  --host 192.0.0.1 \
  --port 8080 \
  --prefiller-hosts 192.0.0.1 \
  --prefiller-port 13700 \
  --decoder-hosts 192.0.0.2 192.0.0.3 \
  --decoder-ports 13700 13700
```
## Verification

Check service health using the proxy server endpoint.

```shell
curl http://192.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-moe",
    "prompt": "Who are you?",
    "max_tokens": 100,
    "temperature": 0
  }'
```
docs/source/tutorials/multi_node_qwen3vl.md (new file, 156 lines)
@@ -0,0 +1,156 @@
# Multi-Node-DP (Qwen3-VL-235B-A22B)

## Verify Multi-Node Communication Environment

Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process).

## Run with docker

Assume you have two Atlas 800 A3 (64G*16) nodes (or 2 * A2) and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
  --name vllm-ascend \
  --net=host \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci8 \
  --device /dev/davinci9 \
  --device /dev/davinci10 \
  --device /dev/davinci11 \
  --device /dev/davinci12 \
  --device /dev/davinci13 \
  --device /dev/davinci14 \
  --device /dev/davinci15 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -p 8000:8000 \
  -it $IMAGE bash
```
Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
node0

```shell
#!/bin/sh
# Both values can be obtained via ifconfig;
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 2 \
  --api-server-count 2 \
  --data-parallel-size-local 1 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 13389 \
  --seed 1024 \
  --served-model-name qwen3vl \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8
```
|
||||
|
||||
node1

```shell
#!/bin/sh

nic_name="xxxx"
local_ip="xxxx"
node0_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 \
  --data-parallel-address $node0_ip \
  --data-parallel-rpc-port 13389 \
  --seed 1024 \
  --tensor-parallel-size 8 \
  --served-model-name qwen3vl \
  --max-num-seqs 16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --enable-expert-parallel \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8
```
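A missing interface variable typically surfaces only as a communication hang, so a small pre-flight check on each node can save time. This is a sketch; the exported values below are example placeholders standing in for your real NIC name and IP:

```shell
# Pre-flight: verify the multi-node communication variables are all set.
export HCCL_IF_IP="192.0.0.1"       # example value
export GLOO_SOCKET_IFNAME="eth0"    # example value
export TP_SOCKET_IFNAME="eth0"      # example value
export HCCL_SOCKET_IFNAME="eth0"    # example value

MISSING=0
for v in HCCL_IF_IP GLOO_SOCKET_IFNAME TP_SOCKET_IFNAME HCCL_SOCKET_IFNAME; do
  eval "val=\${$v}"
  if [ -z "$val" ]; then
    echo "missing: $v"
    MISSING=1
  fi
done
[ "$MISSING" -eq 0 ] && echo "all multi-node env vars set"
```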
If the service starts successfully, the following information will be displayed on node0:

```shell
INFO:     Started server process [44610]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [44611]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
Once your server is started, you can query the model with input prompts:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3vl",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
      ]}
    ]
  }'
```
docs/source/tutorials/multi_node_ray.md (new file, 182 lines)
@@ -0,0 +1,182 @@
# Multi-Node-Ray (Qwen/Qwen3-235B-A22B)

Multi-node inference is suitable for scenarios in which the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:

* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on multiple nodes**
## Verify Multi-Node Communication Environment

### Physical Layer Requirements:

* The physical machines must be located on the same LAN, with network connectivity.
* All NPUs are connected with optical modules, and the connection status must be normal.

### Verification Process:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g; done
# View NPU network configuration
cat /etc/hccn.conf
```

### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

#### 2. Cross-Node PING Test

```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
## Set Up and Start the Ray Cluster
|
||||
### Setting Up the Basic Container
|
||||
To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is recommended to use Docker images.
|
||||
|
||||
For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the master and worker nodes, with the `--net=host` option to enable proper network connectivity.
|
||||
|
||||
Below is the example container setup command, which should be executed on **all nodes** :
|
||||
|
||||
```{code-block} bash
|
||||
:substitutions:
|
||||
# Update the vllm-ascend image
|
||||
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
|
||||
export NAME=vllm-ascend
|
||||
|
||||
# Run the container using the defined variables
|
||||
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
|
||||
docker run --rm \
|
||||
--name $NAME \
|
||||
--net=host \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci1 \
|
||||
--device /dev/davinci2 \
|
||||
--device /dev/davinci3 \
|
||||
--device /dev/davinci4 \
|
||||
--device /dev/davinci5 \
|
||||
--device /dev/davinci6 \
|
||||
--device /dev/davinci7 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /path/to/shared/cache:/root/.cache \ # IMPORTANT: This must be a shared directory accessible by all nodes
|
||||
-it $IMAGE bash
|
||||
```

### Start Ray Cluster

After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.

Choose one machine as the head node and the others as worker nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).

Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.

Below are the commands for the head and worker nodes:

**Head node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
Updating the environment variables requires restarting the Ray cluster.
:::

```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```

**Worker node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::

```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
```

Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.

## Start the Online Inference Service in a Multi-Node Scenario

In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.

**You only need to run the vllm command on one node.**

To set up parallelism, the common practice is to set `tensor-parallel-size` to the number of NPUs per node, and `pipeline-parallel-size` to the number of nodes.

For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--distributed-executor-backend ray \
--pipeline-parallel-size 2 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--max-model-len 8192 \
--max-num-seqs 25 \
--served-model-name qwen \
--trust-remote-code \
--gpu-memory-utilization 0.9
```
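As a quick sanity check, the sizing rule above can be expressed as a small helper (`parallel_sizes` is a hypothetical name for illustration, not part of vLLM):

```python
def parallel_sizes(total_npus: int, npus_per_node: int) -> tuple[int, int]:
    # Common practice: tensor parallelism spans one node,
    # pipeline parallelism spans the nodes.
    if total_npus % npus_per_node != 0:
        raise ValueError("total_npus must be a multiple of npus_per_node")
    return npus_per_node, total_npus // npus_per_node

# 16 NPUs across 2 nodes: TP=8, PP=2
print(parallel_sizes(16, 8))  # (8, 2)
```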

Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--distributed-executor-backend ray \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--max-model-len 8192 \
--max-num-seqs 25 \
--served-model-name qwen \
--trust-remote-code \
--gpu-memory-utilization 0.9
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen",
        "prompt": "tell me how to sleep well",
        "max_tokens": 100,
        "temperature": 0
    }'
```
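The same completion request can be issued from Python. This is a minimal standard-library sketch; it assumes the server started above is reachable at `localhost:8000`:

```python
import json
import urllib.request

# Same request body as the curl example above
payload = {
    "model": "qwen",
    "prompt": "tell me how to sleep well",
    "max_tokens": 100,
    "temperature": 0,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Requires the vLLM server from this tutorial to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```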
156 docs/source/tutorials/multi_npu_qwen3_next.md Normal file
@@ -0,0 +1,156 @@

# Multi-NPU (Qwen3-Next)

```{note}
Qwen3 Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. In future versions, there may be behavioral changes around stability, accuracy, and performance.
```

## Run vllm-ascend on Multi-NPU with Qwen3 Next

Run the docker container:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend-qwen3 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

Set up environment variables:

```bash
# Load the model from ModelScope to speed up the download
export VLLM_USE_MODELSCOPE=True
```

### Install Triton Ascend

:::::{tab-set}
::::{tab-item} Linux (aarch64)

[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required to run Qwen3 Next. Follow the instructions below to install it and its dependencies.

Install the Ascend BiSheng toolkit:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run
chmod a+x Ascend-BiSheng-toolkit_aarch64.run
./Ascend-BiSheng-toolkit_aarch64.run --install
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```

Install Triton Ascend:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
```

::::

::::{tab-item} Linux (x86_64)

Coming soon ...

::::
:::::

### Inference on Multi-NPU

Make sure you have already executed the following command:

```bash
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```

:::::{tab-set}
::::{tab-item} Online Inference

Run the following command to start the vLLM server on Multi-NPU.

For an Atlas A2 with 64 GB of NPU memory per card, `tensor-parallel-size` should be at least 4; with 32 GB per card, it should be at least 8.

```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```
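The memory-based sizing guidance above can be sketched as a tiny helper (`min_tensor_parallel_size` is an illustrative name, and it assumes only the two card-memory tiers documented here):

```python
def min_tensor_parallel_size(npu_memory_gb: int) -> int:
    # Tutorial guidance for Atlas A2: 64 GB cards need TP >= 4,
    # 32 GB cards need TP >= 8.
    return 4 if npu_memory_gb >= 64 else 8

print(min_tensor_parallel_size(64))  # 4
print(min_tensor_parallel_size(32))  # 8
```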

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 32
}'
```

::::

::::{tab-item} Offline Inference

Run the following script to execute offline inference on multi-NPU:

```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

if __name__ == '__main__':
    prompts = [
        "Who are you?",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
    llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              distributed_executor_backend="mp",
              gpu_memory_utilization=0.7,
              max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```

If the script runs successfully, you should see output like the following:

```bash
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
```

::::
:::::
@@ -30,11 +30,18 @@ The following table lists the additional configuration options available in vLLM

| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf or ut/e2e test cases. |
| `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
| `enable_prefetch` | bool | `False` | Whether to enable weight prefetch. |
| `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, the kv cache dtype needs to be set. Currently only int8 is supported. |
| `enable_shared_expert_dp` | bool | `False` | Whether to run the shared expert in DP. This has better performance but consumes more memory. Currently only DeepSeek series models are supported. |
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on moe models with shared experts. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic eplb. |
| `num_iterations_eplb_update` | int | `400` | The number of forward iterations after which eplb update begins. |
| `gate_eplb` | bool | `False` | Whether to enable eplb only once. |
| `num_wait_worker_iterations` | int | `30` | The number of forward iterations after which the eplb worker finishes its cpu task. In our tests, the default value of 30 covers most cases. |
| `expert_map_record_path` | str | `None` | When dynamic eplb is completed, save the current expert load heatmap to the specified path. |
| `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |

The details of each config option are as follows:

@@ -45,8 +52,8 @@ The details of each config option are as follows:

| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE support torchair graph mode |
| `mode` | str | `None` | When using reduce-overhead mode for torchair, mode needs to be set |
| `enable_multistream_mla` | bool | `False` | Whether to put vector ops of MLA on another stream. This option only takes effect on models using MLA (e.g., DeepSeek). |
| `enable_multistream_moe` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on DeepSeek moe models. |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
| `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce input address refresh time during graph execution. |
| `use_cached_graph` | bool | `False` | Whether to use cached graph |
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
@@ -57,6 +64,10 @@ The details of each config option are as follows:

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine |
| `enable_pd_transfer` | bool | `False` | Whether to enable pd transfer. When enabled, decode starts only after prefill of all requests is done. This option only takes effect on offline inference. |
| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs of the decode phase when pd transfer is enabled. This option only takes effect when enable_pd_transfer is True. |
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | A request is considered long if the prompt is longer than this number of tokens. |

`ascend_scheduler_config` also supports the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to `ascend_scheduler_config` as well.

@@ -71,13 +82,15 @@ An example of additional configuration is as follows:

        "use_cached_graph": True,
        "graph_batch_sizes": [1, 2, 4, 8],
        "graph_batch_sizes_init": False,
        "enable_multistream_moe": False,
        "enable_kv_nz": False
    },
    "ascend_scheduler_config": {
        "enabled": True,
        "enable_chunked_prefill": True,
        "max_long_partial_prefills": 1,
        "long_prefill_token_threshold": 4096,
    },
    "multistream_overlap_shared_expert": True,
    "refresh": False,
}
```

94 docs/source/user_guide/feature_guide/eplb_swift_balancer.md Normal file
@@ -0,0 +1,94 @@

# Expert Load Balance (EPLB)

## Overview

Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.

## EPLB Effects

- Reduced Latency: Dynamically balances expert loads to minimize TTFT and TPOT by distributing workloads evenly across experts.
- Enhanced Throughput: Optimizes NPU utilization, increasing token generation speed under high-concurrency scenarios.
- Zero-Overhead Movement: Expert redistribution occurs asynchronously without interrupting ongoing inference requests.
- Adaptive Scaling: Automatically adjusts to workload fluctuations while maintaining stable performance.
- Fault Tolerance: Redundant expert placement ensures system resilience during hardware failures.

## How to Use EPLB

### Dynamic EPLB

Enable dynamic balancing with auto-tuned parameters. Adjust `num_iterations_eplb_update` and `num_wait_worker_iterations` based on workload patterns.

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--additional-config '{
    "dynamic_eplb": true,
    "num_iterations_eplb_update": 400,
    "gate_eplb": true,
    "num_wait_worker_iterations": 30
}'
```

### Static EPLB

#### Initial Setup (Record Expert Map)

Generate the initial expert distribution map using `expert_map_record_path`. This creates a baseline configuration for future deployments.

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--additional-config '{
    "expert_map_record_path": "/path/to/eplb.json",
    "init_redundancy_expert": 16,
    "dynamic_eplb": true,
    "num_iterations_eplb_update": 400,
    "gate_eplb": true,
    "num_wait_worker_iterations": 30
}'
```

#### Subsequent Deployments (Use Recorded Map)

Load the pre-recorded expert map for consistent performance. This avoids recalculating distributions at runtime.

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--additional-config '{
    "expert_map_path": "/path/to/eplb.json"
}'
```

## Critical Considerations

1. Parameter Tuning:
   - `num_iterations_eplb_update`: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
   - `num_wait_worker_iterations`: Should be >= 30 to avoid premature balancing during startup.
   - `init_redundancy_expert`: Must match the tensor-parallel size (e.g., 16 for 16 NPUs) to ensure sufficient redundancy.

2. Hardware Requirements:
   - Ensure all NPUs have identical memory capacity and compute capabilities.
   - Network bandwidth must support expert redistribution traffic (>= 10 Gbps recommended).

3. Model Compatibility:
   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22B) are compatible.
   - Verify that the model architecture supports dynamic expert routing via `--enable-expert-parallel`.

4. Gating Configuration:
   - When `gate_eplb` is true, validate that the gating mechanism can handle expert movement without routing errors.
   - Test with synthetic workloads before production deployment.

5. Monitoring & Validation:
   - Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
   - Use vllm monitor to detect imbalances during runtime.
   - Always verify the expert map JSON structure before loading (validate with jq or similar tools).

6. Startup Behavior:
   - Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
   - Avoid sudden traffic spikes during the warm-up phase.

7. Common Pitfalls:
   - Incorrect tensor-parallel-size vs. actual NPU count → causes resource underutilization.
   - Using `expert_map_path` without generating the map first → runtime errors.
   - Setting `init_redundancy_expert` > available NPUs → system failure.

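The JSON verification recommended in point 5 can be scripted. Since the exact schema of the recorded map is version-specific, this sketch (with the hypothetical helper `load_expert_map`) only checks that the file parses as a top-level JSON object before it is passed to `expert_map_path`:

```python
import json

def load_expert_map(path: str) -> dict:
    # Minimal structural check: the recorded map must at least be
    # a valid JSON object. Consult the generated file for the full schema.
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return data
```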
BIN docs/source/user_guide/feature_guide/images/eplb_img.png Normal file
Binary file not shown.
After Width: | Height: | Size: 55 KiB
@@ -10,4 +10,5 @@ quantization

sleep_mode
structured_output
lora
eplb_swift_balancer
:::

@@ -108,18 +108,19 @@ Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_2

### 3. When converting deepseek series models with modelslim, what should you pay attention to?

When using the weight generated by modelslim with the `--dynamic` parameter, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
When the mla portion of the weights uses `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.

The operation steps are as follows:

1. Search the CANN package directory in use, for example:
   find /usr/local/Ascend/ -name fusion_config.json

2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find; the location is as follows:
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find; the location is as follows:

```bash
{
    "Switch":{
        "GraphFusion":{
            "AddRmsNormDynamicQuantFusionPass":"off",
            "MultiAddRmsNormDynamicQuantFusionPass":"off",
```

@@ -1,5 +1,70 @@

# Release note

## v0.11.0rc0 - 2025.09.30

This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.

### Highlights

- DeepSeek V3.2 is supported now. [#3270](https://github.com/vllm-project/vllm-ascend/pull/3270)
- Qwen3-vl is supported now. [#3103](https://github.com/vllm-project/vllm-ascend/pull/3103)

### Core

- DeepSeek works with aclgraph now. [#2707](https://github.com/vllm-project/vllm-ascend/pull/2707)
- MTP works with aclgraph now. [#2932](https://github.com/vllm-project/vllm-ascend/pull/2932)
- EPLB is supported now. [#2956](https://github.com/vllm-project/vllm-ascend/pull/2956)
- Mooncake store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
- CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659)

### Other

- Qwen3-next is stable now. [#3007](https://github.com/vllm-project/vllm-ascend/pull/3007)
- Fixed many bugs introduced in v0.10.2 by Qwen3-next. [#2964](https://github.com/vllm-project/vllm-ascend/pull/2964) [#2781](https://github.com/vllm-project/vllm-ascend/pull/2781) [#3070](https://github.com/vllm-project/vllm-ascend/pull/3070) [#3113](https://github.com/vllm-project/vllm-ascend/pull/3113)
- The LoRA feature is back now. [#3044](https://github.com/vllm-project/vllm-ascend/pull/3044)
- The Eagle3 spec decode method is back now. [#2949](https://github.com/vllm-project/vllm-ascend/pull/2949)

## v0.10.2rc1 - 2025.09.16

This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.

### Highlights

- Add support for Qwen3 Next. Please note that the expert parallel and MTP features don't work with this release. We'll make them work soon. Follow the [official guide](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Add quantization support for aclgraph. [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)

### Core

- Aclgraph now works with the Ray backend. [#2589](https://github.com/vllm-project/vllm-ascend/pull/2589)
- MTP now works with token > 1. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
- Qwen2.5 VL now works with quantization. [#2778](https://github.com/vllm-project/vllm-ascend/pull/2778)
- Improved the performance with the async scheduler enabled. [#2783](https://github.com/vllm-project/vllm-ascend/pull/2783)
- Fixed the performance regression with non-MLA models when using the default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)

### Other
- The performance of w8a8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
- The performance of moe models is improved. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
- Fixed a resource limit error when applying speculative decoding and aclgraph. [#2472](https://github.com/vllm-project/vllm-ascend/pull/2472)
- Fixed the git config error in docker images. [#2746](https://github.com/vllm-project/vllm-ascend/pull/2746)
- Fixed the sliding window attention bug with prefill. [#2758](https://github.com/vllm-project/vllm-ascend/pull/2758)
- The official doc for Prefill Decode Disaggregation with Qwen3 is added. [#2751](https://github.com/vllm-project/vllm-ascend/pull/2751)
- The `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` env works again. [#2740](https://github.com/vllm-project/vllm-ascend/pull/2740)
- A new improvement for oproj in deepseek is added. Set `oproj_tensor_parallel_size` to enable this feature. [#2167](https://github.com/vllm-project/vllm-ascend/pull/2167)
- Fixed a bug where deepseek with torchair doesn't work as expected when `graph_batch_sizes` is set. [#2760](https://github.com/vllm-project/vllm-ascend/pull/2760)
- Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. [#2744](https://github.com/vllm-project/vllm-ascend/pull/2744)
- The performance of the Qwen3 dense model is improved with flashcomm_v1. Set `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1` and `VLLM_ASCEND_ENABLE_FLASHCOMM=1` to enable it. [#2779](https://github.com/vllm-project/vllm-ascend/pull/2779)
- The performance of the Qwen3 dense model is improved with the prefetch feature. Set `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` to enable it. [#2816](https://github.com/vllm-project/vllm-ascend/pull/2816)
- The performance of the Qwen3 MoE model is improved with a rope ops update. [#2571](https://github.com/vllm-project/vllm-ascend/pull/2571)
- Fixed the weight load error for the RLHF case. [#2756](https://github.com/vllm-project/vllm-ascend/pull/2756)
- Added a warm_up_atb step to speed up inference. [#2823](https://github.com/vllm-project/vllm-ascend/pull/2823)
- Fixed the aclgraph stream error for moe models. [#2827](https://github.com/vllm-project/vllm-ascend/pull/2827)

### Known issue
- The server will hang when running Prefill Decode Disaggregation with different TP sizes for P and D. It's fixed by a [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can cherry-pick this commit to fix the issue.
- The HBM usage of Qwen3 Next is higher than expected. It's a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we're working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable values based on your parallel config to avoid oom errors.
- We notice that lora doesn't work with this release due to the refactor of kv cache. We'll fix it soon. [#2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
- Please do not enable chunked prefill with prefix cache when running with the Ascend scheduler. The performance and accuracy are not good/correct. [#2943](https://github.com/vllm-project/vllm-ascend/issues/2943)

## v0.10.1rc1 - 2025.09.04

This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.