[DOC] Update Qwen3.5 user guide (#7934)

This PR is cherry-picked from #7866. It updates the model user guide.

---------
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
Author: shaopeng-666
Date: 2026-04-02 22:09:00 +08:00 (committed by GitHub)
Parent: 85234d096d
Commit: 3218eb9fe1
2 changed files with 296 additions and 11 deletions


@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3.5-397B-A17B` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)
- `Qwen3.5-397B-A17B-w8a8` (quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp)
It is recommended to download the model weights to a directory shared across the nodes, such as `/root/.cache/`.
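For example, a minimal sketch of pulling the weights with the ModelScope CLI (the exact target path is an assumption; adjust it to your shared directory):
```shell
# Assumes the ModelScope CLI is available (pip install modelscope) and
# /root/.cache/ is the directory shared across nodes.
modelscope download --model Qwen/Qwen3.5-397B-A17B --local_dir /root/.cache/Qwen3.5-397B-A17B
```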
@@ -87,10 +87,10 @@ If you want to deploy multi-node environment, you need to set up environment on
### Single-node Deployment
`Qwen3.5-397B-A17B` can be deployed on 2 Atlas 800 A3(64G*16) or 4 Atlas 800 A2(64G*8).
`Qwen3.5-397B-A17B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16) or 2 Atlas 800 A2(64G*8), and must be started with the `--quantization ascend` parameter.
Run the following script to launch online 128k inference on 1 Atlas 800 A3(64G*16).
```shell
#!/bin/sh
@@ -98,10 +98,17 @@ Run the following script to execute online 128k inference.
export VLLM_USE_MODELSCOPE=true
# Reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Expand HCCL communication operators on the AI vector cores
export HCCL_OP_EXPANSION_MODE="AIV"
# HCCL communication buffer size (MB)
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=1
# Enable the torch_npu task-queue dispatch optimization
export TASK_QUEUE_ENABLE=1
# Pin the CPU governor to performance and tune kernel scheduling/memory knobs
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# [Optional] jemalloc for better allocator performance, if `libjemalloc.so` is installed
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host 0.0.0.0 \
@@ -112,15 +119,15 @@ vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--seed 1024 \
--quantization ascend \
--served-model-name qwen3.5 \
--max-num-seqs 128 \
--max-model-len 133000 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--speculative-config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}' \
--additional-config '{"enable_cpu_binding":true}' \
--async-scheduling
```
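Once the server reports `Application startup complete`, you can sanity-check it with a chat request. A minimal sketch, assuming the server listens on vLLM's default port 8000 (no `--port` flag appears in the snippet above):
```shell
# Smoke test against the OpenAI-compatible endpoint; port 8000 is an assumption.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```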
@@ -260,7 +267,283 @@ INFO: Application startup complete.
### Prefill-Decode Disaggregation
We recommend using Mooncake for deployment: [Mooncake](../features/pd_disaggregation_mooncake_multi_node.md).
Taking Atlas 800 A3 (64G × 16) as an example, we recommend deploying 1P1D (3 nodes) to run Qwen3.5-397B-A17B.
- `Qwen3.5-397B-A17B-w8a8-mtp 1P1D` requires 3 Atlas 800 A3 (64G × 16) nodes.
To run the vllm-ascend `Prefill-Decode Disaggregation` service, deploy the `run_p.sh`, `run_d0.sh`, and `run_d1.sh` scripts on their respective nodes, and deploy a `proxy.sh` script on the prefill master node to forward requests.
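The assumed 1P1D layout, summarized from the scripts in this section:
```shell
# Assumed 1P1D topology across 3 Atlas 800 A3 nodes (roles follow the scripts below):
#   Node 0 (prefill master): run_p.sh + proxy.sh   (prefill: dp_size 8,  tp_size 2)
#   Node 1 (decode master):  run_d0.sh             (decode:  dp ranks 0-7,  tp_size 2)
#   Node 2 (decode worker):  run_d1.sh --headless  (decode:  dp ranks 8-15, tp_size 2)
```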
1. Prefill Node 0 `run_p.sh` script
```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
# Obtain these via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxx"
local_ip="xxx"
# [Optional] jemalloc
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export VLLM_ENGINE_READY_TIMEOUT_S=30000
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000
export IP_ADDRESS=$local_ip
export NETWORK_CARD_NAME=$nic_name
export HCCL_IF_IP=$IP_ADDRESS
export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME
export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME
export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME
export VLLM_USE_V1=1
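# HCCL communication buffer size (MB); PD disaggregation benefits from a larger buffer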
export HCCL_BUFFSIZE=1536
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export HCCL_OP_EXPANSION_MODE="AIV"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host ${IP_ADDRESS} \
--port 30060 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size 8 \
--data-parallel-size-local 8 \
--api-server-count 1 \
--data-parallel-address ${IP_ADDRESS} \
--max-num-seqs 64 \
--data-parallel-rpc-port 6884 \
--tensor-parallel-size 2 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3.5 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--quantization ascend \
--no-disable-hybrid-kv-cache-manager \
--speculative-config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "23010",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 8,
"tp_size": 2
},
"decode": {
"dp_size": 16,
"tp_size": 2
}
}
}'
```
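Once the prefill instance is up, you can probe it. A minimal sketch, assuming vLLM's standard `/health` endpoint (the IP placeholder stands for the prefill node's `local_ip`):
```shell
# Port 30060 matches --port in run_p.sh above; the IP placeholder is an assumption.
curl http://<prefill_local_ip>:30060/health
```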
2. Decode Node 0 `run_d0.sh` script
```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
# Obtain these via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export VLLM_ENGINE_READY_TIMEOUT_S=30000
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000
export MASTER_IP_ADDRESS=$node0_ip
export IP_ADDRESS=$local_ip
export NETWORK_CARD_NAME=$nic_name
export HCCL_IF_IP=$IP_ADDRESS
export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME
export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME
export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1536
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export HCCL_OP_EXPANSION_MODE="AIV"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host ${IP_ADDRESS} \
--port 30050 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-start-rank 0 \
--api-server-count 1 \
--data-parallel-address ${MASTER_IP_ADDRESS} \
--max-num-seqs 32 \
--data-parallel-rpc-port 6884 \
--tensor-parallel-size 2 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3.5 \
--max-model-len 16384 \
--max-num-batched-tokens 128 \
--trust-remote-code \
--quantization ascend \
--no-disable-hybrid-kv-cache-manager \
--speculative-config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--gpu-memory-utilization 0.96 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_port": "36010",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 8,
"tp_size": 2
},
"decode": {
"dp_size": 16,
"tp_size": 2
}
}
}'
```
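`run_d1.sh` below is identical to `run_d0.sh` except for the fields that identify the second decode node:
```shell
# run_d1.sh vs run_d0.sh (differences only):
#   --headless                     # node 1 joins node 0's API server instead of starting its own
#   --data-parallel-start-rank 8   # owns dp ranks 8-15 (run_d0.sh starts at rank 0)
#   "engine_id": "2"               # unique engine id per instance (run_d0.sh uses "1")
#   ASCEND_RT_VISIBLE_DEVICES and --api-server-count are not set
```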
3. Decode Node 1 `run_d1.sh` script
```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
# Obtain these via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export VLLM_ENGINE_READY_TIMEOUT_S=30000
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000
export MASTER_IP_ADDRESS=$node0_ip
export IP_ADDRESS=$local_ip
export NETWORK_CARD_NAME=$nic_name
export HCCL_IF_IP=$IP_ADDRESS
export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME
export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME
export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1536
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export HCCL_OP_EXPANSION_MODE="AIV"
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host ${IP_ADDRESS} \
--port 30050 \
--headless \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-start-rank 8 \
--data-parallel-address ${MASTER_IP_ADDRESS} \
--max-num-seqs 32 \
--data-parallel-rpc-port 6884 \
--tensor-parallel-size 2 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3.5 \
--max-model-len 16384 \
--max-num-batched-tokens 128 \
--trust-remote-code \
--quantization ascend \
--no-disable-hybrid-kv-cache-manager \
--speculative-config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--gpu-memory-utilization 0.96 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_port": "36010",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 8,
"tp_size": 2
},
"decode": {
"dp_size": 16,
"tp_size": 2
}
}
}'
```
**Notice:**
The parameters are explained as follows:
- `--async-scheduling`: enables asynchronous scheduling. When Multi-Token Prediction (MTP) is enabled, operator dispatch can be scheduled asynchronously to overlap the dispatch latency.
- `cudagraph_capture_sizes`: the recommended values are `n x (mtp + 1)`, with `n` ranging from 1 up to `max-num-seqs`. Otherwise, set them to the batch sizes that occur most frequently on the Decode (D) node. A worked example follows this list.
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the KV cache on the decode node is insufficient, requests are sent back to the prefill node to recompute the KV cache. In the PD disaggregation scenario, it is recommended to enable this on both prefill and decode nodes.
- `no-enable-prefix-caching`: prefix caching is enabled by default; pass `--no-enable-prefix-caching` to disable it. Note that the Prefill-Decode disaggregation feature has a known issue on the D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)
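For example, with `num_speculative_tokens: 3` (so `mtp + 1 = 4`) and `--max-num-seqs 32`, the capture sizes are multiples of 4 up to 128. A minimal sketch, assuming vLLM's `cudagraph_capture_sizes` field of the compilation config:
```shell
# Assumption: capture sizes n x (mtp + 1), with mtp = 3 and n in {1, 2, 4, 8, 16, 32}
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [4, 8, 16, 32, 64, 128]}'
```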
4. Run the `proxy.sh` script on the prefill master node
Run a proxy server on the same node as the prefiller service instance. You can find the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
if [[ "$offset" == "" ]]; then
    offset=0
fi
python3 load_balance_proxy_layerwise_server_example.py \
--prefiller-hosts 141.xx.xx.1 \
--prefiller-ports 30060 \
--decoder-hosts 141.xx.xx.2 \
--decoder-ports 30050 \
--host 141.xx.xx.1 \
--port 8010
```
```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```
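With the proxy running, clients send requests to its port (8010 above), and the proxy forwards them to the P and D instances. A minimal end-to-end smoke test, reusing the placeholder host from the script:
```shell
curl http://141.xx.xx.1:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32
  }'
```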
## Functional Verification


@@ -77,6 +77,8 @@ Get the latest info here: <https://github.com/vllm-project/vllm-ascend/issues/16
| Qwen2.5-VL | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ |||| ✅ | ✅ | ✅ | 30k || [Qwen-VL-Dense](../../tutorials/models/Qwen-VL-Dense.md) |
| Qwen3-VL | ✅ | ||A2/A3|||||||✅|||||✅|✅||| [Qwen-VL-Dense](../../tutorials/models/Qwen-VL-Dense.md) |
| Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-MOE](../../tutorials/models/Qwen3-VL-235B-A22B-Instruct.md)|
| Qwen3.5-397B-A17B | ✅ | |✅| A2/A3 |✅|✅|✅||✅|✅|✅||✅|✅|✅|✅|✅|1010000|| [Qwen3.5-397B-A17B](../../tutorials/models/Qwen3.5-397B-A17B.md) |
| Qwen3.5-27B | ✅ | |✅| A2/A3 |✅|✅|✅||✅|✅|✅||✅|✅|✅|✅|✅|1010000|| [Qwen3.5-27B](../../tutorials/models/Qwen3.5-27B.md) |
| Qwen3-Omni-30B-A3B-Thinking | 🔵 | ||A2/A3|||||||✅||✅|||||||[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md)|
| Qwen2.5-Omni | 🔵 | || A2/A3 |||||||||||||||| [Qwen2.5-Omni](../../tutorials/models/Qwen2.5-Omni.md) |