[DOC] Update Qwen3.5 user guide (#7934)
This PR is cherry-picked from #7866 and updates the model user guide. Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### Model Weight

- `Qwen3.5-397B-A17B` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)
- `Qwen3.5-397B-A17B-w8a8` (quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp)

It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
### Single-node Deployment

`Qwen3.5-397B-A17B` can be deployed on 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes.
`Qwen3.5-397B-A17B-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes, and must be started with the `--quantization ascend` parameter.

Run the following script to start online 128k inference on 1 Atlas 800 A3 (64G × 16) node.

```shell
#!/bin/sh
export VLLM_USE_MODELSCOPE=true
# Reduce memory fragmentation and avoid running out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host 0.0.0.0 \
--seed 1024 \
--quantization ascend \
--served-model-name qwen3.5 \
--max-num-seqs 128 \
--max-model-len 133000 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"enable_cpu_binding":true}' \
--async-scheduling
```

### Prefill-Decode Disaggregation

We recommend using Mooncake for deployment; see [Prefill-Decode Disaggregation Mooncake Verification](../features/pd_disaggregation_mooncake_multi_node.md).

Taking Atlas 800 A3 (64G × 16) as an example, we recommend a 1P1D deployment (3 nodes) to run Qwen3.5-397B-A17B.

- `Qwen3.5-397B-A17B-w8a8-mtp 1P1D` requires 3 Atlas 800 A3 (64G × 16) nodes.

To run the vllm-ascend `Prefill-Decode Disaggregation` service, deploy the `run_p.sh`, `run_d0.sh` and `run_d1.sh` scripts on their respective nodes, and deploy a `proxy.sh` script on the prefill master node to forward requests.

1. Prefill Node 0 `run_p.sh` script

```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
# nic_name is the network interface name (shown by `ifconfig`) corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# [Optional] jemalloc improves performance; if `libjemalloc.so` is installed on your machine, you can enable it.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export VLLM_ENGINE_READY_TIMEOUT_S=30000
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000
export IP_ADDRESS=$local_ip
export NETWORK_CARD_NAME=$nic_name
export HCCL_IF_IP=$IP_ADDRESS
export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME
export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME
export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1536
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TASK_QUEUE_ENABLE=1

export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export HCCL_OP_EXPANSION_MODE="AIV"

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host ${IP_ADDRESS} \
--port 30060 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size 8 \
--data-parallel-size-local 8 \
--api-server-count 1 \
--data-parallel-address ${IP_ADDRESS} \
--max-num-seqs 64 \
--data-parallel-rpc-port 6884 \
--tensor-parallel-size 2 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3.5 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--quantization ascend \
--no-disable-hybrid-kv-cache-manager \
--speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "23010",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 8,
"tp_size": 2
},
"decode": {
"dp_size": 16,
"tp_size": 2
}
}
}'
```

2. Decode Node 0 `run_d0.sh` script

```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
# nic_name is the network interface name (shown by `ifconfig`) corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# node0_ip must match the local_ip set on node0 (the master node)
node0_ip="xxxx"

export VLLM_ENGINE_READY_TIMEOUT_S=30000
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000
export MASTER_IP_ADDRESS=$node0_ip
export IP_ADDRESS=$local_ip

export NETWORK_CARD_NAME=$nic_name

export HCCL_IF_IP=$IP_ADDRESS
export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME
export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME
export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME

export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1536
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TASK_QUEUE_ENABLE=1

export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export HCCL_OP_EXPANSION_MODE="AIV"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host ${IP_ADDRESS} \
--port 30050 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-start-rank 0 \
--api-server-count 1 \
--data-parallel-address ${MASTER_IP_ADDRESS} \
--max-num-seqs 32 \
--data-parallel-rpc-port 6884 \
--tensor-parallel-size 2 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3.5 \
--max-model-len 16384 \
--max-num-batched-tokens 128 \
--trust-remote-code \
--quantization ascend \
--no-disable-hybrid-kv-cache-manager \
--speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--gpu-memory-utilization 0.96 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_port": "36010",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 8,
"tp_size": 2
},
"decode": {
"dp_size": 16,
"tp_size": 2
}
}
}'
```

3. Decode Node 1 `run_d1.sh` script

```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy
# nic_name is the network interface name (shown by `ifconfig`) corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# node0_ip must match the local_ip set on node0 (the master node)
node0_ip="xxxx"

export VLLM_ENGINE_READY_TIMEOUT_S=30000
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000
export MASTER_IP_ADDRESS=$node0_ip
export IP_ADDRESS=$local_ip

export NETWORK_CARD_NAME=$nic_name

export HCCL_IF_IP=$IP_ADDRESS
export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME
export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME
export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME

export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1536
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TASK_QUEUE_ENABLE=1

export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export HCCL_OP_EXPANSION_MODE="AIV"
vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \
--host ${IP_ADDRESS} \
--port 30050 \
--headless \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size 16 \
--data-parallel-size-local 8 \
--data-parallel-start-rank 8 \
--data-parallel-address ${MASTER_IP_ADDRESS} \
--max-num-seqs 32 \
--data-parallel-rpc-port 6884 \
--tensor-parallel-size 2 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name qwen3.5 \
--max-model-len 16384 \
--max-num-batched-tokens 128 \
--trust-remote-code \
--quantization ascend \
--no-disable-hybrid-kv-cache-manager \
--speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
--additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--gpu-memory-utilization 0.96 \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_port": "36010",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 8,
"tp_size": 2
},
"decode": {
"dp_size": 16,
"tp_size": 2
}
}
}'
```
**Notice:**
The parameters are explained as follows:

- `--async-scheduling`: enables asynchronous scheduling. When Multi-Token Prediction (MTP) is enabled, operator dispatch can be scheduled asynchronously to overlap the dispatch latency.
- `cudagraph_capture_sizes`: the recommended values are `n x (mtp + 1)`, with `n` ranging from 1 to `max-num-seqs`. Otherwise, set them to the batch sizes that occur most frequently on the Decode (D) node.
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the KV cache on the decode node is insufficient, requests are sent back to the prefill node to recompute the KV cache. In the PD-disaggregation scenario, it is recommended to enable this option on both prefill and decode nodes.
- `--no-enable-prefix-caching`: prefix caching is enabled by default; use this parameter to disable it. Note: with the Prefill-Decode disaggregation feature there is a known issue on the D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)

4. Run the `proxy.sh` script on the prefill master node

Run a proxy server on the same node as the prefill service instance. You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
#!/bin/bash
unset ftp_proxy
unset https_proxy
unset http_proxy

if [[ "$offset" == "" ]]; then
    offset=0
fi

python3 load_balance_proxy_layerwise_server_example.py \
--prefiller-hosts 141.xx.xx.1 \
--prefiller-ports 30060 \
--decoder-hosts 141.xx.xx.2 \
--decoder-ports 30050 \
--host 141.xx.xx.1 \
--port 8010
```

```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```

## Functional Verification

| Qwen2.5-VL | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ |||| ✅ | ✅ | ✅ | 30k || [Qwen-VL-Dense](../../tutorials/models/Qwen-VL-Dense.md) |
| Qwen3-VL | ✅ | ||A2/A3|||||||✅|||||✅|✅||| [Qwen-VL-Dense](../../tutorials/models/Qwen-VL-Dense.md) |
| Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-MOE](../../tutorials/models/Qwen3-VL-235B-A22B-Instruct.md)|
| Qwen3.5-397B-A17B | ✅ | |✅| A2/A3 |✅|✅|✅||✅|✅|✅||✅|✅|✅|✅|✅|1010000|| [Qwen3.5-397B-A17B](../../tutorials/models/Qwen3.5-397B-A17B.md) |
| Qwen3.5-27B | ✅ | |✅| A2/A3 |✅|✅|✅||✅|✅|✅||✅|✅|✅|✅|✅|1010000|| [Qwen3.5-27B](../../tutorials/models/Qwen3.5-27B.md) |
| Qwen3-Omni-30B-A3B-Thinking | 🔵 | ||A2/A3|||||||✅||✅|||||||[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md)|
| Qwen2.5-Omni | 🔵 | || A2/A3 |||||||||||||||| [Qwen2.5-Omni](../../tutorials/models/Qwen2.5-Omni.md) |