[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)
What this PR does / why we need it?

This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow.

Does this PR introduce any user-facing change?

No, this PR contains documentation-only updates.

How was this patch tested?

The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -27,7 +27,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
 - `DeepSeek-V3.1-w8a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot).
 - `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
-- `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.
+- `Quantization method`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use this method to quantize the model.

 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -264,391 +264,391 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl

2. Prefill Node 0 `run_dp_template.sh` script

```shell
# Obtain these values through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.1"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can enable it.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --max-model-len 65536 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enforce-eager \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
    --additional-config '{"recompute_scheduler_enable":true}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_producer",
      "kv_port": "30000",
      "engine_id": "0",
      "kv_connector_extra_config": {
        "prefill": {
          "dp_size": 2,
          "tp_size": 8
        },
        "decode": {
          "dp_size": 32,
          "tp_size": 1
        }
      }
    }'
```
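
The template above reads seven positional parameters: `$1` sets `ASCEND_RT_VISIBLE_DEVICES`, and `$2` to `$7` feed `--port`, `--data-parallel-size`, `--data-parallel-rank`, `--data-parallel-address`, `--data-parallel-rpc-port`, and `--tensor-parallel-size`. The following is a minimal sketch of how one data-parallel rank could be started by hand. It assumes the template is saved as `run_dp_template.sh` in the current directory; the device list and rank values are illustrative only, since in this tutorial `launch_online_dp.py` is expected to supply them.

```shell
# Illustrative values only; launch_online_dp.py normally supplies these.
devices="0,1,2,3,4,5,6,7"   # $1 -> ASCEND_RT_VISIBLE_DEVICES
port=7100                   # $2 -> --port
dp_size=2                   # $3 -> --data-parallel-size
dp_rank=0                   # $4 -> --data-parallel-rank
dp_address="141.xx.xx.1"    # $5 -> --data-parallel-address
dp_rpc_port=12321           # $6 -> --data-parallel-rpc-port
tp_size=8                   # $7 -> --tensor-parallel-size

# Print the command instead of running it, so the mapping can be checked first.
cmd="bash run_dp_template.sh ${devices} ${port} ${dp_size} ${dp_rank} ${dp_address} ${dp_rpc_port} ${tp_size}"
echo "${cmd}"
```

Printing the command first makes it easy to confirm that each value lands in the intended position before any NPU is claimed.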

3. Prefill Node 1 `run_dp_template.sh` script

```shell
# Obtain these values through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.2"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can enable it.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --max-model-len 65536 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enforce-eager \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
    --additional-config '{"recompute_scheduler_enable":true}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_producer",
      "kv_port": "30100",
      "engine_id": "1",
      "kv_connector_extra_config": {
        "prefill": {
          "dp_size": 2,
          "tp_size": 8
        },
        "decode": {
          "dp_size": 32,
          "tp_size": 1
        }
      }
    }'
```

4. Decode Node 0 `run_dp_template.sh` script

```shell
# Obtain these values through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.3"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can enable it.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1100
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --max-model-len 65536 \
    --max-num-batched-tokens 256 \
    --max-num-seqs 28 \
    --trust-remote-code \
    --gpu-memory-utilization 0.92 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --async-scheduling \
    --speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
    --additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_consumer",
      "kv_port": "30200",
      "engine_id": "2",
      "kv_connector_extra_config": {
        "prefill": {
          "dp_size": 2,
          "tp_size": 8
        },
        "decode": {
          "dp_size": 32,
          "tp_size": 1
        }
      }
    }'
```

5. Decode Node 1 `run_dp_template.sh` script

```shell
# Obtain these values through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.4"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can enable it.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1100
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --max-model-len 65536 \
    --max-num-batched-tokens 256 \
    --max-num-seqs 28 \
    --trust-remote-code \
    --gpu-memory-utilization 0.92 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --async-scheduling \
    --speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
    --additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_consumer",
      "kv_port": "30200",
      "engine_id": "2",
      "kv_connector_extra_config": {
        "prefill": {
          "dp_size": 2,
          "tp_size": 8
        },
        "decode": {
          "dp_size": 32,
          "tp_size": 1
        }
      }
    }'
```

**Notice:**
The parameters are explained as follows:

- `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`: enables the communication optimization function on the prefill nodes.
- `VLLM_ASCEND_ENABLE_MLAPO=1`: enables the fusion operator, which can significantly improve performance but consumes more NPU memory. In the Prefill-Decode (PD) separation scenario, enable MLAPO only on decode nodes.
- `--async-scheduling`: enables asynchronous scheduling. When Multi-Token Prediction (MTP) is enabled, operator delivery can be scheduled asynchronously so that its latency is overlapped.
- `cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`, where the minimum is `n = 1` and the maximum is `n = max-num-seqs`. For the values in between, it is recommended to choose `n` to match the request counts that occur most frequently on the Decode (D) node.
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests are sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes.
- `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true` is set, an additional stream is enabled to overlap the computation of shared experts for improved efficiency.
- `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which reduces the computational load of each card on the LMHead embedding layer.

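
The `cudagraph_capture_sizes` rule can be checked with a few lines of arithmetic. The sketch below assumes that `mtp` in the formula `n x (mtp + 1)` equals `num_speculative_tokens` from `--speculative-config` (1 in the decode scripts) and that `max-num-seqs` is 28; it then verifies that every size used in the decode scripts has the recommended form.

```shell
mtp=1             # assumed to equal num_speculative_tokens in --speculative-config
max_num_seqs=28   # --max-num-seqs on the decode nodes
step=$(( mtp + 1 ))

# Smallest and largest recommended sizes: n = 1 and n = max-num-seqs.
min_size=$(( 1 * step ))
max_size=$(( max_num_seqs * step ))
echo "smallest=${min_size} largest=${max_size}"

# Count configured sizes that are not a multiple of (mtp + 1) or fall outside the range.
bad=0
for size in 2 4 8 16 24 32 48 56; do
  if [ $(( size % step )) -ne 0 ] || [ "$size" -lt "$min_size" ] || [ "$size" -gt "$max_size" ]; then
    echo "unexpected capture size: ${size}"
    bad=$(( bad + 1 ))
  fi
done
echo "sizes outside the rule: ${bad}"
```

With these inputs the range runs from 2 to 56, which matches the first and last entries of the list in the decode scripts.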

6. Run the server on each node:

```shell
# p0
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.1 --dp-rpc-port 12321 --vllm-start-port 7100
# p1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.2 --dp-rpc-port 12321 --vllm-start-port 7100
# d0
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
# d1
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
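
The `--dp-rank-start` values for the decode nodes follow from `--dp-size` and `--dp-size-local`. The sketch below reproduces that arithmetic. It assumes that each node hosts `--dp-size-local` consecutive ranks and that each local rank listens on its own port counting up from `--vllm-start-port`; the second assumption is inferred from the port lists passed to the proxy, not from the launcher's source.

```shell
dp_size=32            # --dp-size for the decode group
dp_size_local=16      # --dp-size-local: ranks hosted on each node
vllm_start_port=7100  # --vllm-start-port

node_count=$(( dp_size / dp_size_local ))
last_port=$(( vllm_start_port + dp_size_local - 1 ))

node=0
while [ "$node" -lt "$node_count" ]; do
  # Each node starts at the rank following the previous node's last rank.
  rank_start=$(( node * dp_size_local ))
  echo "d${node}: --dp-rank-start ${rank_start}, ports ${vllm_start_port}-${last_port}"
  node=$(( node + 1 ))
done
```

For the prefill nodes the same arithmetic gives a single node per group (`--dp-size 2`, `--dp-size-local 2`), which is why both `p0` and `p1` use `--dp-rank-start 0`.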

7. Run the `proxy.sh` script on the prefill master node

Run a proxy server on the same node as the prefiller service instance. You can find the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
    --port 1999 \
    --host 141.xx.xx.1 \
    --prefiller-hosts \
        141.xx.xx.1 \
        141.xx.xx.1 \
        141.xx.xx.2 \
        141.xx.xx.2 \
    --prefiller-ports \
        7100 7101 7100 7101 \
    --decoder-hosts \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.3 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
        141.xx.xx.4 \
    --decoder-ports \
        7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
        7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```

```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```
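
The 32 decoder host and port entries passed to the proxy follow a regular pattern, so they can be generated instead of typed out. This is only a sketch: the variable names are hypothetical, and it assumes 16 decoder instances per decode node on ports 7100 to 7115, as in the command above.

```shell
decoder_hosts=""
decoder_ports=""
for host in 141.xx.xx.3 141.xx.xx.4; do
  i=0
  while [ "$i" -lt 16 ]; do
    # One host entry and one port entry per decoder instance.
    decoder_hosts="${decoder_hosts} ${host}"
    decoder_ports="${decoder_ports} $(( 7100 + i ))"
    i=$(( i + 1 ))
  done
done

# Count the generated entries; word splitting on the unquoted lists is intentional.
set -- $decoder_hosts
host_count=$#
set -- $decoder_ports
port_count=$#
echo "hosts=${host_count} ports=${port_count}"
```

The two lists could then be passed unquoted after `--decoder-hosts` and `--decoder-ports`, so that each entry becomes a separate argument.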

## Functional Verification

@@ -704,7 +704,7 @@ The performance result is:

 Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.

-Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.

 There are three `vllm bench` subcommands:
