From 06732dbf5b2e79396c041eaa3e46f06dbac30385 Mon Sep 17 00:00:00 2001
From: Zhu Yi Lin <116337067+GDzhu01@users.noreply.github.com>
Date: Fri, 26 Dec 2025 17:09:22 +0800
Subject: [PATCH] [Doc] update R1/V3.1 doc (#5383)

### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 docs to give a simple recipe for reproducing our latest performance on Atlas A3/A2 servers.

### Does this PR introduce any user-facing change?
No.

Signed-off-by: GDzhu01 <809721801@qq.com>
---
 docs/source/tutorials/DeepSeek-R1.md   | 16 +++++++++----
 docs/source/tutorials/DeepSeek-V3.1.md | 31 +++++++++++++++++---------
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/docs/source/tutorials/DeepSeek-R1.md b/docs/source/tutorials/DeepSeek-R1.md
index 32a3d6a3..1e9f7b4c 100644
--- a/docs/source/tutorials/DeepSeek-R1.md
+++ b/docs/source/tutorials/DeepSeek-R1.md
@@ -93,6 +93,7 @@ export GLOO_SOCKET_IFNAME=$nic_name
 export TP_SOCKET_IFNAME=$nic_name
 export HCCL_SOCKET_IFNAME=$nic_name
 export VLLM_ASCEND_ENABLE_MLAPO=1
+export VLLM_ASCEND_BALANCE_SCHEDULING=1
 export VLLM_USE_MODELSCOPE=True

 vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
@@ -104,21 +105,24 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
   --seed 1024 \
   --served-model-name deepseek_r1 \
   --enable-expert-parallel \
+  --async-scheduling \
   --max-num-seqs 16 \
   --max-model-len 16384 \
   --max-num-batched-tokens 4096 \
   --trust-remote-code \
   --gpu-memory-utilization 0.92 \
-  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
-  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+  --speculative-config '{"num_speculative_tokens":3,"method":"mtp"}' \
+  --compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
 ```

**Notice:** The parameters are explained as follows:

- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory.
It is therefore recommended to enable this option when sufficient NPU memory is available.
+- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling, which may increase output throughput and reduce TPOT under the v1 scheduler. However, TTFT may degrade in some scenarios, and this feature is not recommended for prefill-decode (PD) disaggregated deployments.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
- `--max-model-len` specifies the maximum context length, that is, the sum of input and output tokens for a single request. For performance testing with an input length of 3.5K and an output length of 1.5K, a value of `16384` is sufficient; for accuracy testing, however, set it to at least `35000`.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
+- If you use the w4a8 weights, more memory is left for the KV cache, so you can try raising the concurrency (for example, `--max-num-seqs`) to achieve higher throughput.
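Both `--speculative-config` and `--compilation-config` take JSON strings, which are easy to break when quoting them in a shell. As a quick sanity check (a sketch for illustration, not part of the serve command), the exact strings from the command above can be parsed before launching, so a quoting mistake fails immediately instead of at server startup:

```python
import json

# The JSON strings passed to the flags above; a malformed string would
# raise json.JSONDecodeError here rather than during server startup.
speculative = json.loads('{"num_speculative_tokens":3,"method":"mtp"}')
compilation = json.loads('{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}')

assert speculative["method"] == "mtp"
assert speculative["num_speculative_tokens"] == 3
# Capture sizes should be positive integers in ascending order.
sizes = compilation["cudagraph_capture_sizes"]
assert sizes == sorted(sizes) and all(s > 0 for s in sizes)
```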
:::: ::::{tab-item} DeepSeek-R1-W8A8 A2 series @@ -143,6 +147,7 @@ export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export VLLM_ASCEND_ENABLE_MLAPO=1 +export VLLM_ASCEND_BALANCE_SCHEDULING=1 export HCCL_INTRA_PCIE_ENABLE=1 export HCCL_INTRA_ROCE_ENABLE=0 export VLLM_USE_MODELSCOPE=True @@ -159,13 +164,14 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \ --seed 1024 \ --served-model-name deepseek_r1 \ --enable-expert-parallel \ + --async-scheduling \ --max-num-seqs 16 \ --max-model-len 16384 \ --max-num-batched-tokens 4096 \ --trust-remote-code \ --gpu-memory-utilization 0.94 \ --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \ - --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' + --compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' ``` **Node 1** @@ -187,6 +193,7 @@ export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export VLLM_ASCEND_ENABLE_MLAPO=1 +export VLLM_ASCEND_BALANCE_SCHEDULING=1 export HCCL_INTRA_PCIE_ENABLE=1 export HCCL_INTRA_ROCE_ENABLE=0 export VLLM_USE_MODELSCOPE=True @@ -205,13 +212,14 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \ --seed 1024 \ --served-model-name deepseek_r1 \ --enable-expert-parallel \ + --async-scheduling \ --max-num-seqs 16 \ --max-model-len 16384 \ --max-num-batched-tokens 4096 \ --trust-remote-code \ --gpu-memory-utilization 0.94 \ --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \ - --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' + --compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' ``` :::: diff --git a/docs/source/tutorials/DeepSeek-V3.1.md b/docs/source/tutorials/DeepSeek-V3.1.md index 8b82093e..dc2dc93a 100644 --- a/docs/source/tutorials/DeepSeek-V3.1.md +++ b/docs/source/tutorials/DeepSeek-V3.1.md @@ -23,9 +23,10 @@ 
Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur

## Environment Preparation

### Model Weight
-- `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1)
+- `DeepSeek-V3.1` (BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8` (Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
-- `DeepSeek-V3.1-w8a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
+- `DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
+- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot` (Quantized w4a8 version with mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Quantization method`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantize the model.
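The quantized weight cards above ask you to change `torch_dtype` from `float16` to `bfloat16` in the weight directory's `config.json`. A minimal sketch of doing that edit programmatically (the helper name `set_torch_dtype` and the throwaway demo file are ours; in a real deployment, point it at the `config.json` inside the downloaded weight directory):

```python
import json
import tempfile
from pathlib import Path

def set_torch_dtype(config_path, dtype="bfloat16"):
    """Rewrite the torch_dtype field of a model config.json in place."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["torch_dtype"] = dtype
    path.write_text(json.dumps(config, indent=2))
    return config

# Demo on a throwaway file; replace with the real weight directory's
# config.json (e.g. under /root/.cache/) on an actual deployment.
with tempfile.TemporaryDirectory() as d:
    demo = Path(d) / "config.json"
    demo.write_text(json.dumps({"torch_dtype": "float16"}))
    updated = set_torch_dtype(demo)
    assert updated["torch_dtype"] == "bfloat16"
```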
It is recommended to download the model weights to a directory shared across the nodes, such as `/root/.cache/`.
@@ -105,6 +106,7 @@ export GLOO_SOCKET_IFNAME=$nic_name
 export TP_SOCKET_IFNAME=$nic_name
 export HCCL_SOCKET_IFNAME=$nic_name
 export VLLM_ASCEND_ENABLE_MLAPO=1
+export VLLM_ASCEND_BALANCE_SCHEDULING=1
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

 vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
@@ -116,22 +118,25 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
+--async-scheduling \
--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
---speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
---compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+--speculative-config '{"num_speculative_tokens": 3, "method": "mtp"}' \
+--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
 ```

**Notice:** The parameters are explained as follows:

- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory. It is therefore recommended to enable this option when sufficient NPU memory is available.
+- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling, which may increase output throughput and reduce TPOT under the v1 scheduler. However, TTFT may degrade in some scenarios, and this feature is not recommended for prefill-decode (PD) disaggregated deployments.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
- `--max-model-len` specifies the maximum context length, that is, the sum of input and output tokens for a single request.
For performance testing with an input length of 3.5K and an output length of 1.5K, a value of `16384` is sufficient; for accuracy testing, however, set it to at least `35000`.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
+- If you use the w4a8 weights, more memory is left for the KV cache, so you can try raising the concurrency (for example, `--max-num-seqs`) to achieve higher throughput.

### Multi-node Deployment

@@ -169,6 +174,7 @@ export VLLM_USE_V1=1
 export HCCL_BUFFSIZE=200
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export VLLM_ASCEND_ENABLE_MLAPO=1
+export VLLM_ASCEND_BALANCE_SCHEDULING=1
 export HCCL_INTRA_PCIE_ENABLE=1
 export HCCL_INTRA_ROCE_ENABLE=0

@@ -184,14 +190,15 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
---max-num-seqs 20 \
+--async-scheduling \
+--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
---gpu-memory-utilization 0.94 \
---speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
---compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+--gpu-memory-utilization 0.92 \
+--speculative-config '{"num_speculative_tokens": 3, "method": "mtp"}' \
+--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}'
 ```

**Node 1**

@@ -223,6 +230,7 @@ export OMP_NUM_THREADS=10
 export HCCL_BUFFSIZE=200
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
 export VLLM_ASCEND_ENABLE_MLAPO=1
+export VLLM_ASCEND_BALANCE_SCHEDULING=1
 export HCCL_INTRA_PCIE_ENABLE=1
 export HCCL_INTRA_ROCE_ENABLE=0

@@ -240,14 +248,15 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
---max-num-seqs 20 \
+--async-scheduling \
+--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \ ---gpu-memory-utilization 0.94 \ +--gpu-memory-utilization 0.92 \ --speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \ ---compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ +--compilation-config '{"cudagraph_capture_sizes":[4,16,32,48,64], "cudagraph_mode": "FULL_DECODE_ONLY"}' ``` ### Prefill-Decode Disaggregation