[doc] update using command (#5373)

### What this PR does / why we need it?
Update the configuration in the DeepSeek V3.2 usage tutorial for optimal performance.

- vLLM version: release/v0.13.0
- vLLM main: bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Author: cookieyyds
Date: 2025-12-25 22:28:35 +08:00
Committed by: GitHub
Commit: 2da8038dd2 (parent 59f11dd1cb)


@@ -454,10 +454,10 @@ Before you start, please
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
- --max-num-batched-tokens 4 \
- --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
+ --max-num-batched-tokens 12 \
+ --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
- --max-num-seqs 1 \
+ --max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--async-scheduling \
@@ -479,7 +479,8 @@ Before you start, please
"tp_size": 4
}
}
- }'
+ }' \
+ --additional-config '{"recompute_scheduler_enable" : true}'
```
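For readability, the flags touched by this hunk assemble into the following launch-command sketch. This is illustrative only: `<model-path>` is a placeholder, and the flags this patch does not touch (e.g. `--quantization ascend` and the `--kv-transfer-config` block from the surrounding tutorial) are elided.

```
# Sketch of the updated decode-node launch after this patch.
# <model-path> is a placeholder; untouched flags from the tutorial are elided.
vllm serve <model-path> \
  --seed 1024 \
  --served-model-name dsv3 \
  --max-model-len 68000 \
  --max-num-batched-tokens 12 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
  --trust-remote-code \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --no-enable-prefix-caching \
  --async-scheduling \
  --additional-config '{"recompute_scheduler_enable" : true}'
```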
4. Decode node 1
@@ -532,11 +533,11 @@ Before you start, please
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
- --max-num-batched-tokens 4 \
- --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
+ --max-num-batched-tokens 12 \
+ --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--async-scheduling \
- --max-num-seqs 1 \
+ --max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--quantization ascend \
@@ -557,7 +558,8 @@ Before you start, please
"tp_size": 4
}
}
- }'
+ }' \
+ --additional-config '{"recompute_scheduler_enable" : true}'
```
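The new values are mutually consistent under one reading, which is our assumption and is not stated in the patch: with MTP speculative decoding emitting 2 draft tokens, each sequence contributes 1 + 2 = 3 tokens per decode step, so `cudagraph_capture_sizes` covers batches of 1 to `--max-num-seqs` sequences and `--max-num-batched-tokens` equals the largest capture size:

```
# Sanity check (assumption: MTP speculative decoding with 2 draft tokens,
# i.e. 1 + 2 = 3 tokens per sequence per decode step).
tokens_per_seq=3
max_num_seqs=4     # --max-num-seqs
sizes=""
for n in $(seq 1 "$max_num_seqs"); do
  sizes="$sizes $((tokens_per_seq * n))"
done
echo "cudagraph_capture_sizes:$sizes"                               # 3 6 9 12
echo "max-num-batched-tokens: $((tokens_per_seq * max_num_seqs))"   # 12
```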
Once the preparation is done, you can start the server with the following command on each node:
@@ -639,6 +641,16 @@ lm_eval \
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
+ The performance result is:
+ **Hardware**: A3-752T, 4 nodes
+ **Deployment**: 1P1D, Prefill node: DP2+TP16, Decode node: DP8+TP4
+ **Input/Output**: 64k/3k
+ **Performance**: 533 tps, TPOT 32 ms
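As a quick consistency check on these numbers (assuming TPOT is the per-sequence time per output token, which the patch does not spell out), the aggregate throughput implies roughly how many sequences are decoding concurrently:

```
# Back-of-envelope check (assumes TPOT = per-sequence time per output token)
awk 'BEGIN {
  tpot = 0.032        # 32 ms TPOT from the result above
  tps  = 533          # aggregate output tokens/s
  rate = 1 / tpot     # ~31 output tokens/s per sequence
  printf "per-seq rate: %.1f tok/s, ~%.0f sequences in flight\n", rate, tps / rate
}'
```

About 17 concurrent sequences is plausible for this deployment: the decode side runs DP8 with `--max-num-seqs 4`, i.e. up to 32 sequences.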
### Using vLLM Benchmark
Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
@@ -657,12 +669,8 @@ export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```
After several minutes, you can get the performance evaluation result. With this tutorial, the performance result is:
- **Hardware**: A3-752T, 4 node
- **Deployment**: 1P1D, Prefill node: DP2+TP16, Decode Node: DP8+TP4
- **Input/Output**: 64k/3k
- **Performance**: 255tps, TPOT 23ms
## Function Call
The function call feature is supported from v0.13.0rc1 onward. Please use the latest version.
Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
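A minimal tool-calling request against the server above could look as follows. This is a sketch, not the tutorial's own example: the port (vLLM's default 8000), the `get_weather` tool definition, and the server-side tool-call options (e.g. `--enable-auto-tool-choice` plus a tool-call parser) are assumptions; see the linked guide for the authoritative setup.

```
# Hypothetical OpenAI-compatible tool-calling request.
# Assumes the default port 8000 and the served model name "dsv3" from above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dsv3",
    "messages": [{"role": "user", "content": "What is the weather like in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```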