[doc] update using command (#5373)

### What this PR does / why we need it?
Update the configuration in the DeepSeek V3.2 usage tutorial for optimal performance.

- vLLM version: release/v0.13.0
- vLLM main: bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Author: cookieyyds
Date: 2025-12-25 22:28:35 +08:00
Committed by: GitHub
Commit: 2da8038dd2 (parent 59f11dd1cb)


@@ -454,10 +454,10 @@ Before you start, please
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
- --max-num-batched-tokens 4 \
- --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
+ --max-num-batched-tokens 12 \
+ --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
- --max-num-seqs 1 \
+ --max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--async-scheduling \
@@ -479,7 +479,8 @@ Before you start, please
"tp_size": 4
}
}
- }'
+ }' \
+ --additional-config '{"recompute_scheduler_enable" : true}'
```
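For readability, the flags touched by this hunk assemble into the following launch-command sketch. This is illustrative only: `<model-path>` is a placeholder, and the flags this patch does not touch (e.g. `--quantization ascend` and the `--kv-transfer-config` block from the surrounding tutorial) are elided.

```
# Sketch of the updated decode-node launch after this patch.
# <model-path> is a placeholder; untouched flags from the tutorial are elided.
vllm serve <model-path> \
  --seed 1024 \
  --served-model-name dsv3 \
  --max-model-len 68000 \
  --max-num-batched-tokens 12 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
  --trust-remote-code \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --no-enable-prefix-caching \
  --async-scheduling \
  --additional-config '{"recompute_scheduler_enable" : true}'
```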
4. Decode node 1
@@ -532,11 +533,11 @@ Before you start, please
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
- --max-num-batched-tokens 4 \
- --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
+ --max-num-batched-tokens 12 \
+ --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--async-scheduling \
- --max-num-seqs 1 \
+ --max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--quantization ascend \
@@ -557,7 +558,8 @@ Before you start, please
"tp_size": 4
}
}
- }'
+ }' \
+ --additional-config '{"recompute_scheduler_enable" : true}'
```
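The new values are mutually consistent under one reading, which is our assumption and is not stated in the patch: with MTP speculative decoding emitting 2 draft tokens, each sequence contributes 1 + 2 = 3 tokens per decode step, so `cudagraph_capture_sizes` covers batches of 1 to `--max-num-seqs` sequences and `--max-num-batched-tokens` equals the largest capture size:

```
# Sanity check (assumption: MTP speculative decoding with 2 draft tokens,
# i.e. 1 + 2 = 3 tokens per sequence per decode step).
tokens_per_seq=3
max_num_seqs=4     # --max-num-seqs
sizes=""
for n in $(seq 1 "$max_num_seqs"); do
  sizes="$sizes $((tokens_per_seq * n))"
done
echo "cudagraph_capture_sizes:$sizes"                               # 3 6 9 12
echo "max-num-batched-tokens: $((tokens_per_seq * max_num_seqs))"   # 12
```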
Once the preparation is done, you can start the server with the following command on each node:
@@ -639,6 +641,16 @@ lm_eval \
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
+ The performance result is:
+ **Hardware**: A3-752T, 4 nodes
+ **Deployment**: 1P1D, Prefill node: DP2+TP16, Decode node: DP8+TP4
+ **Input/Output**: 64k/3k
+ **Performance**: 533 tps, TPOT 32 ms
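As a quick consistency check on these numbers (assuming TPOT is the per-sequence time per output token, which the patch does not spell out), the aggregate throughput implies roughly how many sequences are decoding concurrently:

```
# Back-of-envelope check (assumes TPOT = per-sequence time per output token)
awk 'BEGIN {
  tpot = 0.032        # 32 ms TPOT from the result above
  tps  = 533          # aggregate output tokens/s
  rate = 1 / tpot     # ~31 output tokens/s per sequence
  printf "per-seq rate: %.1f tok/s, ~%.0f sequences in flight\n", rate, tps / rate
}'
```

About 17 concurrent sequences is plausible for this deployment: the decode side runs DP8 with `--max-num-seqs 4`, i.e. up to 32 sequences.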
### Using vLLM Benchmark
Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
@@ -657,12 +669,8 @@ export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```
After several minutes, you can get the performance evaluation result. With this tutorial, the performance result is:
- **Hardware**: A3-752T, 4 node
- **Deployment**: 1P1D, Prefill node: DP2+TP16, Decode Node: DP8+TP4
- **Input/Output**: 64k/3k
- **Performance**: 255tps, TPOT 23ms
## Function Call
The function call feature is supported from v0.13.0rc1 onward. Please use the latest version.
Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
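A minimal tool-calling request against the server above could look as follows. This is a sketch, not the tutorial's own example: the port (vLLM's default 8000), the `get_weather` tool definition, and the server-side tool-call options (e.g. `--enable-auto-tool-choice` plus a tool-call parser) are assumptions; see the linked guide for the authoritative setup.

```
# Hypothetical OpenAI-compatible tool-calling request.
# Assumes the default port 8000 and the served model name "dsv3" from above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dsv3",
    "messages": [{"role": "user", "content": "What is the weather like in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```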