diff --git a/docs/source/tutorials/DeepSeek-V3.2.md b/docs/source/tutorials/DeepSeek-V3.2.md
index 199acfd6..30f8d907 100644
--- a/docs/source/tutorials/DeepSeek-V3.2.md
+++ b/docs/source/tutorials/DeepSeek-V3.2.md
@@ -454,10 +454,10 @@ Before you start, please
     --seed 1024 \
     --served-model-name dsv3 \
     --max-model-len 68000 \
-    --max-num-batched-tokens 4 \
-    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
+    --max-num-batched-tokens 12 \
+    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
     --trust-remote-code \
-    --max-num-seqs 1 \
+    --max-num-seqs 4 \
     --gpu-memory-utilization 0.95 \
     --no-enable-prefix-caching \
     --async-scheduling \
@@ -479,7 +479,8 @@ Before you start, please
         "tp_size": 4
       }
     }
-  }'
+  }' \
+  --additional-config '{"recompute_scheduler_enable": true}'
 ```
 
 4. Decode node 1
@@ -532,11 +533,11 @@ Before you start, please
     --seed 1024 \
     --served-model-name dsv3 \
     --max-model-len 68000 \
-    --max-num-batched-tokens 4 \
-    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
+    --max-num-batched-tokens 12 \
+    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
     --trust-remote-code \
     --async-scheduling \
-    --max-num-seqs 1 \
+    --max-num-seqs 4 \
     --gpu-memory-utilization 0.95 \
     --no-enable-prefix-caching \
     --quantization ascend \
@@ -557,7 +558,8 @@ Before you start, please
         "tp_size": 4
       }
     }
-  }'
+  }' \
+  --additional-config '{"recompute_scheduler_enable": true}'
 ```
 
 Once the preparation is done, you can start the server with the following command on each node:
@@ -639,6 +641,16 @@ lm_eval \
 
 Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
 
+With this tutorial, the performance result is:
+
+**Hardware**: A3-752T, 4 nodes
+
+**Deployment**: 1P1D, Prefill node: DP2+TP16, Decode node: DP8+TP4
+
+**Input/Output**: 64k/3k
+
+**Performance**: 533 TPS, TPOT 32 ms
+
 ### Using vLLM Benchmark
 
 Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
@@ -657,12 +669,8 @@ export VLLM_USE_MODELSCOPE=true
 vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
 ```
 
-After about several minutes, you can get the performance evaluation result. With this tutorial, the performance result is:
+## Function Call
 
-**Hardware**: A3-752T, 4 node
+The function call feature is supported since v0.13.0rc1. Please use the latest version.
 
-**Deployment**: 1P1D, Prefill node: DP2+TP16, Decode Node: DP8+TP4
-
-**Input/Output**: 64k/3k
-
-**Performance**: 255tps, TPOT 23ms
+Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
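
For reviewers, here is a condensed sketch of the retuned decode-node invocation after this change. It is a sketch only: `<model-path>` is a placeholder, all other flags (parallelism, KV-transfer config, quantization, and so on) are elided, and the MTP reading in the comments (each of the 4 sequences contributing up to 3 tokens per decode step, so batched tokens and capture sizes move in multiples of 3) is an inference from the new values, not something the diff states:

```bash
# Sketch only: just the decode-node flags touched by this change.
# Assumption: with MTP speculative decoding, each of the 4 sequences can
# contribute up to 3 tokens per decode step, hence
#   max-num-batched-tokens = 4 seqs x 3 tokens = 12
# and cudagraph capture sizes at multiples of 3.
vllm serve <model-path> \
    --max-num-seqs 4 \
    --max-num-batched-tokens 12 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
    --additional-config '{"recompute_scheduler_enable": true}'
```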