[Doc] Upgrade multi-node doc (#4365)

### What this PR does / why we need it?
When the `Ascend scheduler` is enabled, the parameter `max_num_batched_tokens`
must be no smaller than `max_model_len`; otherwise, we will encounter the
following error:
```shell
Value error, Ascend scheduler is enabled without chunked prefill feature. Argument max_num_batched_tokens (4096) is smaller than max_model_len (32768). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'model_co...g': {'enabled': True}}}), input_type=ArgsKwargs]
```
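
For example, a valid combination, using the values adopted in this PR, keeps `max_num_batched_tokens` at least as large as `max_model_len`. Below is a minimal sketch; all other serving flags from the tutorial are omitted for brevity:
```shell
# Minimal sketch only: assumes the Ascend scheduler is enabled as in the tutorial,
# with chunked prefill disabled. The remaining tutorial flags are omitted here.
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192
```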

### Does this PR introduce _any_ user-facing change?
Yes. Users and developers who run the model following the
[tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html)
will now get commands whose parameters are specified correctly.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Author: Li Wang
Date: 2025-11-24 10:57:50 +08:00
Commit: b5f7a83927 (parent: b34f195cc8)
3 changed files with 9 additions and 9 deletions

```diff
@@ -131,9 +131,9 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
 --served-model-name deepseek_v3.1 \
 --enable-expert-parallel \
 --max-num-seqs 16 \
---max-model-len 32768 \
+--max-model-len 8192 \
 --quantization ascend \
---max-num-batched-tokens 4096 \
+--max-num-batched-tokens 8192 \
 --trust-remote-code \
 --no-enable-prefix-caching \
 --gpu-memory-utilization 0.9 \
@@ -176,8 +176,8 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
 --quantization ascend \
 --served-model-name deepseek_v3.1 \
 --max-num-seqs 16 \
---max-model-len 32768 \
---max-num-batched-tokens 4096 \
+--max-model-len 8192 \
+--max-num-batched-tokens 8192 \
 --enable-expert-parallel \
 --trust-remote-code \
 --no-enable-prefix-caching \
```

```diff
@@ -88,8 +88,8 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
 --tensor-parallel-size 8 \
 --enable-expert-parallel \
 --max-num-seqs 16 \
---max-model-len 32768 \
---max-num-batched-tokens 4096 \
+--max-model-len 8192 \
+--max-num-batched-tokens 8192 \
 --trust-remote-code \
 --no-enable-prefix-caching \
 --gpu-memory-utilization 0.9 \
@@ -130,9 +130,9 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
 --tensor-parallel-size 8 \
 --served-model-name kimi \
 --max-num-seqs 16 \
---max-model-len 32768 \
+--max-model-len 8192 \
 --quantization ascend \
---max-num-batched-tokens 4096 \
+--max-num-batched-tokens 8192 \
 --enable-expert-parallel \
 --trust-remote-code \
 --no-enable-prefix-caching \
```
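
As an optional sanity check after updating the tutorial commands, a short completion request can confirm that the server accepts requests. This is a hypothetical example, not part of this PR: the endpoint and port assume vLLM's default OpenAI-compatible server on port 8000, and the model name matches the `--served-model-name kimi` flag above.
```shell
# Hypothetical smoke test against the Kimi deployment from the tutorial.
# Assumes the default OpenAI-compatible endpoint on port 8000.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "kimi", "prompt": "Hello", "max_tokens": 16}'
```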