[Doc] Upgrade multi-node doc (#4365)
### What this PR does / why we need it?
When the `Ascend scheduler` is enabled without chunked prefill, the param
`max_num_batched_tokens` must be at least as large as `max_model_len`.
Otherwise, startup fails with the following error:
```shell
Value error, Ascend scheduler is enabled without chunked prefill feature. Argument max_num_batched_tokens (4096) is smaller than max_model_len (32768). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len. [type=value_error, input_value=ArgsKwargs((), {'model_co...g': {'enabled': True}}}), input_type=ArgsKwargs]
```
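The constraint can be checked before launching the server. The following is a minimal sketch, assuming a hypothetical helper named `check_batched_tokens`; it is not part of vLLM or vllm-ascend and only mirrors the comparison reported in the error above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not a vLLM API): mirrors the comparison
# reported in the error message for the Ascend scheduler without chunked prefill.
check_batched_tokens() {
  local max_num_batched_tokens=$1
  local max_model_len=$2
  if (( max_num_batched_tokens < max_model_len )); then
    echo "max_num_batched_tokens (${max_num_batched_tokens}) is smaller than max_model_len (${max_model_len})" >&2
    return 1
  fi
  return 0
}

# The old tutorial values fail the check; the updated values pass.
check_batched_tokens 4096 32768 2>/dev/null || echo "old values rejected"
check_batched_tokens 8192 8192 && echo "new values accepted"
```

Equal values pass, which matches the updated tutorial commands that set both flags to 8192.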
### Does this PR introduce _any_ user-facing change?
Users and developers who run the models by following the
[tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html)
now get correctly specified parameters.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -216,7 +216,7 @@ jobs:
  # 1) check follower pods
  ALL_FOLLOWERS_READY=true
- for ((i=1; i<${SIZE}; i++)); do
+ for ((i=1; i<SIZE; i++)); do
  POD="${POD_PREFIX}-${i}"
  PHASE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
  READY=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
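The loop change in this hunk relies on Bash arithmetic evaluation: inside `(( ))`, a bare variable name is evaluated as its value, so `i<SIZE` and `i<${SIZE}` behave the same when `SIZE` holds an integer. A small sketch, using hypothetical values for `SIZE` and `POD_PREFIX` and assuming index 0 is the leader pod:

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration; the workflow sets these elsewhere.
SIZE=3
POD_PREFIX="vllm"

# Collect follower pod names. Followers are indexed from 1; index 0 is
# assumed to be the leader and is skipped.
followers=()
for ((i=1; i<SIZE; i++)); do
  followers+=("${POD_PREFIX}-${i}")
done

printf '%s\n' "${followers[@]}"
```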
@@ -131,9 +131,9 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
  --served-model-name deepseek_v3.1 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
- --max-model-len 32768 \
+ --max-model-len 8192 \
  --quantization ascend \
- --max-num-batched-tokens 4096 \
+ --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
@@ -176,8 +176,8 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
  --quantization ascend \
  --served-model-name deepseek_v3.1 \
  --max-num-seqs 16 \
- --max-model-len 32768 \
- --max-num-batched-tokens 4096 \
+ --max-model-len 8192 \
+ --max-num-batched-tokens 8192 \
  --enable-expert-parallel \
  --trust-remote-code \
  --no-enable-prefix-caching \
@@ -88,8 +88,8 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
- --max-model-len 32768 \
- --max-num-batched-tokens 4096 \
+ --max-model-len 8192 \
+ --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
@@ -130,9 +130,9 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
  --tensor-parallel-size 8 \
  --served-model-name kimi \
  --max-num-seqs 16 \
- --max-model-len 32768 \
+ --max-model-len 8192 \
  --quantization ascend \
- --max-num-batched-tokens 4096 \
+ --max-num-batched-tokens 8192 \
  --enable-expert-parallel \
  --trust-remote-code \
  --no-enable-prefix-caching \