[DOC] Add explanation of 310P special param: max-model-len (#7065)
### What this PR does / why we need it?
This PR updates the documentation for running vLLM on Atlas 300I series
(310P) hardware. It adds a warning that `--max-model-len` must be set
explicitly to prevent potential Out-of-Memory (OOM) errors that can occur
with the default configuration.
The example commands and Python scripts for online and offline inference
have been updated to:
- Include `--max-model-len 4096` (or `max_model_len=4096`).
- Remove the `compilation-config` parameter, which is no longer
necessary for 310p devices.
These changes ensure users have a clearer and more stable experience
when using vLLM on Atlas 300I hardware.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
The changes are to documentation and do not require testing.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
@@ -49,6 +49,25 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
### Online Inference on NPU
```{warning}
For Atlas 300I (310P), do not rely on `max-model-len` auto detection (i.e., omitting `--max-model-len`), because it may cause OOM.

Reason (current 310P attention path):

- `AscendAttentionMetadataBuilder310` passes `model_config.max_model_len` to `AttentionMaskBuilder310`.
- `AttentionMaskBuilder310` builds a full causal mask with shape `[max_model_len, max_model_len]` in float16, then casts it to FRACTAL_NZ.
- In 310P `attention_v1` prefill/chunked-prefill (`_npu_flash_attention` / `_npu_paged_attention_splitfuse`), this explicit mask tensor is consumed directly, and there is no compressed-mask path.

So if auto detection resolves to a large context length, the mask allocation (`O(max_model_len^2)`) can exceed NPU memory and trigger OOM. Always set a conservative explicit value, for example `--max-model-len 4096`.
```
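The quadratic growth is easy to check with a back-of-the-envelope calculation. The sketch below is illustrative only (not part of this diff or of vLLM); it estimates the float16 mask footprint for a few context lengths:

```python
# Rough size of the dense [max_model_len, max_model_len] float16 causal mask
# described in the warning above. Ignores FRACTAL_NZ layout/padding overhead.
BYTES_PER_FLOAT16 = 2

def mask_size_gib(max_model_len: int) -> float:
    return max_model_len ** 2 * BYTES_PER_FLOAT16 / 1024 ** 3

for n in (4096, 32768, 131072):
    print(f"max_model_len={n:>6}: ~{mask_size_gib(n):.2f} GiB")

# max_model_len=  4096: ~0.03 GiB   (fine)
# max_model_len= 32768: ~2.00 GiB
# max_model_len=131072: ~32.00 GiB  (easily exceeds 310P device memory)
```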
Run the following script to start the vLLM server on NPU (Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards):
:::::{tab-set}
@@ -64,9 +83,9 @@ Run the following command to start the vLLM server:
:substitutions:
  vllm serve Qwen/Qwen3-0.6B \
      --tensor-parallel-size 1 \
+     --max-model-len 4096 \
      --enforce-eager \
-     --dtype float16 \
-     --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
+     --dtype float16
  ```
Once your server is started, you can query the model with input prompts.
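For example, a minimal query against the OpenAI-compatible `/v1/completions` endpoint could look like the sketch below (illustrative, not part of this diff; it assumes the server is reachable at the default `http://localhost:8000`):

```python
import requests

# Query the vLLM OpenAI-compatible server started above (default port assumed).
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])
```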
@@ -94,9 +113,9 @@ Run the following command to start the vLLM server:
:substitutions:
  vllm serve Qwen/Qwen2.5-7B-Instruct \
      --tensor-parallel-size 2 \
+     --max-model-len 4096 \
      --enforce-eager \
-     --dtype float16 \
-     --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
+     --dtype float16
  ```
Once your server is started, you can query the model with input prompts.
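Since Qwen2.5-7B-Instruct is a chat model, `/v1/chat/completions` is the natural endpoint to query; a minimal sketch (illustrative, not part of this diff; default `http://localhost:8000` assumed):

```python
import requests

# Send a single chat turn to the server started above.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "What does --max-model-len control?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```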
@@ -124,9 +143,9 @@ Run the following command to start the vLLM server:
:substitutions:
  vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
      --tensor-parallel-size 1 \
+     --max-model-len 4096 \
      --enforce-eager \
-     --dtype float16 \
-     --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
+     --dtype float16
  ```
Once your server is started, you can query the model with input prompts.
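For the vision-language model, a request can carry an image in the OpenAI vision message format; a minimal sketch (illustrative, not part of this diff; the image URL is a placeholder and the default `http://localhost:8000` is assumed):

```python
import requests

# Ask the vision-language model about an image referenced by URL.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```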
@@ -183,9 +202,9 @@ sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
  llm = LLM(
      model="Qwen/Qwen3-0.6B",
      tensor_parallel_size=1,
+     max_model_len=4096,
      enforce_eager=True, # For 300I series, only eager mode is supported.
      dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
-     compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
  )
  # Generate texts from the prompts.
  outputs = llm.generate(prompts, sampling_params)
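A typical way to inspect the results of the snippet above is shown in this minimal sketch (not part of the diff), using the fields of vLLM's `RequestOutput`:

```python
# Each RequestOutput pairs the original prompt with its generated text.
for output in outputs:
    print(f"Prompt:    {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```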
@@ -226,9 +245,9 @@ sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
  llm = LLM(
      model="Qwen/Qwen2.5-7B-Instruct",
      tensor_parallel_size=2,
+     max_model_len=4096,
      enforce_eager=True, # For 300I series, only eager mode is supported.
      dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
-     compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
  )
  # Generate texts from the prompts.
  outputs = llm.generate(prompts, sampling_params)
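For an instruct-tuned model such as Qwen2.5-7B-Instruct, recent vLLM versions also expose `LLM.chat()`, which applies the model's chat template automatically; a minimal sketch (not part of the diff, assumes the `llm` and `sampling_params` objects from the snippet above):

```python
# Chat-style generation; the model's chat template is applied for you.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain in one sentence what --max-model-len limits."},
]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)
```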
@@ -269,9 +288,9 @@ sampling_params = SamplingParams(max_completion_tokens=100, top_p=0.95, top_k=50
  llm = LLM(
      model="Qwen/Qwen2.5-VL-3B-Instruct",
      tensor_parallel_size=1,
+     max_model_len=4096,
      enforce_eager=True, # For 300I series, only eager mode is supported.
      dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
-     compilation_config={"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}, # High performance for 300I series
  )
  # Generate texts from the prompts.
  outputs = llm.generate(prompts, sampling_params)
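For offline multimodal inference with the vision-language model, the image is passed through `multi_modal_data`. The sketch below is illustrative only (not part of the diff); the image path is hypothetical and the prompt's image-placeholder tokens are model-specific, so verify them against the Qwen2.5-VL model card:

```python
from PIL import Image

# Hypothetical local image; replace with a real path.
image = Image.open("example.jpg")

# Qwen2.5-VL-style prompt with image placeholder tokens (verify against the model card).
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```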