### What this PR does / why we need it?
This PR updates the documentation for running vLLM on Atlas 300I series
(310p) hardware. It adds a warning to explicitly set `--max-model-len`
to prevent potential Out-of-Memory (OOM) errors that can occur with the
default configuration.
The example commands and Python scripts for online and offline inference
have been updated to:
- Include `--max-model-len 4096` (or `max_model_len=4096`).
- Remove the `compilation-config` parameter, which is no longer
necessary for 310p devices.
These changes ensure users have a clearer and more stable experience
when using vLLM on Atlas 300I hardware.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
The changes are to documentation and do not require testing.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
vLLM Ascend Plugin documents
Live doc: https://docs.vllm.ai/projects/ascend
Build the docs
# Install dependencies.
pip install -r requirements-docs.txt
# Build the docs.
make clean
make html
# Build the docs with translation
make intl
# Open the docs with your browser
python -m http.server -d _build/html/
Launch your browser and open:
- English version: http://localhost:8000
- Chinese version: http://localhost:8000/zh_CN