[Info][main] Correct the mistake in information documents (#4157)
### What this PR does / why we need it?
Correct mistakes in the information documents.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
UT (unit tests)
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -1,4 +1,4 @@
-# Maintainers and contributors
+# Maintainers and Contributors

 ## Maintainers

@@ -1,4 +1,4 @@
-# User stories
+# User Stories

 Read case studies on how users and developers solve real, everyday problems with vLLM Ascend

@@ -1,4 +1,4 @@
-# Versioning policy
+# Versioning Policy

 Starting with vLLM 0.7.x, the vLLM Ascend Plugin ([vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend)) project follows the [PEP 440](https://peps.python.org/pep-0440/) to publish matching with vLLM ([vllm-project/vllm](https://github.com/vllm-project/vllm)).

@@ -15,7 +15,8 @@ Currently, **ONLY** Atlas A2 series(Ascend-cann-kernels-910b),Atlas A3 series(
 - Atlas 800I A2 Inference series (Atlas 800I A2)
 - Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
 - Atlas 800I A3 Inference series (Atlas 800I A3)
-- [Experimental] Atlas 300I Inference series (Atlas 300I Duo). Currently for 310I Duo the stable version is vllm-ascend v0.10.0rc1.
+- [Experimental] Atlas 300I Inference series (Atlas 300I Duo).
+- [Experimental] Currently for 310I Duo the stable version is vllm-ascend v0.10.0rc1.

 Below series are NOT supported yet:
 - Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
@@ -135,7 +136,7 @@ OOM errors typically occur when the model exceeds the memory capacity of a singl

 In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:

-- **Limit --max-model-len**: It can save the HBM usage for kv cache initialization step.
+- **Limit `--max-model-len`**: It can save the HBM usage for kv cache initialization step.

 - **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
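As a quick illustration of the two knobs discussed in this hunk, a minimal offline sketch follows; the model path and the concrete values are placeholder assumptions, not recommendations:

```python
from vllm import LLM

# Minimal sketch: cap the context length and lower the memory-utilization target
# to leave more HBM headroom; both values here are illustrative only.
model = LLM(
    model="path/to/Qwen2-7B-Instruct",   # placeholder model path
    max_model_len=4096,                  # smaller KV cache at initialization
    gpu_memory_utilization=0.8,          # default is 0.9; lower reserves more memory
)
outputs = model.generate("Hello, how are you?")
```

The same limits can be applied on the serving path through the `--max-model-len` and `--gpu-memory-utilization` flags of `vllm serve`.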
@@ -61,7 +61,7 @@ msgid ""
 "**ACLGraph**: This is the default graph mode supported by vLLM Ascend. In "
 "v0.9.1rc1, only Qwen series models are well tested."
 msgstr ""
-"**ACLGraph**:这是 vLLM Ascend 支持的默认图模式。在 v0.9.1rc1 版本中,只有 Qwen 系列模型得到了充分测试。"
+"**ACLGraph**:这是 vLLM Ascend 支持的默认图模式。在 v0.9.1rc1 版本中,Qwen 和Deepseek系列模型得到了充分测试。"

 #: ../../user_guide/feature_guide/graph_mode.md:15
 msgid ""

(The msgstr update changes the Chinese translation from "only Qwen series models are well tested" to "Qwen and Deepseek series models are well tested", mirroring the English source change later in this diff.)
@@ -47,6 +47,8 @@ Run the following script to execute offline inference on a single NPU:
 pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
 ```

+Create a python script with the following content:
+
 ```python
 from transformers import AutoProcessor
 from vllm import LLM, SamplingParams
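Since the hunk above only shows the first lines of that script, here is a hedged sketch of how an offline multimodal run is commonly wired up with these imports; the model path, image location, and sampling values are assumptions and may differ from the script in the actual guide:

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "path/to/Qwen2.5-VL-7B-Instruct"  # placeholder model path

llm = LLM(model=MODEL_PATH, max_model_len=4096, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

processor = AutoProcessor.from_pretrained(MODEL_PATH)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/example.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image."},
    ],
}]
# Build the chat prompt and extract the vision inputs with qwen_vl_utils.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image_inputs}}],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```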
@@ -11,7 +11,7 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
 From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.

 There are two kinds for graph mode supported by vLLM Ascend:
-- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
+- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
 - **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported.

 ## Using ACLGraph
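The eager-mode fallback mentioned in this hunk amounts to one extra argument at model initialization; a minimal sketch (the model path is a placeholder):

```python
from vllm import LLM

# Minimal sketch: temporarily disable graph mode and run eagerly while debugging.
model = LLM(model="path/to/Qwen2-7B-Instruct", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```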
@@ -24,7 +24,7 @@ import os

 from vllm import LLM

-model = LLM(model="Qwen/Qwen2-7B-Instruct")
+model = LLM(model="path/to/Qwen2-7B-Instruct")
 outputs = model.generate("Hello, how are you?")
 ```
@@ -44,15 +44,15 @@ Offline example:
 import os
 from vllm import LLM

-# TorchAirGraph is only work without chunked-prefill now
-model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True}})
+# TorchAirGraph only works without chunked-prefill now
+model = LLM(model="path/to/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True}})
 outputs = model.generate("Hello, how are you?")
 ```

 Online example:

 ```shell
-vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
+vllm serve path/to/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
 ```

 You can find more details about additional configuration [here](../configuration/additional_config.md).
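For completeness, once `vllm serve` is up it exposes the usual OpenAI-compatible API; a minimal client sketch, assuming the default host/port and the served model name used above:

```python
# Minimal sketch of querying the OpenAI-compatible server started by `vllm serve`.
# Base URL, API key, and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="path/to/DeepSeek-R1-0528",
    prompt="Hello, how are you?",
    max_tokens=64,
)
print(completion.choices[0].text)
```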
@@ -136,7 +136,7 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
 * Added `lmhead_tensor_parallel_size` in `additional_config`, set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
 * Some unused environment variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
 * Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
-* Added `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environment variables, whether to enable mlp optimize when tensor parallel is enabled. This feature provides better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
+* Added `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environment variables, Whether to enable mlp optimize when tensor parallel is enabled. This feature provides better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
 * Removed `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
 * Added `enable_prefetch` in `additional_config`, Whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
 * Added `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
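The `additional_config` entries listed in these notes are passed the same way as the torchair example earlier in this diff; a hedged sketch using one of them (the value is an assumption, not a recommendation):

```python
from vllm import LLM

# Hypothetical illustration only: enable lmhead tensor parallel via additional_config,
# as described in the release note above; tensor_parallel_size and the value 2 are assumed.
model = LLM(
    model="path/to/DeepSeek-R1-0528",
    tensor_parallel_size=2,
    additional_config={"lmhead_tensor_parallel_size": 2},
)
```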
@@ -71,7 +71,7 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160
 | LLaVA-Next-Video | ✅ | |
 | MiniCPM-V | ✅ | |
 | Mistral3 | ✅ | |
-| Phi-3-Vison/Phi-3.5-Vison | ✅ | |
+| Phi-3-Vision/Phi-3.5-Vision | ✅ | |
 | Gemma3 | ✅ | |
 | Llama4 | ❌ | [1972](https://github.com/vllm-project/vllm-ascend/issues/1972) |
 | Llama3.2 | ❌ | [1972](https://github.com/vllm-project/vllm-ascend/issues/1972) |