[Info][main] Corrected the errors in the information (#4055)

### What this PR does / why we need it?
Corrects minor wording and grammar errors in the documentation and fixes small mistakes in the example commands and configs.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main: 83f478bb19

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Author: lilinsiman
Date: 2025-11-08 18:48:59 +08:00
Committed by: GitHub
Parent: 1d7cb5880a
Commit: a3ff765c65
20 changed files with 35 additions and 35 deletions


@@ -11,9 +11,9 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
-1. Dynamic batch now depends on a offline cost model saved in a look-up table to refine the token budget. The lookup-table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
-2. `Pandas` is needed to load the look-up table, in case `pandas` is not installed.
+2. `Pandas` is needed to load the lookup table, in case `pandas` is not installed.
```bash
pip install pandas
```
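
As a quick sanity check for the prerequisite above, the table can be loaded with pandas once it is saved to the documented path. This is an illustrative sketch only and not part of the official setup; no column layout is assumed, the columns are simply printed for inspection.

```python
# Sanity-check sketch (not part of the official setup): confirm the
# downloaded profile table is readable from the documented path.
import pandas as pd

table = pd.read_csv("vllm_ascend/core/profile_table.csv")
print(table.shape)             # rows/columns loaded from the lookup table
print(table.columns.tolist())  # inspect the actual column names
```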


@@ -8,7 +8,7 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
## Getting Started
-From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by set `enforce_eager=True` when initializing the model.
+From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds for graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -45,14 +45,14 @@ import os
from vllm import LLM
# TorchAirGraph is only work without chunked-prefill now
-model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True,}})
+model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True}})
outputs = model.generate("Hello, how are you?")
```
Online example:
```shell
-vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
+vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
```
You can find more details about additional configuration [here](../configuration/additional_config.md).
@@ -74,5 +74,5 @@ outputs = model.generate("Hello, how are you?")
Online example:
```shell
-vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
+vllm serve someother_model_weight --enforce-eager
```
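
For completeness, a minimal offline counterpart to the online eager-mode example above, assuming a placeholder model path; the only relevant detail is passing `enforce_eager=True` when constructing the `LLM`.

```python
from vllm import LLM

# Fall back to eager mode by disabling graph execution at initialization.
# "your-model-path" is a placeholder; use the weights you actually serve.
model = LLM(model="your-model-path", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```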


@@ -20,4 +20,4 @@ vllm serve meta-llama/Llama-2-7b \
We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "csrc/kernels" directory of [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).
-When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. To require more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
+When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
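
A hedged sketch of the reinstall step described above, assuming vllm-ascend was installed in editable mode from a local checkout; adjust the commands to match how the package was originally installed.

```bash
# Disable compilation of the custom AscendC operators, then reinstall.
# Assumes a local vllm-ascend checkout installed in editable mode.
export COMPILE_CUSTOM_KERNELS=0
cd vllm-ascend
pip install -e .
```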


@@ -28,7 +28,7 @@ See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a larger CPU memory, ensure that the RAM size is greater than 2 TB.
:::
-### Adapt to changes
+### Adapts and changes
1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization related fields from `config.json` in the weights folder.
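
As an illustration of step 2 (not taken from the referenced guide), the quantization fields can be removed from `config.json` with a short script; the key name `quantization_config` is an assumption here, so verify the actual field names in your weights folder before running it.

```python
# Illustrative sketch: drop quantization-related fields from config.json.
# The key "quantization_config" is an assumed name; verify it in your file.
import json

config_path = "path/to/weights/config.json"  # placeholder weights folder
with open(config_path) as f:
    cfg = json.load(f)

cfg.pop("quantization_config", None)

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```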


@@ -80,7 +80,7 @@ The following is a simple example of how to use sleep mode.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
-# after serveing is up, post these endpoints
+# after serving is up, post to these endpoints
# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \