[Info][main] Corrected the errors in the information (#4055)
### What this PR does / why we need it?
Corrected the errors in the information
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
UT (unit tests)
- vLLM version: v0.11.0
- vLLM main: 83f478bb19
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -11,9 +11,9 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
-1. Dynamic batch now depends on a offline cost model saved in a look-up table to refine the token budget. The lookup-table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`

-2. `Pandas` is needed to load the look-up table, in case `pandas` is not installed.
+2. `Pandas` is needed to load the lookup table, in case `pandas` is not installed.
```bash
pip install pandas
@@ -8,7 +8,7 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
## Getting Started
-From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by set `enforce_eager=True` when initializing the model.
+From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are two kinds for graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -45,14 +45,14 @@ import os
from vllm import LLM
# TorchAirGraph is only work without chunked-prefill now
-model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True,}})
+model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True}})
outputs = model.generate("Hello, how are you?")
```
Online example:
```shell
-vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
+vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
```
You can find more details about additional configuration [here](../configuration/additional_config.md).
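
The trailing comma removed in the online example above matters if the `--additional-config` value is parsed as strict JSON, which does not allow a comma before a closing brace. The following is a minimal sketch, assuming only Python's standard `json` module; the variable and function names are illustrative and not part of this PR. It checks both spellings and builds the string from a dict instead of typing it by hand:

```python
import json

# Old and new spellings of the --additional-config value from the hunk above.
old_value = '{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
new_value = '{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as strict JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Building the string from a dict avoids hand-typed syntax slips.
config = {
    "torchair_graph_config": {"enabled": True},
    "ascend_scheduler_config": {"enabled": True},
}
generated_value = json.dumps(config)

print(is_valid_json(old_value))   # False: trailing comma before "}"
print(is_valid_json(new_value))   # True
print(json.loads(generated_value) == json.loads(new_value))  # True
```

The Python `LLM(...)` call is unaffected either way, since a trailing comma inside a Python dict literal is legal; only the JSON string passed on the command line is sensitive to it.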
@@ -74,5 +74,5 @@ outputs = model.generate("Hello, how are you?")
Online example:
```shell
-vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
+vllm serve someother_model_weight --enforce-eager
```
@@ -20,4 +20,4 @@ vllm serve meta-llama/Llama-2-7b \
We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "csrc/kernels" directory of [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).
-When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. To require more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
+When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
@@ -28,7 +28,7 @@ See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a larger CPU memory, ensure that the RAM size is greater than 2 TB.
:::
-### Adapt to changes
+### Adapts and changes
1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization related fields from `config.json` in the weights folder.
@@ -80,7 +80,7 @@ The following is a simple example of how to use sleep mode.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
-# after serveing is up, post these endpoints
+# after serving is up, post to these endpoints
# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \