[Info][main] Corrected the errors in the information (#4055)
### What this PR does / why we need it?
Corrected the errors in the information
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests (UT).
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -1,6 +1,6 @@
# Additional Configuration

-Additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by their own. vLLM Ascend uses this mechanism to make the project more flexible.
+Additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by themselves. vLLM Ascend uses this mechanism to make the project more flexible.

## How to use
@@ -35,7 +35,7 @@ The following table lists additional configuration options available in vLLM Asc
| `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
-| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
| `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
| `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
@@ -70,14 +70,14 @@ The details of each configuration option are as follows:
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | A request is considered long if the prompt is longer than this number of tokens. |

-ascend_scheduler_config also support the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
+ascend_scheduler_config also supports the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.

**weight_prefetch_config**

| Name | Type | Default | Description |
|------------------|------|-------------------------------------------------------------|------------------------------------|
| `enabled` | bool | `False` | Whether to enable weight prefetch. |
-| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weights. |
+| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weight. |

### Example
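A minimal serve-time sketch of passing these options through `--additional-config` (the same flag used in the graph mode examples further below); the model name and the particular options chosen here are illustrative only:

```shell
vllm serve Qwen/Qwen2-7B-Instruct \
  --additional-config='{"ascend_scheduler_config": {"enabled": true}, "weight_prefetch_config": {"enabled": true}}'
```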
@@ -11,9 +11,9 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites

-1. Dynamic batch now depends on a offline cost model saved in a look-up table to refine the token budget. The lookup-table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved as a `.csv` file, which should first be downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`.

-2. `Pandas` is needed to load the look-up table, in case `pandas` is not installed.
+2. `pandas` is needed to load the lookup table. Install it if it is not already available:

```bash
pip install pandas
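# A hedged sketch of fetching the lookup table described in step 1 (wget usage is assumed;
# the URL and target path are the ones given above):
wget -O vllm_ascend/core/profile_table.csv \
  https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv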
@@ -8,7 +8,7 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
## Getting Started

-From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by set `enforce_eager=True` when initializing the model.
+From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -45,14 +45,14 @@ import os
from vllm import LLM

# TorchAirGraph only works without chunked-prefill now
-model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True,}})
+model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True}})
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
-vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
+vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
```
You can find more details about additional configuration [here](../configuration/additional_config.md).
@@ -74,5 +74,5 @@ outputs = model.generate("Hello, how are you?")
Online example:

```shell
-vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
+vllm serve someother_model_weight --enforce-eager
```
@@ -20,4 +20,4 @@ vllm serve meta-llama/Llama-2-7b \
We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "csrc/kernels" directory of the [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).

-When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. To require more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
+When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions about installation and compilation, you can refer to the [installation guide](../../installation.md).
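For reference, a minimal sketch of that reinstall step, assuming a source checkout of vllm-ascend (see the [installation guide](../../installation.md) for the authoritative procedure):

```shell
# Hedged sketch: rebuild and reinstall vllm-ascend without compiling the custom AscendC kernels
COMPILE_CUSTOM_KERNELS=0 pip install -e .
```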
@@ -28,7 +28,7 @@ See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a large amount of CPU memory; ensure that the RAM size is greater than 2 TB.
:::

-### Adapt to changes
+### Adaptations and changes
1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. You need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder (see the sketch below).
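A hedged sketch of that `config.json` edit, assuming `jq` is available; the field name `quantization_config` is an assumption, so confirm the exact fields against the linked guide:

```shell
# Remove the quantization-related field from config.json in the weights folder.
# "quantization_config" is assumed here; check the msmodelslim guide for the exact field names.
jq 'del(.quantization_config)' config.json > config.json.tmp && mv config.json.tmp config.json
```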
@@ -80,7 +80,7 @@ The following is a simple example of how to use sleep mode.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode

-# after serveing is up, post these endpoints
+# after serving is up, post to these endpoints

# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \
@@ -39,7 +39,7 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- MTP now works with the token > 1. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
- Qwen2.5 VL now works with quantization. [#2778](https://github.com/vllm-project/vllm-ascend/pull/2778)
- Improved the performance with async scheduler enabled. [#2783](https://github.com/vllm-project/vllm-ascend/pull/2783)
-- Fixed the performance regression with non MLA model when use default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)
+- Fixed the performance regression with non-MLA models when using the default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)

### Others
- The performance of W8A8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
@@ -106,7 +106,7 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
* Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
* Added `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environment variables, whether to enable mlp optimize when tensor parallel is enabled. This feature provides better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
* Removed `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
-* Added `enable_prefetch` in `additional_config`, whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
+* Added `enable_prefetch` in `additional_config`, Whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Added `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
* `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to be enabled when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
@@ -461,7 +461,7 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
### Highlights

- DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html) to give it a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
-- Qwen series models works with graph mode now. It works by default with V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and generalize in the next release. If you hit any issues, please feel free to open an issue on GitHub and fallback to eager mode temporarily by set `enforce_eager=True` when initializing the model.
+- Qwen series models work with graph mode now. It works by default with V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and generalize it in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

### Core
@@ -590,13 +590,13 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
- vLLM V1 engine experimental support is included in this version. You can visit the [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more details. By default, vLLM will fall back to V0 if V1 doesn't work; please set the `VLLM_USE_V1=1` environment variable if you want to use V1 forcibly.
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
-- Sleep Mode feature is supported. Currently it's only work on V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
+- Sleep Mode feature is supported. Currently it only works on V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)

### Core

- The Ascend scheduler is added for V1 engine. This scheduler has more affinity with Ascend hardware. More scheduler policies will be added in the future. [#543](https://github.com/vllm-project/vllm-ascend/pull/543)
- Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by the vLLM team. vllm-ascend will support it once it's ready from vLLM. Follow the [official guide](https://docs.vllm.ai/en/latest/features/disagg_prefill.html) to use it. [#432](https://github.com/vllm-project/vllm-ascend/pull/432)
-- Spec decode feature works now. Currently it's only work on V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500)
+- Spec decode feature works now. Currently it only works on V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500)
- Structured output feature works now on V1 Engine. Currently it only supports the xgrammar backend, while using the guidance backend may get some errors. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)

### Others
@@ -10,7 +10,7 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
| Automatic Prefix Caching | 🟢 Functional | Functional, see detailed note: [vllm-ascend#732][apc] |
| LoRA | 🟢 Functional | [vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora] |
| Speculative decoding | 🟢 Functional | Basic support |
-| Pooling | 🟢 Functional | CI needed to adapt to more models; V1 support rely on vLLM support. |
+| Pooling | 🟢 Functional | CI needed to adapt to more models; V1 support relies on vLLM support. |
| Enc-dec | 🟡 Planned | vLLM should support this feature first. |
| Multi Modality | 🟢 Functional | [Tutorial][multimodal], optimizing and adapting more models |
| LogProbs | 🟢 Functional | CI needed |