Drop torchair (#4814)
aclgraph is stable and fast now. Let's drop torchair graph mode now.
TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
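
For callers that still pass `torchair_graph_config` through `additional_config`, a migration could look like the sketch below. It is a minimal illustration in plain Python, assuming the configuration is held as an ordinary dict. The helper name `drop_removed_options` and the `REMOVED_KEYS` tuple are hypothetical and not part of vLLM Ascend, and this commit does not say how a leftover key is handled at runtime.

```python
from copy import deepcopy

# Option removed from the documented configuration by this commit.
# The tuple itself is a hypothetical illustration, not a vLLM Ascend constant.
REMOVED_KEYS = ("torchair_graph_config",)


def drop_removed_options(additional_config: dict) -> dict:
    """Return a copy of additional_config without the removed torchair options.

    Hypothetical helper for illustration only; the input dict is not modified.
    """
    cleaned = deepcopy(additional_config)
    for key in REMOVED_KEYS:
        # pop() with a default is a no-op when the key is absent.
        cleaned.pop(key, None)
    return cleaned


old_config = {
    "torchair_graph_config": {"enabled": True, "use_cached_graph": True},
    "weight_prefetch_config": {"enabled": True},
}
new_config = drop_removed_options(old_config)
print(new_config)
```

The cleaned dict can then be passed as `additional_config` in the same way as before.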
@@ -27,7 +27,6 @@ The following table lists additional configuration options available in vLLM Asc
 | Name | Type | Default | Description |
 |-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------|
 | `xlite_graph_config` | dict | `{}` | Configuration options for xlite graph mode |
-| `torchair_graph_config` | dict | `{}` | Configuration options for torchair graph mode |
 | `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
 | `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. |
 | `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
@@ -52,21 +51,6 @@ The details of each configuration option are as follows:
 | `enabled` | bool | `False` | Whether to enable xlite graph mode. Currently only Llama or Qwen dense series models are supported. |
 | `full_mode` | bool | `False` | Whether to enable xlite for both the prefill and decode stages. By default, xlite is only enabled for the decode stage. |
 
-**torchair_graph_config**
-
-| Name | Type | Default | Description |
-| ---- | ---- | ------- | ----------- |
-| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported. |
-| `mode` | str | `None` | When using reduce-overhead mode for torchair, it needs to be set. |
-| `enable_multistream_mla`| bool | `False` | Whether to put vector operators of MLA to another stream. This option only takes effect on models using MLA (for example, DeepSeek). |
-| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization. |
-| `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
-| `use_cached_graph` | bool | `False` | Whether to use cached graph. |
-| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache. |
-| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty. |
-| `enable_kv_nz`| bool | `False` | Whether to enable KV Cache NZ layout. This option only takes effect on models using MLA (for example, DeepSeek). |
-| `enable_super_kernel` | bool | `False` | Whether to enable super kernel to fuse operators in deepseek moe layers. This option only takes effects on moe models using dynamic w8a8 quantization.|
-
 **weight_prefetch_config**
 
 | Name | Type | Default | Description |
@@ -80,13 +64,6 @@ An example of additional configuration is as follows:
 
 ```
 {
-    "torchair_graph_config": {
-        "enabled": True,
-        "use_cached_graph": True,
-        "graph_batch_sizes": [1, 2, 4, 8],
-        "graph_batch_sizes_init": False,
-        "enable_kv_nz": False
-    },
     "weight_prefetch_config": {
         "enabled": True,
         "prefetch_ratio": {

@@ -10,9 +10,8 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
 
 From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
 
-There are three kinds for graph mode supported by vLLM Ascend:
+There are two kinds for graph mode supported by vLLM Ascend:
 - **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
-- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported.
 - **XliteGraph**: This is the euler xlite graph mode. In v0.11.0, only Llama and Qwen dense serise models are supported.
 
 ## Using ACLGraph
@@ -35,29 +34,6 @@ Online example:
 vllm serve Qwen/Qwen2-7B-Instruct
 ```
 
-## Using TorchAirGraph
-
-If you want to run DeepSeek series models with the graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional configuration is required.
-
-Offline example:
-
-```python
-import os
-from vllm import LLM
-
-# TorchAirGraph only works without chunked-prefill now
-model = LLM(model="path/to/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True}})
-outputs = model.generate("Hello, how are you?")
-```
-
-Online example:
-
-```shell
-vllm serve path/to/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}}'
-```
-
-You can find more details about additional configuration [here](../configuration/additional_config.md).
-
 ## Using XliteGraph
 
 If you want to run Llama or Qwen dense series models with xlite graph mode, please install xlite, and set xlite_graph_config.
@@ -87,7 +63,7 @@ You can find more details abort xlite [here](https://gitee.com/openeuler/GVirt/b
 
 ## Fallback to the Eager Mode
 
-If `ACLGraph`, `TorchAirGraph` and `XliteGraph` all fail to run, you should fallback to the eager mode.
+If `ACLGraph` and `XliteGraph` all fail to run, you should fallback to the eager mode.
 
 Offline example:
 

@@ -104,22 +104,3 @@ First, make sure you specify `ascend` as the quantization method. Second, check
 ### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
 
 Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim, where the missing configuration_deepseek.py error has been fixed.
-
-### 3. What should be considered when converting DeepSeek series models with ModelSlim?
-
-When the MLA portion of the weights used the `W8A8_DYNAMIC` quantization with the torchair graph mode enabled, modify the configuration file in the CANN package to prevent incorrect inference results.
-
-The operation steps are as follows:
-
-1. Search in the CANN package directory, for example:
-   find /usr/local/Ascend/ -name fusion_config.json
-
-2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:
-
-```bash
-{
-    "Switch":{
-        "GraphFusion":{
-            "AddRmsNormDynamicQuantFusionPass":"off",
-            "MultiAddRmsNormDynamicQuantFusionPass":"off",
-```