Drop torchair (#4814)

ACLGraph is stable and fast now, so let's drop the torchair graph mode.

TODO: some logic that was added to adapt torchair should be cleaned up as well. We'll
do that in a follow-up PR.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Authored by wangxiyuan on 2025-12-10 09:20:40 +08:00, committed by GitHub
parent ba9cda9dfd, commit 835b4c8f1d
84 changed files with 77 additions and 16881 deletions

View File

@@ -251,7 +251,6 @@ This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com
- Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
- Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
- Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)
CI resources are limited, so you might need to reduce the number of layers of a model. Below is an example of how to generate a reduced-layer model:
1. Fork the original model repo on ModelScope. All the files in the repo except for weights are required.

View File

@@ -6,7 +6,7 @@ MTP boosts inference performance by parallelizing the prediction of multiple tok
## How to Use MTP
To enable MTP for DeepSeek-V3 models, add the following parameter when starting the service:
--speculative_config ' {"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False} '
--speculative_config ' {"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False} '
- `num_speculative_tokens`: The number of speculative tokens, which enables the model to predict multiple tokens at once, if provided. It defaults to the number in the draft model config if present; otherwise it is required.
- `disable_padded_drafter_batch`: Disable input padding for speculative decoding. If set to True, speculative input batches can contain sequences of different lengths, which may only be supported by certain attention backends. This currently only affects the MTP method of speculation; the default is False.
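For offline use, the same settings can be passed through the `LLM` API. A minimal sketch, assuming the renamed `mtp` method above and a placeholder weights path:
```python
from vllm import LLM, SamplingParams

# Placeholder path: point this at your DeepSeek-V3 weights.
llm = LLM(
    model="/path/to/DeepSeek-V3",
    trust_remote_code=True,
    # Mirrors the --speculative_config flag above, after the rename to "mtp".
    speculative_config={
        "method": "mtp",
        "num_speculative_tokens": 1,
        "disable_padded_drafter_batch": False,
    },
)
outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)
```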
@@ -74,21 +74,18 @@ If the bonus token is accepted, the MTP model performs inference for (num_specul
### Method Validation
- Currently, the spec_decode scenario only supports the ngram, eagle, eagle3, and deepseek_mtp methods. If an unsupported method is passed, the code raises an error to alert the user.
- Currently, the spec_decode scenario only supports the ngram, eagle, eagle3, and mtp methods. If an unsupported method is passed, the code raises an error to alert the user.
```
def get_spec_decode_method(method,
vllm_config,
device,
runner,
is_torchair_graph=False):
runner):
if method == "ngram":
return NgramProposer(vllm_config, device, runner)
elif method in ["eagle", "eagle3"]:
return EagleProposer(vllm_config, device, runner)
elif method == 'deepseek_mtp':
if is_torchair_graph:
return TorchairMtpProposer(vllm_config, device, runner)
elif method == 'mtp':
return MtpProposer(vllm_config, device, runner)
else:
raise ValueError("Unknown speculative decoding method: "

View File

@@ -128,9 +128,8 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"torchair_graph_config":{"enabled":false}}'
```
### Multi-node Deployment
@@ -190,9 +189,8 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.94 \
--speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"torchair_graph_config":{"enabled":false}}'
```
**Node 1**
@@ -247,9 +245,8 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.94 \
--speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"torchair_graph_config":{"enabled":false}}'
```
### Prefill-Decode Disaggregation
@@ -421,7 +418,7 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"enable_shared_expert_dp": true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",

View File

@@ -173,8 +173,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
--gpu-memory-utilization 0.92
```
### Multi-node Deployment
@@ -225,8 +224,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--max-num-batched-tokens 17450 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
--gpu-memory-utilization 0.9
```
**Node 1**
@@ -269,8 +267,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
--gpu-memory-utilization 0.92
```
::::
@@ -316,8 +313,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--trust-remote-code \
--quantization ascend \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
--gpu-memory-utilization 0.9
```
**Node 1**
@@ -362,8 +358,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--trust-remote-code \
--quantization ascend \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
--gpu-memory-utilization 0.92
```
::::

View File

@@ -136,8 +136,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"torchair_graph_config":{"enabled":true}}'
--gpu-memory-utilization 0.9
```
**Node 1**
@@ -181,8 +180,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true}}'
--gpu-memory-utilization 0.92
```
The deployment view looks like:

View File

@@ -92,8 +92,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"torchair_graph_config":{"enabled":true}}'
--gpu-memory-utilization 0.9
```
**Node 1**
@@ -136,8 +135,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true}}'
--gpu-memory-utilization 0.92
```
The deployment view looks like:

View File

@@ -153,12 +153,7 @@ if __name__ == "__main__":
enable_expert_parallel=True,
distributed_executor_backend="mp",
max_model_len=1024,
trust_remote_code=True,
additional_config={
'torchair_graph_config': {
'enabled': True,
}
})
trust_remote_code=True)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:

View File

@@ -27,7 +27,6 @@ The following table lists additional configuration options available in vLLM Asc
| Name | Type | Default | Description |
|-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| `xlite_graph_config` | dict | `{}` | Configuration options for xlite graph mode |
| `torchair_graph_config` | dict | `{}` | Configuration options for torchair graph mode |
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by RLHF or UT/E2E test cases. |
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
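These options are passed as a dict, either via `--additional-config` on the command line (as in the serve examples elsewhere in this commit) or via the `additional_config` argument of the `LLM` API. A minimal sketch with placeholder values only (xlite and refresh at their documented defaults, weight prefetch turned on):
```python
from vllm import LLM

# Illustrative values only; the model name is a placeholder.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    trust_remote_code=True,
    additional_config={
        "xlite_graph_config": {"enabled": False},
        "weight_prefetch_config": {"enabled": True},
        "refresh": False,
    },
)
```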
@@ -52,21 +51,6 @@ The details of each configuration option are as follows:
| `enabled` | bool | `False` | Whether to enable xlite graph mode. Currently only Llama or Qwen dense series models are supported. |
| `full_mode` | bool | `False` | Whether to enable xlite for both the prefill and decode stages. By default, xlite is only enabled for the decode stage. |
**torchair_graph_config**
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported. |
| `mode` | str | `None` | When using reduce-overhead mode for torchair, it needs to be set. |
| `enable_multistream_mla`| bool | `False` | Whether to put vector operators of MLA to another stream. This option only takes effect on models using MLA (for example, DeepSeek). |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization. |
| `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
| `use_cached_graph` | bool | `False` | Whether to use cached graph. |
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache. |
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty. |
| `enable_kv_nz`| bool | `False` | Whether to enable KV Cache NZ layout. This option only takes effect on models using MLA (for example, DeepSeek). |
| `enable_super_kernel` | bool | `False` | Whether to enable super kernel to fuse operators in DeepSeek MoE layers. This option only takes effect on MoE models using dynamic w8a8 quantization.|
**weight_prefetch_config**
| Name | Type | Default | Description |
@@ -80,13 +64,6 @@ An example of additional configuration is as follows:
```
{
"torchair_graph_config": {
"enabled": True,
"use_cached_graph": True,
"graph_batch_sizes": [1, 2, 4, 8],
"graph_batch_sizes_init": False,
"enable_kv_nz": False
},
"weight_prefetch_config": {
"enabled": True,
"prefetch_ratio": {

View File

@@ -10,9 +10,8 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
From v0.9.1rc1 with V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
There are three kinds of graph mode supported by vLLM Ascend:
There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek series models are supported.
- **XliteGraph**: This is the euler xlite graph mode. In v0.11.0, only Llama and Qwen dense series models are supported.
## Using ACLGraph
@@ -35,29 +34,6 @@ Online example:
vllm serve Qwen/Qwen2-7B-Instruct
```
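For offline inference, nothing extra is needed either, since ACLGraph is the default mode. A minimal sketch mirroring the online example above:
```python
from vllm import LLM

# ACLGraph is the default graph mode, so no additional_config is required.
model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```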
## Using TorchAirGraph
If you want to run DeepSeek series models with the graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional configuration is required.
Offline example:
```python
import os
from vllm import LLM
# TorchAirGraph only works without chunked-prefill now
model = LLM(model="path/to/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True}})
outputs = model.generate("Hello, how are you?")
```
Online example:
```shell
vllm serve path/to/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}}'
```
You can find more details about additional configuration [here](../configuration/additional_config.md).
## Using XliteGraph
If you want to run Llama or Qwen dense series models with xlite graph mode, please install xlite and set `xlite_graph_config`.
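As a rough sketch, assuming xlite is already installed and that the `enabled` switch from the additional_config table is sufficient:
```python
from vllm import LLM

# Assumes xlite is installed; only the documented `enabled` switch is set here.
model = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    additional_config={"xlite_graph_config": {"enabled": True}},
)
outputs = model.generate("Hello, how are you?")
```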
@@ -87,7 +63,7 @@ You can find more details about xlite [here](https://gitee.com/openeuler/GVirt/b
## Fallback to the Eager Mode
If `ACLGraph`, `TorchAirGraph` and `XliteGraph` all fail to run, you should fall back to the eager mode.
If `ACLGraph` and `XliteGraph` both fail to run, you should fall back to the eager mode.
Offline example:
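A minimal sketch, assuming only that `enforce_eager=True` is set when initializing the model, as noted at the top of this guide:
```python
from vllm import LLM

# Disable graph capture entirely and run op-by-op in eager mode.
model = LLM(model="Qwen/Qwen2-7B-Instruct", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```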

View File

@@ -104,22 +104,3 @@ First, make sure you specify `ascend` as the quantization method. Second, check
### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim, where the missing configuration_deepseek.py error has been fixed.
### 3. What should be considered when converting DeepSeek series models with ModelSlim?
When the MLA portion of the weights uses `W8A8_DYNAMIC` quantization with the torchair graph mode enabled, modify the configuration file in the CANN package to prevent incorrect inference results.
The operation steps are as follows:
1. Search in the CANN package directory, for example:
   `find /usr/local/Ascend/ -name fusion_config.json`
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find; the location is as follows:
```bash
{
"Switch":{
"GraphFusion":{
"AddRmsNormDynamicQuantFusionPass":"off",
"MultiAddRmsNormDynamicQuantFusionPass":"off",
```