[Feat]Xlite Qwen3-vl Support (#5228)

### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.

### How was this patch tested?
vLLM version: v0.12.0 

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
This commit is contained in:
lvjunqi
2025-12-22 16:30:52 +08:00
committed by GitHub
parent 78aa7f2693
commit 55beac9c91
4 changed files with 19 additions and 9 deletions

View File

@@ -49,7 +49,7 @@ The details of each configuration option are as follows:
**xlite_graph_config**
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable xlite graph mode. Currently only Llama or Qwen dense series models are supported. |
| `enabled` | bool | `False` | Whether to enable xlite graph mode. Currently only Llama, Qwen dense series models, and Qwen3-vl are supported. |
| `full_mode` | bool | `False` | Whether to enable xlite for both the prefill and decode stages. By default, xlite is only enabled for the decode stage. |
**weight_prefetch_config**

View File

@@ -12,7 +12,7 @@ From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by defa
There are two kinds for graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
- **XliteGraph**: This is the euler xlite graph mode. In v0.11.0, only Llama and Qwen dense serise models are supported.
- **XliteGraph**: This is the openeuler xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, and Qwen3-vl are supported.
## Using ACLGraph
ACLGraph is enabled by default. Take Qwen series models as an example, just set to use V1 Engine is enough.
@@ -36,7 +36,7 @@ vllm serve Qwen/Qwen2-7B-Instruct
## Using XliteGraph
If you want to run Llama or Qwen dense series models with xlite graph mode, please install xlite, and set xlite_graph_config.
If you want to run Llama, Qwen dense series models, or Qwen3-vl with xlite graph mode, please install xlite, and set xlite_graph_config.
```bash
pip install xlite
@@ -59,7 +59,7 @@ Online example:
vllm serve path/to/Qwen3-32B --tensor-parallel-size 8 --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'
```
You can find more details abort xlite [here](https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md)
You can find more details abort xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
## Fallback to the Eager Mode