Commit Graph

7 Commits

Author SHA1 Message Date
Wang Xiaoran
3ce5a34468 [BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711)
### What this PR does / why we need it?
This PR fixes a bug in Xlite
backend(https://atomgit.com/openeuler/GVirt/issues/1), The direct cause
of the problem is that the XModel::PrepareAttn function obtained an
illegal number of tokens to be inferred, -540. This illegal value is due
to the padding feature of inference in graph mode and the residual state
across steps. This issue is triggered when a prefill request is newly
added in a step and a decode ends simultaneously. It is first fixed
using num_decode_tokens instead of attn_metadata.num_decodes.
1. In graph mode, vllm_ascend has padding characteristics. In the
_prepare_inputs function, if the number of tokens to be inferred is less
than the set threshold (8 in this case), the attn_metadata.num_decode
array will be expanded to 8.
2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc
of NPUModelRunner to record the tokens to be inferred. Due to poor
coordination with the graph mode padding mechanism when crossing steps,
in some cases (such as when a decode request is completed in a certain
step and a new prefill request is added at the same time), negative
values may be calculated for attn_metadata.query_lens.
3. After type conversion, the negative values in query_lens cause an
overflow. Xlite detects that the number of tokens to be inferred for the
decode request is too large and triggers a "decode len too long" alert.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Same with https://atomgit.com/openeuler/GVirt/issues/1
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wwwumr <1127858301@qq.com>
2026-01-09 15:55:30 +08:00
Shanshan Shen
b94d589769 [MM][Bugfix] Update hf_config to hf_text_config (#5319)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/5205, update
`hf_config` to `hf_text_config`.

Find more details at
https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534
and
https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-06 16:41:39 +08:00
Magnus
a9fccbeb30 [CI] add xlite e2e test (#5305)
### What this PR does / why we need it?
add xlite e2e test

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: DaweiChang <405739598@qq.com>
2025-12-25 09:17:06 +08:00
lvjunqi
55beac9c91 [Feat]Xlite Qwen3-vl Support (#5228)
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.

### How was this patch tested?
vLLM version: v0.12.0 

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
2025-12-22 16:30:52 +08:00
weijinqian0
35ad11b637 [Refactor] remove some metadata variables in attention_v1. (#5160)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629

Reason:

The metadata data class contains an excessive number of variables. We
will inherit the metadata of the community and simultaneously remove
some variables that are no longer needed at present.

Todo:
1. remove attn_state partly.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-19 14:57:09 +08:00
zzzzwwjj
cc23067f1e [refactor] refactor weight trans nz and transpose (#4878)
### What this PR does / why we need it?

Now `VLLM_ASCEND_ENABLE_NZ` will have three options:
0: disable nz;
1: only quant case enable nz;
2: enable nz as long as possible;

And `VLLM_ASCEND_ENABLE_NZ`=1 by default.

All cases are shown in the table below:

|  | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by
default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose
case | only support transpose case | linear: only support not transpose
case<br>gmm: only support transpose case | same to fp16/bf16 |

Some exceptional cases:
1. MLAPO op need to do some additional processing on the weights,
including trans nz. If use MLAPO op, some weight will be transformed to
nz forcely;
2. MLA/SFA's weight `W_UV` will be used by op
`torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support
nz currently;

### Does this PR introduce _any_ user-facing change?
Now fp16/bf16 weight will not trans nz by default.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-19 14:27:24 +08:00
LuLina
2be0fe2691 [Feat] Add Euler xlite graph wrapper support (#4526)
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6) 
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph


### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' #
Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true,
"full_mode": true}}' # Enabled for prefill and decode

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 08:27:46 +08:00