xc-llm-ascend

Files

Wang Xiaoran 3ce5a34468 [BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711 )

### What this PR does / why we need it?
This PR fixes a bug in Xlite
backend(https://atomgit.com/openeuler/GVirt/issues/1), The direct cause
of the problem is that the XModel::PrepareAttn function obtained an
illegal number of tokens to be inferred, -540. This illegal value is due
to the padding feature of inference in graph mode and the residual state
across steps. This issue is triggered when a prefill request is newly
added in a step and a decode ends simultaneously. It is first fixed
using num_decode_tokens instead of attn_metadata.num_decodes.
1. In graph mode, vllm_ascend has padding characteristics. In the
_prepare_inputs function, if the number of tokens to be inferred is less
than the set threshold (8 in this case), the attn_metadata.num_decode
array will be expanded to 8.
2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc
of NPUModelRunner to record the tokens to be inferred. Due to poor
coordination with the graph mode padding mechanism when crossing steps,
in some cases (such as when a decode request is completed in a certain
step and a new prefill request is added at the same time), negative
values may be calculated for attn_metadata.query_lens.
3. After type conversion, the negative values in query_lens cause an
overflow. Xlite detects that the number of tokens to be inferred for the
decode request is too large and triggers a "decode len too long" alert.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Same with https://atomgit.com/openeuler/GVirt/issues/1
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wwwumr <1127858301@qq.com>

2026-01-09 15:55:30 +08:00

__init__.py

[Feat] Add Euler xlite graph wrapper support (#4526 )

2025-12-08 08:27:46 +08:00

xlite_model_runner.py

[Feat] Add Euler xlite graph wrapper support (#4526 )

2025-12-08 08:27:46 +08:00

xlite_worker.py

[Bugfix] fix dcp_only bug and add e2e accuracy test for dcp only and pcp only (#5565 )

2026-01-06 22:48:21 +08:00

xlite.py

[BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711 )

2026-01-09 15:55:30 +08:00