Chen Chen 16cb3cc45d adapt the mla_v1 with the mla_preprocess kernel (#3397)
### What this PR does / why we need it?

This pull request integrates a new `mla_preprocess` kernel to create an
optimized path for MLA (Multi-head Latent Attention) decode operations on
Ascend hardware, gated behind an environment flag. The changes include
new utility functions for weight transformation, a method that prepares
weights for the fused kernel, and logic that routes decode-only batches to
this new path. My review identified a critical bug in the `transdata`
utility function, where the padding dimensions are swapped; this produces
incorrect tensor shapes and causes kernel failures. It also flagged a
high-severity maintainability issue in the `trans_rope_weight` function,
which modifies its input in-place, and I have provided a pure-function
alternative.
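The actual implementations are not shown in this description. As a minimal sketch of the two issues, assuming 2-D weights, a hypothetical 16-element alignment block, and an illustrative rotation for the rope columns (the real kernel's layout may differ), the fixed padding order and the pure-function style look like this:

```python
import numpy as np

BLOCK = 16  # hypothetical alignment required by the fused kernel


def transdata(x: np.ndarray) -> np.ndarray:
    """Pad a 2-D weight so both dims are multiples of BLOCK.

    The bug described above corresponds to swapping the per-axis pad
    amounts, e.g. np.pad(x, ((0, pad_cols), (0, pad_rows))), which pads
    the wrong axes and yields an incorrectly shaped tensor.
    """
    rows, cols = x.shape
    pad_rows = (-rows) % BLOCK
    pad_cols = (-cols) % BLOCK
    return np.pad(x, ((0, pad_rows), (0, pad_cols)))


def trans_rope_weight(weight: np.ndarray, rope_dim: int) -> np.ndarray:
    """Pure-function variant: copy first, never mutate the caller's array.

    The rotation of the last rope_dim columns is only illustrative of a
    rope weight reordering; the in-place version would write into
    `weight` directly instead of into a copy.
    """
    out = weight.copy()
    out[..., -rope_dim:] = np.roll(out[..., -rope_dim:], rope_dim // 2, axis=-1)
    return out
```

The pure variant costs one extra copy per weight, but weight preparation runs once at load time, so the clarity is usually worth it.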

### Does this PR introduce _any_ user-facing change?

No user-facing changes by default. Users can opt in to the `mla_preprocess`
kernel by setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO`.
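For example, before launching the server (the value `1` is assumed here; the PR text only names the variable):

```shell
# Opt in to the fused MLA preprocess path (off by default).
export VLLM_ASCEND_ENABLE_MLAPO=1
```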

### How was this patch tested?

Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-10-15 10:34:25 +08:00