add mla_preprocess kernel (#3226)

### What this PR does / why we need it?

- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?

- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.

### How was this patch tested?

- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.

- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>

Chen Chen, 2025-10-12 07:39:45 +08:00, committed by GitHub

parent 1b1207e3c3

commit bcc313e8f2

32 changed files with 9158 additions and 3 deletions

csrc/mla_preprocess/op_kernel/mla_preprocess_mix_bf16.hpp (2918 changes)

File diff suppressed because it is too large.
