add mla_preprocess kernel (#3226)

### What this PR does / why we need it? - Adds the `mla_preprocess` custom kernel to provide an optimized pre-processing operator for Multi-head Latent Attention (MLA) on Ascend NPUs. - Wires the new kernel into the C++ extension pipeline so vLLM can invoke it directly, cutting Python-side tensor shuffling and memory copies that previously bottlenecked MLA compilation paths. ### Does this PR introduce any user-facing change? - No. The change only introduces a low-level kernel; public APIs and inference behavior remain unchanged. ### How was this patch tested? - Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path. - vLLM version: v0.11.0 Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-10-12 07:39:45 +08:00
parent 1b1207e3c3
commit bcc313e8f2
32 changed files with 9158 additions and 3 deletions
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -12,8 +12,8 @@ repos:
    - id: codespell
      args: [
        --toml, pyproject.toml,
-        '--skip', 'tests/e2e/multicard/test_torchair_graph_mode.py,tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**,.github/**,typos.toml',
-        '-L', 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn'
+        '--skip', 'tests/e2e/multicard/test_torchair_graph_mode.py,csrc/mla_preprocess/**,tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**,./vllm_ascend.egg-info/**,.github/**,typos.toml',
+        '-L', 'CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,ArchType,AND'
      ]
      additional_dependencies:
        - tomli
@@ -35,6 +35,10 @@ repos:
  rev: v1.32.0
  hooks:
  - id: typos
+    args: [
+      "--force-exclude",
+      "--exclude", "csrc/mla_preprocess/**"
+    ]
 - repo: https://github.com/PyCQA/isort
  rev: 6.0.1
  hooks: