[bugfix] restore pr-7029 and fix patch error (#7294)

### What this PR does / why we need it?

This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using the `lightning_indexer_quant` ops in the pd-mix stage. The original PR was reverted by #7288 because the patch did not work with the recompute scheduler. This PR also fixes the patching issue so that it works correctly with the recompute scheduler.

### Does this PR introduce _any_ user-facing change?

Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to `"true"` in `additional_config`.

- vLLM version: v0.17.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
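
As a minimal sketch of the user-facing change, the snippet below builds an `additional_config` mapping with `enable_sparse_c8` set to the string `"true"` and serializes it to JSON. It assumes the option is a top-level key of `additional_config`, and that the JSON is handed to the server through an `--additional-config` command-line flag; neither detail is stated in this PR description, so check both against the vllm-ascend configuration documentation.

```python
import json

# Assumption: `enable_sparse_c8` is a top-level key of `additional_config`
# and takes the string "true", as worded in the PR description.
additional_config = {"enable_sparse_c8": "true"}

# Serialize the mapping for command-line use.
# Assumption: the server accepts it through an `--additional-config` flag.
additional_config_json = json.dumps(additional_config)
serve_flag = f"--additional-config '{additional_config_json}'"

print(serve_flag)
```

For offline inference, the same dictionary would presumably be passed as the `additional_config` keyword argument when constructing the engine, rather than as a JSON string.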
2026-03-16 15:39:42 +08:00
parent 9320365dab
commit 4d443b9228
25 changed files with 4309 additions and 78 deletions
--- a/vllm_ascend/patch/__init__.py
+++ b/vllm_ascend/patch/__init__.py
@@ -137,6 +137,28 @@
 #       Remove this patch if upstream provides an official NPU graph-capture
 #       guidance / auto-configuration path for HCCL.
 #
+# ** 8. File: platform/patch_kv_cache_interface.py**
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#   1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
+#    Why:
+#       The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
+#       and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
+#       models. Unlike the GPU path, where cache management is handled by an
+#       additional indexer module, extending this class directly simplifies the
+#       corresponding `model_runner` implementation on NPU.
+#
+#       This patch also adds Sparse C8 support for DSA models on NPU. As part
+#       of that support, members such as `page_size_bytes` need to be adapted,
+#       so they are overridden here as well to preserve overall readability.
+#    How:
+#       This patch subclasses the original implementation, overrides selected
+#       methods, and adds DSA-specific attributes and helpers with default
+#       values where needed.
+#    Related PR (if no, explain why):
+#       https://github.com/vllm-project/vllm/pull/25896
+#    Future Plan:
+#       Remove this patch after the upcoming KV cache spec refactor.
+#
 # * Worker Patch:
 # ===============
 #