[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)

### What this PR does / why we need it?

This PR adds W8A8C8 support for dsv3.2/glm5 using the lightning_indexer_quant ops, mainly for the pd-mix stage. Because the code for the PD-disaggregated scenario is still being refactored and cleaned up, this PR prioritizes getting C8 working in the pd-mix scenario.

The next steps are planned in three parts:

① Once the optimized scatter operator is updated, we will replace the original operator to improve the performance of storing k_scale.

② Once the code logic for the PD-disaggregated scenario becomes stable, we will carry out more comprehensive validation and make the appropriate adaptations.

③ Enabling C8 currently introduces several new operators whose performance still needs improvement, so performance may regress in some scenarios. Only after all of these operators are fully ready can we ensure that this feature causes no performance degradation. At that point, we will enable the feature by default and remove the switch in `additional_config`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with newly added and existing tests.

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
parent df1ee8070d
commit 7ed9e9de69
24 changed files with 4279 additions and 77 deletions
--- a/vllm_ascend/patch/__init__.py
+++ b/vllm_ascend/patch/__init__.py
@@ -137,6 +137,28 @@
 #       Remove this patch if upstream provides an official NPU graph-capture
 #       guidance / auto-configuration path for HCCL.
 #
+# ** 8. File: platform/patch_kv_cache_interface.py**
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#   1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
+#    Why:
+#       The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
+#       and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
+#       models. Unlike the GPU path, where cache management is handled by an
+#       additional indexer module, extending this class directly simplifies the
+#       corresponding `model_runner` implementation on NPU.
+#
+#       This patch also adds Sparse C8 support for DSA models on NPU. As part
+#       of that support, members such as `page_size_bytes` need to be adapted,
+#       so they are overridden here as well to preserve overall readability.
+#    How:
+#       This patch subclasses the original implementation, overrides selected
+#       methods, and adds DSA-specific attributes and helpers with default
+#       values where needed.
+#    Related PR (if no, explain why):
+#       https://github.com/vllm-project/vllm/pull/25896
+#    Future Plan:
+#       Remove this patch after the upcoming KV cache spec refactor.
+#
 # * Worker Patch:
 # ===============
 #
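
The "How" note in the patch description above (subclass the original spec, override selected members such as `page_size_bytes`, and add new attributes with defaults) can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `BaseSpec`, `PatchedSpec`, `cache_quantized`, `scale_bytes_per_token`, the stand-in `upstream` namespace, and the sizes used are hypothetical and are not the actual `MLAAttentionSpec` fields or the code in `patch_kv_cache_interface.py`.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class BaseSpec:
    """Hypothetical, simplified stand-in for an upstream KV cache spec."""
    block_size: int    # tokens per cache block
    num_kv_heads: int
    head_size: int     # cached elements per token per head
    dtype_bytes: int   # bytes per cached element

    @property
    def page_size_bytes(self) -> int:
        return (self.block_size * self.num_kv_heads
                * self.head_size * self.dtype_bytes)


@dataclass(frozen=True)
class PatchedSpec(BaseSpec):
    """Subclass that adds hypothetical quantized-cache attributes.

    The new fields have defaults, so existing call sites keep working.
    """
    cache_quantized: bool = False   # default preserves base behaviour
    scale_bytes_per_token: int = 0  # assumed extra bytes for per-token scales

    @property
    def page_size_bytes(self) -> int:
        if not self.cache_quantized:
            return super().page_size_bytes
        # Assumed layout: 1 byte per int8 element plus per-token scale storage.
        payload = self.block_size * self.num_kv_heads * self.head_size
        scales = (self.block_size * self.num_kv_heads
                  * self.scale_bytes_per_token)
        return payload + scales


# Stand-in for the upstream module whose attribute the patch replaces.
upstream = SimpleNamespace(Spec=BaseSpec)
upstream.Spec = PatchedSpec  # later lookups of upstream.Spec get the subclass

default = upstream.Spec(block_size=128, num_kv_heads=1,
                        head_size=576, dtype_bytes=2)
quantized = upstream.Spec(block_size=128, num_kv_heads=1,
                          head_size=576, dtype_bytes=2,
                          cache_quantized=True, scale_bytes_per_token=4)

print(default.page_size_bytes, quantized.page_size_bytes)
```

Because the added attributes default to the non-quantized path, callers that construct the spec without them see the same page size as before the patch; only callers that opt in get the adjusted size.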