[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)

### What this PR does / why we need it?
This PR adds W8A8C8 support for dsv3.2/glm5, built on the
lightning_indexer_quant ops. It mainly targets the pd-mix scenario.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.
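
For readers new to the notation: C8 is read here as storing the KV cache in 8 bits, alongside 8-bit weights and activations (W8A8). The sketch below illustrates the general idea with symmetric per-token int8 quantization in plain Python. It is an assumption-laden illustration, not the kernel this PR adds: the function names are hypothetical, and the real quantization layout and granularity are defined by the NPU operators.

```python
def quantize_int8(row):
    """Symmetric int8 quantization of one token's values.

    Hypothetical helper for illustration only. Returns the quantized
    integers and the scale needed to recover approximate floats.
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    # Guard against an all-zero row so we never divide by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from int8 values and their scale."""
    return [v * scale for v in quantized]


values = [0.5, -1.0, 0.25]
q, k_scale = quantize_int8(values)
restored = dequantize_int8(q, k_scale)
```

Each token therefore needs its scale stored next to its quantized cache entry, which is why storing `k_scale` shows up as a separate cost in the plan below.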

The next steps are planned in three parts:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.
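
As a rough picture of what the scatter in item ① does, the sketch below writes per-token scales into a paged cache using a slot mapping. This is plain Python for illustration, under the assumption that a slot index encodes `block * block_size + offset`. The names `scatter_scales`, `scale_cache`, and `slot_mapping` are hypothetical here, and the real operator runs on the NPU with its own layout.

```python
def scatter_scales(scale_cache, slot_mapping, new_scales, block_size):
    """Write one scale per token into a paged cache, in place.

    scale_cache:  list of blocks, each a list of block_size floats
    slot_mapping: flat slot index for each new token (assumed layout:
                  slot = block_index * block_size + offset_in_block)
    new_scales:   one scale per new token, same order as slot_mapping
    """
    for slot, scale in zip(slot_mapping, new_scales):
        block, offset = divmod(slot, block_size)
        scale_cache[block][offset] = scale


# Two blocks of four slots each; two new tokens land in slots 1 and 6.
cache = [[0.0] * 4 for _ in range(2)]
scatter_scales(cache, slot_mapping=[1, 6], new_scales=[0.5, 0.25], block_size=4)
```

Because this write happens for every newly cached token, a faster scatter operator directly reduces the overhead of enabling C8.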


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
This commit is contained in:
rjg-lyh
2026-03-13 14:47:42 +08:00
committed by GitHub
parent df1ee8070d
commit 7ed9e9de69
24 changed files with 4279 additions and 77 deletions

@@ -137,6 +137,28 @@
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
#
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
# Why:
# The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
# and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
# models. Unlike the GPU path, where cache management is handled by an
# additional indexer module, extending this class directly simplifies the
# corresponding `model_runner` implementation on NPU.
#
# This patch also adds Sparse C8 support for DSA models on NPU. As part
# of that support, members such as `page_size_bytes` need to be adapted,
# so they are overridden here as well to preserve overall readability.
# How:
# This patch subclasses the original implementation, overrides selected
# methods, and adds DSA-specific attributes and helpers with default
# values where needed.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/25896
# Future Plan:
# Remove this patch after the upcoming KV cache spec refactor.
#
# * Worker Patch:
# ===============
#