[bugfix] restore pr-7029 and fix patch error (#7294)
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.
The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.
This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.
### Does this PR introduce _any_ user-facing change?
Yes. To enable lightning indexer (LI) C8, users need to set the
`enable_sparse_c8` option to `"true"` in `additional_config`.
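
As a rough illustration of the setting above, the following sketch builds the `additional_config` payload and serializes it to JSON. The option name and value come from this description. The `--additional-config` CLI flag, the `LLM(..., additional_config=...)` keyword, and the `<model>` placeholder are assumptions about how the config is normally passed, and are not confirmed by this PR.

```python
import json
import shlex

# Option name and value as stated in the PR description.
additional_config = {"enable_sparse_c8": "true"}

# JSON form of the same mapping, e.g. for use on a command line.
additional_config_json = json.dumps(additional_config)

# Assumed CLI usage; "<model>" is a placeholder, not a real model name.
command = "vllm serve <model> --additional-config " + shlex.quote(additional_config_json)
print(command)

# Assumed offline usage (not executed here, since it needs vLLM installed):
#   from vllm import LLM
#   llm = LLM(model="<model>", additional_config=additional_config)
```
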
- vLLM version: v0.17.0
- vLLM main: 4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
@@ -137,6 +137,28 @@
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
#
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
# Why:
# The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
# and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
# models. Unlike the GPU path, where cache management is handled by an
# additional indexer module, extending this class directly simplifies the
# corresponding `model_runner` implementation on NPU.
#
# This patch also adds Sparse C8 support for DSA models on NPU. As part
# of that support, members such as `page_size_bytes` need to be adapted,
# so they are overridden here as well to preserve overall readability.
# How:
# This patch subclasses the original implementation, overrides selected
# methods, and adds DSA-specific attributes and helpers with default
# values where needed.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/25896
# Future Plan:
# Remove this patch after the upcoming KV cache spec refactor.
#
# * Worker Patch:
# ===============
#
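
The subclass-and-override approach described under `How:` can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `platform/patch_kv_cache_interface.py`: `BaseSpec`, `NPUSparseSpec`, the field names, the page-size formula, and the `fake_module` namespace are all hypothetical stand-ins, and the numbers are arbitrary.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class BaseSpec:
    """Hypothetical stand-in for an upstream KV cache spec class."""
    block_size: int
    head_size: int
    dtype_bytes: int

    @property
    def page_size_bytes(self) -> int:
        # Illustrative formula only; the real upstream computation may differ.
        return self.block_size * self.head_size * self.dtype_bytes


@dataclass(frozen=True)
class NPUSparseSpec(BaseSpec):
    """Subclass that adds extra attributes with defaults and overrides a member."""
    # Hypothetical DSA-specific fields; defaults keep existing call sites valid.
    indexer_head_dim: int = 0
    indexer_dtype_bytes: int = 0

    @property
    def page_size_bytes(self) -> int:
        # Account for the extra per-block indexer cache on top of the base size.
        indexer = self.block_size * self.indexer_head_dim * self.indexer_dtype_bytes
        return super().page_size_bytes + indexer


# Stand-in for the module being patched; a real patch would rebind the
# attribute on the imported upstream module instead.
fake_module = SimpleNamespace(Spec=BaseSpec)
fake_module.Spec = NPUSparseSpec

old_style = fake_module.Spec(block_size=128, head_size=576, dtype_bytes=2)
new_style = fake_module.Spec(block_size=128, head_size=576, dtype_bytes=2,
                             indexer_head_dim=128, indexer_dtype_bytes=1)
print(old_style.page_size_bytes, new_style.page_size_bytes)
```

Because the added fields have defaults, callers that construct the spec with only the original arguments keep working, and they get the base page size back unchanged.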