### What this PR does / why we need it?

This reverts commit 7ed9e9de69, which introduced an issue where the patch does not work with the recompute scheduler enabled.

- vLLM version: v0.17.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
@@ -137,28 +137,6 @@
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
#
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
#      Why:
#          The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
#      and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
#      models. Unlike the GPU path, where cache management is handled by an
#      additional indexer module, extending this class directly simplifies the
#      corresponding `model_runner` implementation on NPU.
#
#          This patch also adds Sparse C8 support for DSA models on NPU. As part
#      of that support, members such as `page_size_bytes` need to be adapted,
#      so they are overridden here as well to preserve overall readability.
#      How:
#          This patch subclasses the original implementation, overrides selected
#      methods, and adds DSA-specific attributes and helpers with default
#      values where needed.
#      Related PR (if no, explain why):
#          https://github.com/vllm-project/vllm/pull/25896
#      Future Plan:
#          Remove this patch after the upcoming KV cache spec refactor.
#
# * Worker Patch:
# ===============
#
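The pattern the removed comment describes (subclass the KV cache spec, override selected members such as `page_size_bytes`, add DSA/Sparse C8 attributes with defaults) can be sketched roughly as below. All class names, fields, and the byte accounting here are illustrative assumptions for the pattern only, not the actual vllm-ascend or vLLM implementation:

```python
from dataclasses import dataclass


@dataclass
class MLAAttentionSpec:
    """Stand-in for vllm.v1.kv_cache_interface.MLAAttentionSpec (simplified)."""
    block_size: int       # tokens per KV cache block
    num_kv_heads: int
    head_size: int        # e.g. kv_lora_rank + qk_rope_head_dim for MLA
    dtype_size: int = 2   # bytes per element (bf16/fp16)

    @property
    def page_size_bytes(self) -> int:
        # MLA caches one combined latent vector per token.
        return self.block_size * self.num_kv_heads * self.head_size * self.dtype_size


@dataclass
class AscendMLAAttentionSpec(MLAAttentionSpec):  # hypothetical patched subclass
    """Extends the spec with a DSA-specific attribute, as the patch comment
    describes: override selected members, add attributes with defaults."""
    use_sparse_c8: bool = False  # Sparse C8: int8 cache, 1 byte per element

    @property
    def page_size_bytes(self) -> int:
        elem_bytes = 1 if self.use_sparse_c8 else self.dtype_size
        return self.block_size * self.num_kv_heads * self.head_size * elem_bytes
```

With this shape, code that sizes cache pages from the spec keeps working unchanged, while the C8 path halves the per-page footprint relative to a 2-byte dtype.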