### What this PR does / why we need it?
Revert "[KV-Sharing] Support KV-Sharing feature in CLA models" (#4138), as it causes a hang error with DeepSeek V3.2.
- vLLM version: release/v0.13.0
- vLLM main: 5fbfa8d9ef
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
```diff
@@ -1195,10 +1195,6 @@ class NPUModelRunner(GPUModelRunner):
 
     def _build_attn_state(self, num_reqs, num_scheduled_tokens,
                           num_valid_tokens):
-        if self.shared_kv_cache_layers is not None:
-            # sharing kv across layers need to read the kvcache,
-            # directly return chunked prefill in this scenario
-            return AscendAttentionState.ChunkedPrefill
         if np.array_equal(self.seq_lens.np[:num_reqs], num_scheduled_tokens):
             attn_state = AscendAttentionState.PrefillNoCache
         # We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
```
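For context, below is a minimal, self-contained sketch of the attention-state selection that the reverted hunk touched. The names `AscendAttentionState`, `PrefillNoCache`, `ChunkedPrefill`, and `shared_kv_cache_layers` come from the diff above; the standalone function signature, the `DecodeOnly` member, and the example inputs are illustrative assumptions, not the actual `NPUModelRunner` implementation.

```python
# Sketch of the state-selection logic around the reverted hunk.
from enum import Enum
from typing import Optional

import numpy as np


class AscendAttentionState(Enum):
    PrefillNoCache = 0
    ChunkedPrefill = 1
    DecodeOnly = 2  # assumed member; the diff only shows the two above


def build_attn_state(
    seq_lens: np.ndarray,
    num_scheduled_tokens: np.ndarray,
    shared_kv_cache_layers: Optional[dict] = None,
) -> AscendAttentionState:
    # The reverted branch: with cross-layer KV sharing, consumer layers must
    # read the producer layer's KV cache, so every batch was forced down the
    # ChunkedPrefill path. The revert removes this early return.
    if shared_kv_cache_layers is not None:
        return AscendAttentionState.ChunkedPrefill
    # Pure prefill: every request's scheduled token count equals its full
    # sequence length, i.e. nothing is served from the cache.
    if np.array_equal(seq_lens, num_scheduled_tokens):
        return AscendAttentionState.PrefillNoCache
    # Otherwise assume the decode stage (only the last token misses the cache).
    return AscendAttentionState.DecodeOnly


# Example: two prefill requests with no cached tokens select PrefillNoCache.
print(build_attn_state(np.array([8, 16]), np.array([8, 16])))
```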