[HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007)

### What this PR does / why we need it?
This PR fixes a few issues in prefill disaggregation:
1. Fix a kvcache address alignment issue: llmdatadist requires the
addresses of tensors to be aligned to 2 MB.
2. Fix a kvcache shape error: llmdatadist requires k/v tensors of shape
[num_blocks, ...], but the implementation before this PR used
[2, num_blocks, ...], which breaks prefill disaggregation.
3. Use the hybrid kv cache only when running qwen3_next, to fix an
accuracy issue in prefill disaggregation.
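
The first two fixes can be illustrated with a minimal arithmetic sketch (hypothetical helper names, not the actual vllm-ascend code): llmdatadist needs each kv-cache tensor's base address rounded up to a 2 MB boundary, and expects k and v as separate `[num_blocks, ...]` tensors rather than one stacked `[2, num_blocks, ...]` tensor.

```python
ALIGN = 2 * 1024 * 1024  # llmdatadist requires 2 MB-aligned tensor addresses

def align_up(addr: int, align: int = ALIGN) -> int:
    """Round an address up to the next alignment boundary (hypothetical helper)."""
    return (addr + align - 1) // align * align

# An allocation landing at 0x3F0000 must start at the next 2 MB mark, 0x400000;
# already-aligned addresses are left unchanged.
assert align_up(0x3F0000) == 0x400000
assert align_up(0x400000) == 0x400000

# Shape fix: instead of one stacked tensor of shape [2, num_blocks, ...],
# keep k and v as separate tensors of shape [num_blocks, ...]
# (example dimensions below are illustrative, not taken from the PR):
stacked_shape = (2, 1024, 128, 16, 128)  # breaks llmdatadist
k_shape = v_shape = stacked_shape[1:]    # (1024, 128, 16, 128) — what it expects
```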

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tested locally by @liziyu179 

- vLLM version: v0.10.2
- vLLM main: 4f02b77de4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Author: Mengqing Cao
Date: 2025-09-18 21:43:22 +08:00 (committed by GitHub)
Parent: acb46f303f
Commit: 367edff5af
3 changed files with 95 additions and 46 deletions


@@ -25,7 +25,6 @@ from torch.distributed import ProcessGroup
 from torch.distributed.distributed_c10d import PrefixStore
 from vllm.logger import logger
 from vllm.platforms import Platform, PlatformEnum
-from vllm.utils import cdiv
 from vllm_ascend.ascend_config import (check_ascend_config, get_ascend_config,
                                        init_ascend_config)
@@ -247,10 +246,6 @@ class NPUPlatform(Platform):
         if cache_config:
             if cache_config.block_size is None:
                 cache_config.block_size = 128
-            else:
-                if not vllm_config.model_config.is_deepseek_mla:
-                    cache_config.block_size = cdiv(cache_config.block_size,
-                                                   64) * 64
             if cache_config.enable_prefix_caching and cache_config.block_size != 128:
                 logger.warning(
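
The hunk above touches the logic that rounded a non-MLA model's `block_size` up to a multiple of 64 using `cdiv` (ceiling division, from `vllm.utils`). A minimal standalone sketch of that arithmetic, with `cdiv` reimplemented locally so the example is self-contained:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, matching the behavior of vllm.utils.cdiv."""
    return -(-a // b)

# Rounding a block size up to the nearest multiple of 64, as the removed
# branch did for non-MLA models:
for requested, expected in [(16, 64), (64, 64), (100, 128), (128, 128)]:
    assert cdiv(requested, 64) * 64 == expected
```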