[HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007)

### What this PR does / why we need it?
This PR fixes a few issues in prefill disaggregation:
1. Fix a kvcache address alignment issue: llmdatadist requires the
addresses of tensors to be aligned to 2 MB.
2. Fix a kvcache shape error: llmdatadist requires k/v tensors of shape
[num_blocks, ...], but the implementation before this PR used
[2, num_blocks, ...], which breaks prefill disaggregation.
3. Use the hybrid kv cache only when running qwen3_next, to fix an
accuracy issue in prefill disaggregation.
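
The first two fixes can be illustrated with a minimal arithmetic sketch (hypothetical helper names, not the actual vllm-ascend code): llmdatadist needs each kv-cache tensor's base address rounded up to a 2 MB boundary, and expects k and v as separate `[num_blocks, ...]` tensors rather than one stacked `[2, num_blocks, ...]` tensor.

```python
ALIGN = 2 * 1024 * 1024  # llmdatadist requires 2 MB-aligned tensor addresses

def align_up(addr: int, align: int = ALIGN) -> int:
    """Round an address up to the next alignment boundary (hypothetical helper)."""
    return (addr + align - 1) // align * align

# An allocation landing at 0x3F0000 must start at the next 2 MB mark, 0x400000;
# already-aligned addresses are left unchanged.
assert align_up(0x3F0000) == 0x400000
assert align_up(0x400000) == 0x400000

# Shape fix: instead of one stacked tensor of shape [2, num_blocks, ...],
# keep k and v as separate tensors of shape [num_blocks, ...]
# (example dimensions below are illustrative, not taken from the PR):
stacked_shape = (2, 1024, 128, 16, 128)  # breaks llmdatadist
k_shape = v_shape = stacked_shape[1:]    # (1024, 128, 16, 128) — what it expects
```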

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tested locally by @liziyu179 

- vLLM version: v0.10.2
- vLLM main: 4f02b77de4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Author: Mengqing Cao
Date: 2025-09-18 21:43:22 +08:00 (committed by GitHub)
Parent: acb46f303f
Commit: 367edff5af
3 changed files with 95 additions and 46 deletions


@@ -25,7 +25,6 @@ from torch.distributed import ProcessGroup
 from torch.distributed.distributed_c10d import PrefixStore
 from vllm.logger import logger
 from vllm.platforms import Platform, PlatformEnum
-from vllm.utils import cdiv
 from vllm_ascend.ascend_config import (check_ascend_config, get_ascend_config,
                                        init_ascend_config)
@@ -247,10 +246,6 @@ class NPUPlatform(Platform):
         if cache_config:
             if cache_config.block_size is None:
                 cache_config.block_size = 128
-            else:
-                if not vllm_config.model_config.is_deepseek_mla:
-                    cache_config.block_size = cdiv(cache_config.block_size,
-                                                   64) * 64
             if cache_config.enable_prefix_caching and cache_config.block_size != 128:
                 logger.warning(
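
The hunk above touches the logic that rounded a non-MLA model's `block_size` up to a multiple of 64 using `cdiv` (ceiling division, from `vllm.utils`). A minimal standalone sketch of that arithmetic, with `cdiv` reimplemented locally so the example is self-contained:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, matching the behavior of vllm.utils.cdiv."""
    return -(-a // b)

# Rounding a block size up to the nearest multiple of 64, as the removed
# branch did for non-MLA models:
for requested, expected in [(16, 64), (64, 64), (100, 128), (128, 128)]:
    assert cdiv(requested, 64) * 64 == expected
```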