[Bugfix] Fix zero attention output in qwen3-next (#3572)
### What this PR does / why we need it?
Since Attention and LinearAttention share the same `slot_mapping`, and the `slot_mapping` for LinearAttention is all zeros, the `slot_mapping` for Attention gets overwritten, so the computed attention output is all zeros. This PR removes the uniformly managed `self.slot_mapping` and instead passes the `slot_mapping` from `input_batch.blocktable` directly to `attn_metadata`, updating the relevant references. Due to hardware constraints, the dtype of `block_table.slot_mapping` needs to be int32.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: QilaiZhang <245706640@qq.com>
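For illustration, here is a minimal, runnable sketch of the failure mode and of the shape of the fix. All names (`NUM_TOKENS`, `build_attention_metadata`, `build_linear_attention_metadata`, the slot values) are hypothetical stand-ins, not the actual vllm-ascend code:

```python
import torch

NUM_TOKENS = 8

# Buggy pattern: one slot_mapping buffer shared by both attention types.
shared_slot_mapping = torch.empty(NUM_TOKENS, dtype=torch.int32)

def build_attention_metadata(slot_mapping: torch.Tensor) -> torch.Tensor:
    # Real slot indices, as they would come from the block table
    # (values illustrative).
    slot_mapping.copy_(torch.arange(100, 100 + NUM_TOKENS, dtype=torch.int32))
    return slot_mapping

def build_linear_attention_metadata(slot_mapping: torch.Tensor) -> torch.Tensor:
    # LinearAttention does not use paged KV slots, so its mapping is all zeros.
    slot_mapping.zero_()
    return slot_mapping

attn = build_attention_metadata(shared_slot_mapping)
linear = build_linear_attention_metadata(shared_slot_mapping)
assert attn.data_ptr() == linear.data_ptr()  # same underlying buffer
print(attn)  # all zeros: Attention's mapping was clobbered

# Fixed pattern: each attention group takes slot_mapping from its own
# block table, so the buffers are independent and nothing is overwritten.
attn_slots = torch.arange(100, 100 + NUM_TOKENS, dtype=torch.int32)
linear_slots = torch.zeros(NUM_TOKENS, dtype=torch.int32)
print(attn_slots)  # real mapping preserved
```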
```diff
@@ -83,7 +83,7 @@ class BlockTable:
                                              pin_memory=self.pin_memory)
         self.slot_mapping_np = self.slot_mapping_cpu.numpy()
         self.slot_mapping = torch.zeros(self.max_num_batched_tokens,
-                                        dtype=torch.int64,
+                                        dtype=torch.int32,
                                         device=self.device)
         try:
             self.pcp_world_size = get_pcp_group(
```
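As context for the `int64` to `int32` change above, a small sanity check (the helper `slot_mapping_dtype` is hypothetical, not part of the PR): a slot index is bounded by `num_blocks * block_size - 1`, which fits comfortably in int32 for realistic KV-cache sizes, so narrowing the dtype to satisfy the hardware loses no information.

```python
import torch

# Hypothetical helper (not in the PR): pick the narrowest dtype that can
# represent every possible slot index for a given KV cache.
def slot_mapping_dtype(num_blocks: int, block_size: int) -> torch.dtype:
    max_slot = num_blocks * block_size - 1  # largest possible slot index
    if max_slot <= torch.iinfo(torch.int32).max:
        return torch.int32  # what the hardware expects, per this PR
    return torch.int64

# e.g. 4096 blocks of 128 slots -> max index 524,287, well within int32.
print(slot_mapping_dtype(num_blocks=4096, block_size=128))  # torch.int32
```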