[Bugfix] Fix zero attention output in qwen3-next (#3572)
### What this PR does / why we need it?
Since Attention and LinearAttention share the same `slot_mapping`, and the `slot_mapping` for LinearAttention is all zeros, the `slot_mapping` for Attention gets overwritten, so the computed attention output is all zeros. This PR removes the uniformly managed `self.slot_mapping` and instead passes the `slot_mapping` from `input_batch.blocktable` directly to `attn_metadata`, updating the relevant references. Due to hardware constraints, the dtype of `block_table.slot_mapping` needs to be int32.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: QilaiZhang <245706640@qq.com>
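For illustration, here is a minimal, runnable sketch of the failure mode and of the shape of the fix. All names (`NUM_TOKENS`, `build_attention_metadata`, `build_linear_attention_metadata`, the slot values) are hypothetical stand-ins, not the actual vllm-ascend code:

```python
import torch

NUM_TOKENS = 8

# Buggy pattern: one slot_mapping buffer shared by both attention types.
shared_slot_mapping = torch.empty(NUM_TOKENS, dtype=torch.int32)

def build_attention_metadata(slot_mapping: torch.Tensor) -> torch.Tensor:
    # Real slot indices, as they would come from the block table
    # (values illustrative).
    slot_mapping.copy_(torch.arange(100, 100 + NUM_TOKENS, dtype=torch.int32))
    return slot_mapping

def build_linear_attention_metadata(slot_mapping: torch.Tensor) -> torch.Tensor:
    # LinearAttention does not use paged KV slots, so its mapping is all zeros.
    slot_mapping.zero_()
    return slot_mapping

attn = build_attention_metadata(shared_slot_mapping)
linear = build_linear_attention_metadata(shared_slot_mapping)
assert attn.data_ptr() == linear.data_ptr()  # same underlying buffer
print(attn)  # all zeros: Attention's mapping was clobbered

# Fixed pattern: each attention group takes slot_mapping from its own
# block table, so the buffers are independent and nothing is overwritten.
attn_slots = torch.arange(100, 100 + NUM_TOKENS, dtype=torch.int32)
linear_slots = torch.zeros(NUM_TOKENS, dtype=torch.int32)
print(attn_slots)  # real mapping preserved
```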
```diff
@@ -83,7 +83,7 @@ class BlockTable:
                                              pin_memory=self.pin_memory)
         self.slot_mapping_np = self.slot_mapping_cpu.numpy()
         self.slot_mapping = torch.zeros(self.max_num_batched_tokens,
-                                        dtype=torch.int64,
+                                        dtype=torch.int32,
                                         device=self.device)
         try:
             self.pcp_world_size = get_pcp_group(
```
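As context for the `int64` to `int32` change above, a small sanity check (the helper `slot_mapping_dtype` is hypothetical, not part of the PR): a slot index is bounded by `num_blocks * block_size - 1`, which fits comfortably in int32 for realistic KV-cache sizes, so narrowing the dtype to satisfy the hardware loses no information.

```python
import torch

# Hypothetical helper (not in the PR): pick the narrowest dtype that can
# represent every possible slot index for a given KV cache.
def slot_mapping_dtype(num_blocks: int, block_size: int) -> torch.dtype:
    max_slot = num_blocks * block_size - 1  # largest possible slot index
    if max_slot <= torch.iinfo(torch.int32).max:
        return torch.int32  # what the hardware expects, per this PR
    return torch.int64

# e.g. 4096 blocks of 128 slots -> max index 524,287, well within int32.
print(slot_mapping_dtype(num_blocks=4096, block_size=128))  # torch.int32
```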