[Bugfix] Fix zero attention output in qwen3-next (#3572)

### What this PR does / why we need it?
Attention and LinearAttention currently share the same `slot_mapping` buffer. Because the
`slot_mapping` for LinearAttention is all zeros, it overwrites the `slot_mapping` written by
Attention, so the computed attention output comes back as all zeros.

This PR removes the centrally managed `self.slot_mapping` and instead passes the
`slot_mapping` from `input_batch.block_table` directly to `attn_metadata`, updating the
relevant references accordingly. Due to hardware constraints, the dtype of
`block_table.slot_mapping` must be int32.
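Conceptually, the fix makes each metadata builder read the slot mapping straight from the block table rather than through a shared buffer. The sketch below shows the idea only; the class and field names here are illustrative, not the exact vLLM-Ascend signatures:

```python
import torch
from dataclasses import dataclass

@dataclass
class AttnMetadata:
    # Each backend's metadata now carries its own slot_mapping tensor.
    slot_mapping: torch.Tensor

def build_metadata(block_table, num_tokens: int) -> AttnMetadata:
    # Take the slot mapping straight from input_batch.block_table instead of
    # a shared self.slot_mapping buffer, so another backend cannot overwrite
    # it; the hardware expects int32 indices here.
    slots = block_table.slot_mapping[:num_tokens].to(torch.int32)
    return AttnMetadata(slot_mapping=slots)
```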

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: QilaiZhang <245706640@qq.com>
Author: QilaiZhang
Date: 2025-10-25 09:47:03 +08:00 (committed by GitHub)
Parent: e33751ef8b
Commit: d30bb95b90
3 changed files with 13 additions and 21 deletions

```diff
@@ -83,7 +83,7 @@ class BlockTable:
                                              pin_memory=self.pin_memory)
         self.slot_mapping_np = self.slot_mapping_cpu.numpy()
         self.slot_mapping = torch.zeros(self.max_num_batched_tokens,
-                                        dtype=torch.int64,
+                                        dtype=torch.int32,
                                         device=self.device)
         try:
             self.pcp_world_size = get_pcp_group(
```