[Bugfix] Fix zero attention output in qwen3-next (#3572)
### What this PR does / why we need it?

Attention and LinearAttention share the same ```slot_mapping```, and the ```slot_mapping``` for LinearAttention is all zeros, so the ```slot_mapping``` for Attention gets overwritten and the computed attention output is all zeros. This PR removes the uniformly managed ```self.slot_mapping``` and instead passes the ```slot_mapping``` from ```input_batch.block_table``` directly to ```attn_metadata```, updating the relevant references accordingly. Due to hardware constraints, the data type of ```block_table.slot_mapping``` needs to be set to int32.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI passed with existing tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: QilaiZhang <245706640@qq.com>
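The sketch below illustrates the idea behind the fix, not the actual vllm-ascend code: attention metadata takes its ```slot_mapping``` directly from the per-group block table instead of a runner-wide ```self.slot_mapping``` buffer that Attention and LinearAttention would otherwise overwrite, and the indices are cast to int32. Names such as ```AttnMetadataSketch``` and ```build_attn_metadata``` are hypothetical stand-ins; ```input_batch.block_table[0].slot_mapping``` and ```get_device_tensor()``` appear in the diff below.

```python
from dataclasses import dataclass
import torch


@dataclass
class AttnMetadataSketch:
    # Hypothetical container standing in for the real attn_metadata object.
    block_table_tensor: torch.Tensor
    slot_mapping: torch.Tensor
    positions: torch.Tensor


def build_attn_metadata(runner, num_tokens: int) -> AttnMetadataSketch:
    block_table = runner.input_batch.block_table[0]
    # Read the slot mapping owned by this block table (rather than a shared
    # runner buffer) and cast to int32, which the target hardware expects.
    slot_mapping = block_table.slot_mapping[:num_tokens].to(torch.int32)
    return AttnMetadataSketch(
        block_table_tensor=block_table.get_device_tensor(),
        slot_mapping=slot_mapping,
        positions=runner.positions[:num_tokens],
    )
```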
```diff
@@ -359,7 +359,8 @@ class EagleProposer(Proposer):
             actual_seq_lengths_q=self.runner.actual_seq_lengths_q,
             block_table_tensor=self.runner.input_batch.block_table[0].
             get_device_tensor(),
-            slot_mapping=self.runner.slot_mapping,
+            slot_mapping=self.runner.input_batch.block_table[0].
+            slot_mapping,
             positions=self.runner.positions,
             attn_mask=self.runner.attn_mask,
             spec_attn_mask=self.runner.spec_attn_mask,
```