xc-llm-ascend/vllm_ascend
Jade Zheng 8b9ca86827 [Feature] Remove the transpose step after attention and switch to transpose_batchmatmul (#5390)
1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. By selecting the appropriate layout, we can avoid the
transpose operation typically required after the attention computation.
2. The `transpose_batchmatmul` function lets us control whether the
output tensor is transposed. If we configure `perm_y`, the additional
transpose after executing `v_up` becomes unnecessary. A plain-PyTorch
sketch of both ideas is shown below.
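
For illustration only, here is a minimal plain-PyTorch sketch of the two ideas. The actual torch_npu kernels (`npu_fused_infer_attention_score`, `transpose_batchmatmul`) and their layout/`perm_y` arguments are only paraphrased from the description above; the shapes, names, and the use of `einsum` as a stand-in for the fused permuted-output matmul are assumptions, not the real vllm-ascend code.

```python
# Sketch of the two optimizations described above, using plain PyTorch ops
# as stand-ins for the NPU fused kernels. Names and shapes are illustrative.
import torch

B, S, N, D = 2, 16, 8, 64          # batch, seq len, num heads, head dim

# (1) Output-layout selection after attention.
# A kernel that emits (B, N, S, D) forces a transpose + reshape to reach the
# (B, S, N*D) layout the next projection expects:
attn_bnsd = torch.randn(B, N, S, D)
out_via_transpose = attn_bnsd.transpose(1, 2).reshape(B, S, N * D)
# If the kernel can be asked to emit (B, S, N, D) directly, the transpose
# disappears and only a view remains:
attn_bsnd = attn_bnsd.transpose(1, 2).contiguous()   # stand-in for the kernel output
out_direct = attn_bsnd.reshape(B, S, N * D)
assert torch.allclose(out_via_transpose, out_direct)

# (2) Batch matmul with a permuted output (the perm_y idea).
# Baseline: batched matmul followed by an explicit transpose of the result.
x = torch.randn(N, S, D)           # per-head activations
w_up = torch.randn(N, D, D)        # per-head up-projection (the "v_up" weight)
y_then_transpose = torch.matmul(x, w_up).transpose(0, 1)      # (S, N, D)
# A fused kernel that accepts an output permutation writes (S, N, D) directly;
# einsum stands in for that fused behaviour here:
y_permuted = torch.einsum('nsd,nde->sne', x, w_up)
assert torch.allclose(y_then_transpose, y_permuted, atol=1e-5)
```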

- vLLM version: release/v0.13.0
- vLLM main: 254f6b9867

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-26 22:03:46 +08:00