xc-llm-ascend

Files

Jade Zheng 8b9ca86827 [Feature] Remove the transpose step after attention and switch to transpose_batchmatmul (#5390)

1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. By selecting the appropriate layout, we can avoid the
transpose operation typically required after attention.
2. The `transpose_batchmatmul` function lets us control whether the
output tensor is transposed. If we configure `perm_y`, the additional
transpose after executing `v_up` becomes unnecessary.
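
The layout idea in the two points above can be sketched in plain NumPy. This is only an illustration of the numerical equivalence, not the NPU kernels: the function names, tensor shapes, and the `[heads, tokens, latent]` input layout are assumptions made for the example, and `np.einsum` merely stands in for a batched matmul that writes its output in a permuted layout.

```python
import numpy as np


def v_up_with_transpose(attn_out, w_uv):
    """Baseline: batched matmul followed by an explicit transpose.

    attn_out: [heads, tokens, latent] (assumed layout)
    w_uv:     [heads, latent, v_dim]  (assumed layout)
    """
    out = np.matmul(attn_out, w_uv)        # [heads, tokens, v_dim]
    out = out.transpose(1, 0, 2)           # [tokens, heads, v_dim]
    return out.reshape(out.shape[0], -1)   # [tokens, heads * v_dim]


def v_up_permuted_output(attn_out, w_uv):
    """Same product, but the output layout is requested directly.

    The einsum output subscripts "tnv" play the role of an output
    permutation, so no separate transpose step is written afterwards.
    """
    out = np.einsum("ntl,nlv->tnv", attn_out, w_uv)  # [tokens, heads, v_dim]
    return out.reshape(out.shape[0], -1)             # [tokens, heads * v_dim]


rng = np.random.default_rng(0)
attn_out = rng.standard_normal((4, 3, 8))  # 4 heads, 3 tokens, latent dim 8
w_uv = rng.standard_normal((4, 8, 5))      # per-head up-projection to v_dim 5

baseline = v_up_with_transpose(attn_out, w_uv)
fused = v_up_permuted_output(attn_out, w_uv)
assert np.allclose(baseline, fused)
```

Both functions return the same `[tokens, heads * v_dim]` tensor; the second simply never materializes the intermediate `[heads, tokens, v_dim]` result as a separate step, which is the effect the commit attributes to choosing the output layout and configuring `perm_y`.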

- vLLM version: release/v0.13.0
- vLLM main: 254f6b9867

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>

2025-12-26 22:03:46 +08:00

__init__.py

[Core] Make V1 work and enable V1 engine test (#389)

2025-03-28 19:34:23 +08:00

attention_cp.py

[Bugfix] Fix Qwen P/D Disaggregation accuracy issue (#5340)

2025-12-25 22:46:08 +08:00

attention_mask.py

[Model] Support pooling models (#3122)

2025-12-10 11:37:57 +08:00

attention_v1.py

[bugfix] Fix MHA model runtime error in aclgraph mode (#5397)