1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. Selecting the appropriate layout avoids the transpose
operation that is otherwise required after the attention computation.
2. The `transpose_batchmatmul` function lets us control whether the
output tensor is transposed. Configuring `perm_y` makes the additional
transpose after the `v_up` projection unnecessary.
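Both optimizations rely on the same idea: having the kernel write its result directly in the layout the next operation expects, instead of materializing an intermediate layout and transposing it. A minimal NumPy sketch of the attention case (a stand-in illustration, not the torch_npu API; the BNSD/BSND layout names and shapes here are assumptions for the example):

```python
import numpy as np

# Shapes: batch B, heads N, sequence S, head dim D (illustrative values).
B, N, S, D = 2, 4, 8, 16
rng = np.random.default_rng(0)
probs = rng.random((B, N, S, S))  # attention weights, [B, N, S, S]
v = rng.random((B, N, S, D))      # value tensor,      [B, N, S, D]

# Default path: the matmul emits BNSD, so an explicit transpose to BSND
# is needed before the output projection.
out_bnsd = probs @ v                                # [B, N, S, D]
out_via_transpose = out_bnsd.transpose(0, 2, 1, 3)  # [B, S, N, D]

# Layout-aware path: write the BSND layout directly, no transpose step.
out_direct = np.einsum("bnst,bntd->bsnd", probs, v)  # [B, S, N, D]

assert np.allclose(out_via_transpose, out_direct)
```

The two paths produce identical values; a layout-aware kernel simply folds the transpose into how it writes its output, removing one memory-bound pass over the tensor.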
- vLLM version: release/v0.13.0
- vLLM main: 254f6b9867
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>