xc-llm-ascend/vllm_ascend at a813eadd2d2fb5d4f6179fbed860aaebfe2b3db6

EngineX/xc-llm-ascend

Shanshan Shen a813eadd2d [MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017 )

### What this PR does / why we need it?
Currently, we use the linear method in
e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)
for convolution computation, which is used in patch embedding for VL
models.

After profiling, we found that this linear method takes about **6.87
ms**, which is much slower than simply using `F.conv3d()`. On Ascend
NPU, `F.conv3d()` calls the optimized aclnn `BatchMatMulV2` operator,
which takes only about **2.50 ms**, making it **2.7x faster** than the
linear method.
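
The two paths can be swapped because a patch-embedding convolution with stride equal to its kernel size is the same as a matrix multiply over flattened, non-overlapping patches. The sketch below checks that equivalence in pure Python on small integer tensors. It is an illustration only: the helper names (`conv3d_non_overlapping`, `patch_linear`, `random_tensor`) and the tensor shapes are hypothetical, it does not reproduce the vLLM `conv.py` code or the aclnn `BatchMatMulV2` kernel, and it says nothing about performance.

```python
import itertools
import random


def random_tensor(rng, shape):
    """Build a nested list of small random integers with the given shape."""
    if not shape:
        return rng.randint(-3, 3)
    return [random_tensor(rng, shape[1:]) for _ in range(shape[0])]


def conv3d_non_overlapping(x, weight, kernel):
    """Direct 3D convolution with stride == kernel size, no padding, no bias.

    x has shape [C_in][T][H][W]; weight has shape [C_out][C_in][kt][kh][kw].
    Returns one row of C_out values per output position.
    """
    kt, kh, kw = kernel
    c_in, t, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    out = []
    for ot, oh, ow in itertools.product(range(t // kt), range(h // kh), range(w // kw)):
        row = []
        for w_o in weight:
            acc = 0
            for c, dt, dh, dw in itertools.product(range(c_in), range(kt), range(kh), range(kw)):
                acc += x[c][ot * kt + dt][oh * kh + dh][ow * kw + dw] * w_o[c][dt][dh][dw]
            row.append(acc)
        out.append(row)
    return out


def patch_linear(x, weight, kernel):
    """Same computation expressed as: flatten each patch, then matrix multiply."""
    kt, kh, kw = kernel
    c_in, t, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    offsets = list(itertools.product(range(c_in), range(kt), range(kh), range(kw)))
    # Flatten the conv weight into a [C_out][C_in * kt * kh * kw] matrix.
    flat_weight = [[w_o[c][dt][dh][dw] for c, dt, dh, dw in offsets] for w_o in weight]
    out = []
    for ot, oh, ow in itertools.product(range(t // kt), range(h // kh), range(w // kw)):
        # Flatten one non-overlapping patch in the same (c, dt, dh, dw) order.
        patch = [x[c][ot * kt + dt][oh * kh + dh][ow * kw + dw] for c, dt, dh, dw in offsets]
        out.append([sum(p * q for p, q in zip(patch, w_row)) for w_row in flat_weight])
    return out


rng = random.Random(0)
kernel = (2, 2, 2)
x = random_tensor(rng, (3, 2, 4, 4))          # [C_in=3][T=2][H=4][W=4]
weight = random_tensor(rng, (5, 3, 2, 2, 2))  # [C_out=5][C_in=3][2][2][2]

conv_out = conv3d_non_overlapping(x, weight, kernel)
linear_out = patch_linear(x, weight, kernel)
assert conv_out == linear_out
```

Since both formulations give identical results, choosing between them is purely a question of which backend kernel runs faster, which is what the profiling numbers above measure.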

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: shen-shanshan <467638484@qq.com>

2026-03-06 14:26:37 +08:00

[300I][Bugfix] fix unquant model weight nd2nz error (#6851 )

2026-03-03 15:57:26 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030 )

2026-03-06 11:24:05 +08:00

[BugFix] Fix muls_add fusion not working for GLM5 models (#6928 )

2026-03-05 22:35:54 +08:00

[v0.16.0][P/D][Bugfix] Support ALL D-Nodes in fullgraph when running MTP in PD (#6948 )

2026-03-06 10:01:33 +08:00

[misc] move mxfp_compat into device to decouple from quantization init chain (#6918 )

2026-03-02 18:17:01 +08:00

device_allocator

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

【main】ADXL/HIXL supports FabricMem Mode (#6806 )

2026-03-05 21:04:11 +08:00

[EPLB] Display the expert hotness comparison before and after eplb. (#6877 )

2026-03-06 09:53:29 +08:00

[Main2Main] Upgrade vLLM to 0226 (#6813 )

2026-02-27 16:05:21 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #5 ) (#5996 )

2026-01-24 22:45:38 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6 ) (#6001 )

2026-01-24 22:08:33 +08:00

[MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017 )

2026-03-06 14:26:37 +08:00

[Main2Main] Upgrade vLLM to 0303 (#6944 )

2026-03-06 09:08:52 +08:00

[bugfix]Qwen-Omni quantization model_type bugfix (#7007 )

2026-03-05 16:34:34 +08:00

[Feature] Add docs of batch invariance and make some extra operators patch (#6910 )

2026-03-05 09:12:40 +08:00

Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030 )

2026-03-06 11:24:05 +08:00

Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030 )

2026-03-06 11:24:05 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

__init__.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

ascend_config.py

[Feature] Add docs of batch invariance and make some extra operators patch (#6910 )

2026-03-05 09:12:40 +08:00

ascend_forward_context.py

add mxfp8 moe quantization (#6670 )

2026-03-02 11:04:06 +08:00

batch_invariant.py

[Feature] Add docs of batch invariance and make some extra operators patch (#6910 )

2026-03-05 09:12:40 +08:00

cpu_binding.py

[CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (#6945 )

2026-03-03 17:20:52 +08:00

envs.py

[MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618 )

2026-02-09 15:38:58 +08:00

flash_common3_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

meta_registration.py

[Ops][Refactor] Remove custom rotary_embedding operator (#6523 )

2026-02-07 09:24:05 +08:00

platform.py

[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 )

2026-03-02 17:54:25 +08:00

profiling_config.py

[Core][Misc] Clean up ProfileExecuteDuration (#6461 )

2026-02-01 20:06:01 +08:00

utils.py

[MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017 )

2026-03-06 14:26:37 +08:00