xc-llm-ascend/ops at 87d6424b2ee301469381d99d41936a37bc0eec91

EngineX/xc-llm-ascend

chenxi-hh 42bcad7e9b GMM custom operator optimization in small batch scenarios (#7100)

### What this PR does / why we need it?
GMM custom operator optimization in small batch scenarios

### How was this patch tested?

Qwen3-30B, input length 4k, output length 1k

batch 1:
TPOT 7.9 ms -> 7.0 ms
Output Token Throughput 125.4651 token/s -> 140.6278 token/s

batch 2:
TPOT 9.4 ms -> 8.8 ms
Output Token Throughput 211.8187 token/s -> 225.2254 token/s

batch 16:
TPOT 13.6 ms -> 13.5 ms
Output Token Throughput 1159.8213 token/s -> 1165.0982 token/s
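
The before/after figures above can be turned into relative changes to see how the gain shrinks as the batch grows. The following is a minimal sketch, not part of the patch: the `results` layout and the helper names `relative_change` and `summarize` are assumptions made for illustration, and the numbers are copied verbatim from the measurements listed above.

```python
# Figures copied from the measurements above, as (before, after) pairs.
# The dictionary layout and helper names are hypothetical, for illustration only.
results = {
    1: {"tpot_ms": (7.9, 7.0), "throughput": (125.4651, 140.6278)},
    2: {"tpot_ms": (9.4, 8.8), "throughput": (211.8187, 225.2254)},
    16: {"tpot_ms": (13.6, 13.5), "throughput": (1159.8213, 1165.0982)},
}


def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


def summarize(data: dict) -> dict:
    """Compute the percentage change of each metric for every batch size."""
    summary = {}
    for batch, metrics in data.items():
        summary[batch] = {
            # Negative values mean lower latency per output token (better).
            "tpot_change_pct": relative_change(*metrics["tpot_ms"]),
            # Positive values mean more output tokens per second (better).
            "throughput_change_pct": relative_change(*metrics["throughput"]),
        }
    return summary


if __name__ == "__main__":
    for batch, s in summarize(results).items():
        print(
            f"batch {batch}: TPOT {s['tpot_change_pct']:+.2f}%, "
            f"throughput {s['throughput_change_pct']:+.2f}%"
        )
```

On these numbers the throughput gain is roughly 12% at batch 1 and well under 1% at batch 16, which is consistent with the PR's stated focus on small batch scenarios.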

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: chenxi-hh <chen464822955@163.com>

2026-03-19 16:10:30 +08:00

GMM custom operator optimization in small batch scenarios (#7100)

2026-03-19 16:10:30 +08:00

[Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103)

2026-03-15 09:44:09 +08:00

__init__.py

[OPS]add split_qkv_rmsnorm_mrope ops (#6730)

2026-03-06 16:18:37 +08:00

activation.py

[Attention] add gpt-oss support (#5901)

2026-02-12 10:55:34 +08:00

conv.py

[MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017)

2026-03-06 14:26:37 +08:00

flashcomm2_oshard_manager.py

[Lint]Style: Convert vllm-ascend/ to ruff format(new Batch #8) (#6604)

2026-02-07 09:16:07 +08:00

layer_shard_linear.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10) (#6173)

2026-02-06 15:35:06 +08:00

layernorm.py

[Perf] Optimize bias handling in AscendRMSNorm (#7226)

2026-03-17 16:53:28 +08:00

linear_op.py

[Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)

2026-03-15 17:55:42 +08:00

linear.py

[Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)

2026-03-15 17:55:42 +08:00

mla.py

[Feature]Supports DSv3.1 PD separation and C8 quantization (#7222)

2026-03-16 22:49:05 +08:00

mm_encoder_attention.py

[Main2Main] Upgrade vLLM to 0303 (#6944)

2026-03-06 09:08:52 +08:00

register_custom_ops.py

[Feature] support aclgraph for model runner v2 (#7110)

2026-03-13 09:11:46 +08:00

rotary_embedding.py

[Feature] support aclgraph for model runner v2 (#7110)

2026-03-13 09:11:46 +08:00

vocab_parallel_embedding.py

[Lint]Style: Convert vllm-ascend/ to ruff format(new Batch #8) (#6604)

2026-02-07 09:16:07 +08:00

weight_prefetch.py

[Misc] Drop Prefetch MLP Env (#7357)

2026-03-19 14:27:27 +08:00