xc-llm-ascend/worker at 59a75263396ad0889aa2c85c25438a18cb16640f

EngineX/xc-llm-ascend

Wangbei25 4f259d4fd8 [Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737 )

### What this PR does / why we need it?
Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder, and add the
DeepSeekOCR2.md doc.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vllm 0.18.0
- vllm-ascend main

1. `_create_custom_4d_mask`: 141ms49us620ns --> `_create_npu_optimized_mask`: 1ms227us780ns
2. `conv2d`: 27ms --> `matmul`: <1ms
3. RelPosAttention: `sdpa` --> `prompt_flash_attention`
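Item 2 replaces a `conv2d` with a `matmul`. The commit does not show the patched code, so the sketch below rests on an assumption: the convolution is patch-embedding style (`kernel_size == stride`, no padding, no dilation, `groups=1`). Under that assumption each output position is a dot product over one non-overlapping patch, so the whole layer collapses into "unfold into patches" plus a single matrix multiply. All function names here are hypothetical, and plain Python lists stand in for torch/torch_npu tensors so the example runs anywhere.

```python
# Hypothetical sketch (not the vllm-ascend code): a conv2d with
# kernel_size == stride == k, no padding, no dilation and groups=1
# equals "split into k x k patches" followed by one matmul.
# Tensors are nested Python lists so the example has no dependencies.


def conv2d_patchwise(x, w, k):
    """Reference conv2d. x: [C_in][H][W], w: [C_out][C_in][k][k].

    Returns [C_out][H // k][W // k].
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    out = []
    for co in range(len(w)):
        plane = []
        for i in range(height // k):
            row = []
            for j in range(width // k):
                acc = 0.0
                for ci in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            acc += x[ci][i * k + di][j * k + dj] * w[co][ci][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def unfold_patches(x, k):
    """Flatten each non-overlapping k x k patch into one row: [P][C_in*k*k]."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [x[ci][i * k + di][j * k + dj]
         for ci in range(c_in) for di in range(k) for dj in range(k)]
        for i in range(height // k)
        for j in range(width // k)
    ]


def matmul(a, b):
    """Plain [m][n] @ [n][p] -> [m][p]."""
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]


def conv2d_as_matmul(x, w, k):
    c_in = len(x)
    patches = unfold_patches(x, k)                       # [P][C_in*k*k]
    # Flatten each filter in the same (ci, di, dj) order as the patches.
    w_flat = [[w[co][ci][di][dj]
               for ci in range(c_in) for di in range(k) for dj in range(k)]
              for co in range(len(w))]                   # [C_out][C_in*k*k]
    w_t = [list(col) for col in zip(*w_flat)]            # [C_in*k*k][C_out]
    y = matmul(patches, w_t)                             # [P][C_out]
    # Reshape [P][C_out] back to [C_out][H // k][W // k].
    hp, wp = len(x[0]) // k, len(x[0][0]) // k
    return [[[y[i * wp + j][co] for j in range(wp)] for i in range(hp)]
            for co in range(len(w))]


# Usage: 2 input channels, 4x4 image, 3 filters of size 2x2.
x = [[[float(c * 16 + r * 4 + col) for col in range(4)] for r in range(4)]
     for c in range(2)]
w = [[[[float((o + ci + di - dj) % 3) for dj in range(2)] for di in range(2)]
      for ci in range(2)] for o in range(3)]
assert conv2d_patchwise(x, w, 2) == conv2d_as_matmul(x, w, 2)
```

On real hardware the same rewrite would presumably be a reshape/permute of the input followed by one large GEMM, which tends to map onto matrix units better than a generic convolution kernel; that is an inference about why the timing dropped, not something the commit states.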

---------

Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Signed-off-by: Wangbei25 <wangbei41@huawei.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>

2026-03-31 14:49:29 +08:00

..

adapt to main2main for model runner v2 (#7578 )

2026-03-25 09:08:44 +08:00

__init__.py

[Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737 )

2026-03-31 14:49:29 +08:00

patch_bert.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_cudagraph.py

[main][bugfix] Fixed the problem of speculative decoding in FULL mode (#7148 )

2026-03-12 14:51:12 +08:00

patch_deepencoder2.py

[Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737 )

2026-03-31 14:49:29 +08:00

patch_deepseek_mtp.py

[MTP][Bugfix] Fix GLM5-W8A8 precision issues caused by rotary quant MTP weights (#7139 )

2026-03-12 20:01:24 +08:00

patch_distributed.py

[Feat][SP] Support SP for VL MoE models (#7044 )

2026-03-24 17:16:00 +08:00

patch_draft_quarot.py

[main][feature] Support quarot for eagle3 without embedding (#7038 )

2026-03-09 10:43:06 +08:00

patch_gdn_attn.py

[Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487 )

2026-03-22 23:09:23 +08:00

patch_huanyuan_vl.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_kimi_k25.py

[Bugfix] Fix get_rope_shape for Kimi-K2.5 (#7521 )

2026-03-22 21:06:31 +08:00

patch_mamba_utils.py

[Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103 )

2026-03-15 09:44:09 +08:00

patch_minimax_m2_linear_attn.py

[Model] Support Minimax-m2.5 on NPU (#7105 )

2026-03-11 00:12:02 +08:00

patch_minimax_m2.py

[v0.18.0] Apply Eagle3 to MiniMax-M2.5 (#7619 ) (#7714 )

2026-03-27 18:33:29 +08:00

patch_module.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_multimodal_merge.py

[bugfix]Fix parameter ordering bug in _merge_multimodal_embeddings (#7068 )

2026-03-09 16:05:52 +08:00

patch_npugraph_ex_triton.py

[npugraph_ex]enable npugraph_ex by default (#6664 )

2026-02-12 08:44:06 +08:00

patch_qwen3_5.py

Qwen3.5 MoE supports flashcomm v1 (#7644 )

2026-03-25 23:09:33 +08:00

patch_qwen3_next_mtp.py

[Main2Main] Upgrade vLLM to 0226 (#6813 )

2026-02-27 16:05:21 +08:00

patch_qwen3_next.py

[Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495 )

2026-03-24 00:07:12 +08:00

patch_rejection_sampler.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_routed_experts_capturer.py

[Feat] Support routing replay (#6696 )

2026-02-26 10:22:47 +08:00

patch_triton.py

adapt to main2main for model runner v2 (#7578 )

2026-03-25 09:08:44 +08:00

patch_unquantized_gemm.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10 ) (#6173 )

2026-02-06 15:35:06 +08:00

patch_weight_utils.py

[Bugfix]Fix deepseek 3.2 C8 precision by rotary tensor (#7537 )

2026-03-25 09:18:00 +08:00