xc-llm-ascend/vllm_ascend at 350b95efcf6ebedaae7b87ba621c38c021b471a8

EngineX/xc-llm-ascend

wangqiankun13 350b95efcf [BugFix]Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series (#5293 )

### What this PR does / why we need it?

Disable the dispatch_gmm_combine_decode operator when the mtp drafter model
uses non-w8a8 quantization while the main model uses w8a8, or when the
drafter model belongs to the eagle series.

For more information about this operator, refer to the RFC issue:
https://github.com/vllm-project/vllm-ascend/issues/5476
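
The condition described above can be sketched as a small predicate. This is only an illustration of the rule as worded in this description, not the code changed by the PR: the `DrafterInfo` class, its fields, the function name, and the string values `"mtp"`, `"eagle"`, and `"w8a8"` are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrafterInfo:
    """Hypothetical description of a drafter model (not a vllm-ascend type)."""
    method: str            # assumed values such as "mtp", "eagle", "eagle3"
    quant: Optional[str]   # assumed values such as "w8a8"; None if unquantized


def drafter_disables_fused_decode_op(main_quant: Optional[str],
                                     drafter: Optional[DrafterInfo]) -> bool:
    """Return True when the drafter rules out dispatch_gmm_combine_decode.

    Only the two cases named in the description are modelled; any other
    requirement the operator may have is outside this sketch.
    """
    if drafter is None:
        # No drafter model, so neither case applies.
        return False
    if drafter.method.startswith("eagle"):
        # Case 2: the drafter belongs to the eagle series.
        return True
    # Case 1: mtp drafter is not w8a8 while the main model is w8a8.
    return (drafter.method == "mtp"
            and main_quant == "w8a8"
            and drafter.quant != "w8a8")
```

Under these assumptions, a w8a8 main model paired with an unquantized mtp drafter returns `True` (operator disabled), while a w8a8 main model paired with a w8a8 mtp drafter returns `False`.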


- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>

2026-01-04 17:51:28 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382 )

2026-01-04 16:33:18 +08:00

[Graph][Fusion] Add AddRMSNorm(with bias) (#5491 )

2025-12-31 17:10:26 +08:00

[CI] fix lint (#5216 )

2025-12-20 17:03:25 +08:00

device_allocator

[Misc]Clean up useless import from vllm (#2049 )

2025-07-28 16:01:59 +08:00

[Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892 ) (revert in #4981 ) (#5511 )

2026-01-04 16:49:33 +08:00

[smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (#5521 )

2025-12-31 09:19:04 +08:00

[BugFix] Fix npu-cpu offloading interface change bug. (#5290 )

2025-12-27 10:21:20 +08:00

[BugFix]Fix precision issue for LoRA feature (#4141 )

2025-12-19 14:22:06 +08:00

[CI] speed up ut (#4901 )

2025-12-11 18:45:43 +08:00

[Feat] enable hierarchical mc2 ops on A2 by default (#5545 )

2026-01-04 14:44:20 +08:00

[Bugfix] Fix mm_merge (#5249 )

2025-12-31 09:49:55 +08:00

[Model] Add LongCat-Flash (#3833 )

2025-12-31 17:06:55 +08:00

[Refactor][Triton] Move reject sample triton kernels into ops/triton (#5324 )

2025-12-29 16:15:41 +08:00

[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477 )

2026-01-04 12:03:21 +08:00

[bugfix](pcp) expand max_num_tokens for pcp pad (#5478 )

2026-01-04 17:25:40 +08:00

[CI] add xlite e2e test (#5305 )

2025-12-25 09:17:06 +08:00

__init__.py

clean up model module (#4611 )

2025-12-02 17:35:47 +08:00

ascend_config.py

[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 )

2025-12-31 14:24:04 +08:00

ascend_forward_context.py

[BugFix]Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series (#5293 )

2026-01-04 17:51:28 +08:00

cpu_binding.py

[main] support cpu binding (#3546 )

2025-10-21 09:17:03 +08:00

envs.py

Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272 )

2025-12-25 11:09:56 +08:00

flash_common3_context.py

[Perf]enable prefill flashcommon3 (#4065 )

2025-12-14 09:34:13 +08:00

meta_registration.py

Fix the bugs about operator registration by PyTorch Dispatcher (#2786 )

2025-09-13 11:58:52 +08:00

platform.py

Cleanup pass config override (#5283 )

2026-01-04 11:52:12 +08:00

profiling_config.py

Drop ascend scheduler (#4623 )

2025-12-05 09:03:45 +08:00

utils.py

[BugFix]Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series (#5293 )

2026-01-04 17:51:28 +08:00