xc-llm-ascend/vllm_ascend at 6d25372baaa0ef018a75b427b387fab8dd2e92b4

EngineX/xc-llm-ascend

Aoxuan Chen 6d25372baa Add MagicMTP(block verify) and Triton optimization (#4443)

### What this PR does / why we need it?
1. Adds MagicMTP, based on the paper "Block Verification Accelerates
Speculative Decoding". It accounts for the interaction among multiple
draft tokens during verification, improving the acceptance rate without
compromising accuracy.
2. Restructures the rejection sampling logic in rejection_sampler.py
using Triton-Ascend so that it can operate under high concurrency,
resolving CPU and NPU operator bottlenecks and improving throughput.
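
For context, the following is a minimal sketch of standard token-by-token rejection sampling for speculative decoding, the baseline that block verification aims to improve on by judging draft tokens jointly. It is not the MagicMTP algorithm and not the Triton-Ascend code in rejection_sampler.py; the function names and the plain-list probability format are hypothetical and chosen only for illustration.

```python
import random


def sample_index(weights, rng):
    """Draw an index with probability proportional to its (unnormalized) weight."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def verify_draft_tokens(draft_tokens, draft_probs, target_probs, rng):
    """Token-by-token rejection sampling (hypothetical helper, baseline only).

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists of floats) the draft tokens were sampled from.
    target_probs: k + 1 target-model distributions; the last one is used for the
                  bonus token when every draft token is accepted.
    Returns the accepted prefix plus one recovered or bonus token.
    """
    output = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i]
        p = target_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 is assumed because tok was sampled from q.
        accept_prob = min(1.0, p[tok] / q[tok])
        if rng.random() < accept_prob:
            output.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0) and stop.
        residual = [max(pj - qj, 0.0) for pj, qj in zip(p, q)]
        output.append(sample_index(residual, rng))
        return output
    # All draft tokens accepted: draw one bonus token from the target model.
    output.append(sample_index(target_probs[len(draft_tokens)], rng))
    return output


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = [[0.5, 0.5], [0.5, 0.5]]
    # Identical draft and target distributions: every draft token is accepted.
    print(verify_draft_tokens([0, 1], uniform, uniform + [[1.0, 0.0]], rng))
```

Each position is decided independently here, so one early rejection discards every later draft token. That per-token independence is the behavior the commit message describes MagicMTP as relaxing.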

### Does this PR introduce _any_ user-facing change?
MagicMTP takes effect automatically when the parameter
"num_speculative_tokens" is >= 3.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: chenaoxuan <cax1165@163.com>

2025-12-25 09:00:25 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804)

2025-11-28 18:06:39 +08:00

Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) (#5317)

2025-12-24 22:24:17 +08:00

[feature] support pcp + mtp in full graph (#4572)

2025-12-22 16:13:39 +08:00

[CI] fix lint (#5216)

2025-12-20 17:03:25 +08:00

device_allocator

[Misc]Clean up useless import from vllm (#2049)

2025-07-28 16:01:59 +08:00

[Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205)

2025-12-22 20:21:45 +08:00

[Misc] Cleanup useless print and logger (#5220)

2025-12-22 11:28:26 +08:00

upgrade vLLM to main (#4608)

2025-12-02 22:10:52 +08:00

[BugFix]Fix precision issue for LoRA feature (#4141)

2025-12-19 14:22:06 +08:00

[CI] speed up ut (#4901)

2025-12-11 18:45:43 +08:00

[Kernel] add l2norm triton kernel (#4595)

2025-12-25 06:06:18 +08:00

[CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667)

2025-12-23 10:04:37 +08:00

[quantization] Add w8a16 quantization support (#4541)

2025-12-24 19:49:32 +08:00

Add MagicMTP(block verify) and Triton optimization (#4443)

2025-12-25 09:00:25 +08:00

[main][Refactor] Remove with_prefill parameter from set_ascend_forward_context (#5094)

2025-12-23 14:30:50 +08:00

Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) (#5317)

2025-12-24 22:24:17 +08:00

[Feat]Xlite Qwen3-vl Support (#5228)

2025-12-22 16:30:52 +08:00

__init__.py

clean up model module (#4611)

2025-12-02 17:35:47 +08:00

ascend_config.py

[Performance] Add async exponential while model executing (#4501)

2025-12-20 21:23:21 +08:00

ascend_forward_context.py

[main][Refactor] Remove with_prefill parameter from set_ascend_forward_context (#5094)

2025-12-23 14:30:50 +08:00

cpu_binding.py

[main] support cpu binding (#3546)

2025-10-21 09:17:03 +08:00

envs.py

Cleanup useless env (#5270)

2025-12-24 22:07:59 +08:00

flash_common3_context.py

[Perf]enable prefill flashcommon3 (#4065)

2025-12-14 09:34:13 +08:00

meta_registration.py

Fix the bugs about operator registration by PyTorch Dispatcher (#2786)

2025-09-13 11:58:52 +08:00

platform.py

update to vllm 12-19 (#5223)

2025-12-23 23:52:11 +08:00

profiling_config.py

Drop ascend scheduler (#4623)

2025-12-05 09:03:45 +08:00

utils.py

[bugfix] remove the EP buffer allocation introduced by fused-op dispatch_ffn_c… (#5284)

2025-12-24 11:26:19 +08:00