xc-llm-ascend

Files

Aoxuan Chen 6d25372baa Add MagicMTP(block verify) and Triton optimization (#4443)

### What this PR does / why we need it?
1. Adds MagicMTP, based on the paper "Block Verification Accelerates
Speculative Decoding". It verifies the draft tokens jointly as a block,
accounting for how they influence one another, which improves the
acceptance rate without compromising accuracy.
2. Rewrites the rejection sampling logic in rejection_sampler.py with
Triton-Ascend so that it keeps up under high concurrency, removing the
CPU and NPU operator bottlenecks and improving throughput.
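
For context, the token-by-token rejection sampling that block verification builds on can be sketched as follows. This is a minimal pure-Python illustration of the standard speculative-decoding acceptance rule, not the Triton-Ascend kernel from this PR and not the block verification algorithm itself. The function name `verify_draft_tokens` and its argument layout are assumptions made for the example.

```python
import random


def verify_draft_tokens(draft_tokens, draft_probs, target_probs, rng=None):
    """Token-by-token rejection sampling (illustrative sketch only).

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions over the vocabulary; draft_probs[i] is the
                  draft distribution that produced draft_tokens[i].
    target_probs: k + 1 target distributions; the last one is used for the
                  bonus token when every draft token is accepted.
    Returns the accepted prefix plus one corrected or bonus token.
    """
    rng = rng or random.Random()
    output = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(token) / q(token)).
        if rng.random() < min(1.0, p[token] / q[token]):
            output.append(token)
            continue
        # First rejection: resample from the residual max(p - q, 0) and stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        output.append(rng.choices(range(len(residual)), weights=residual)[0])
        return output
    # Every draft token was accepted: draw one bonus token from the target.
    bonus = target_probs[len(draft_tokens)]
    output.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return output
```

Because each token is accepted or rejected on its own, one early rejection discards all later draft tokens. Block verification instead decides on the draft block as a whole, which is where the higher acceptance rate described above comes from.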

### Does this PR introduce _any_ user-facing change?
Yes. MagicMTP takes effect automatically when the parameter
"num_speculative_tokens" is 3 or greater.
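
The activation rule can be pictured with a small sketch. The helper `uses_block_verify` and the constant below are hypothetical and exist only to illustrate the threshold; they are not part of vLLM or vllm-ascend, and the real check may live elsewhere and look different.

```python
# Hypothetical illustration of the threshold described above.
MAGIC_MTP_MIN_DRAFT_TOKENS = 3  # value taken from the PR description


def uses_block_verify(speculative_config):
    """Return True when the configured draft length would enable MagicMTP."""
    num_tokens = speculative_config.get("num_speculative_tokens", 0)
    return num_tokens >= MAGIC_MTP_MIN_DRAFT_TOKENS


# Two draft tokens stay on the existing path; three or more switch over.
print(uses_block_verify({"num_speculative_tokens": 2}))
print(uses_block_verify({"num_speculative_tokens": 3}))
```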


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: chenaoxuan <cax1165@163.com>

2025-12-25 09:00:25 +08:00

features

Drop torchair (#4814)

2025-12-10 09:20:40 +08:00

models

[TEST]Update mm param --mm-processor-cache-gb (#5242)

2025-12-22 18:54:03 +08:00

multi_node

[Doc][P/D] Fix MooncakeConnector's name (#5172)

2025-12-18 22:29:19 +08:00

multicard_ops

[Feature] Add token mask for DispatchGmmCombineDecode operator (#5171)

2025-12-19 16:31:48 +08:00

ops

Add MagicMTP(block verify) and Triton optimization (#4443)

2025-12-25 09:00:25 +08:00