xc-llm-ascend

Author	SHA1	Message	Date
bowenli	048c8d1afe	[v0.18.0][Bugfix] Fix the bug of MTP1 crashing in multiple concurrent scenarios. (#7699) ### What this PR does / why we need it? The Triton operator does not bounds-check the global position inside its loop, which leads to a memory overflow when multiple concurrent requests run with 1-step MTP. Solution: add a `global_pos < vec_len` check and strictly bound every memory access to avoid out-of-bounds writes. Backport: #7459 Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>	2026-03-27 14:13:12 +08:00
HarpsealCC	d6661c09b6	[v0.18.0][kernel] Recompilation optimization triggered by triton function parameter optimization (#7647) ### What this PR does / why we need it? Some parameters of Triton operators are unnecessarily declared as `constexpr`. Whenever the value of such a parameter changes, the kernel is recompiled, which significantly degrades model performance. This PR corrects those parameter declarations. - vLLM version: v0.17.0 - vLLM main: `8b6325758c` Signed-off-by: HarpSealCC <844291270@qq.com> Signed-off-by: l30072083 <liuchengzhuo1@h-partners.com> Co-authored-by: l30072083 <liuchengzhuo1@h-partners.com>	2026-03-26 19:10:45 +08:00
linfeng-yuan	700423156f	[Triton] Centralize Ascend extension op dispatch in triton_utils (#6937) ### What this PR does / why we need it? This pull request refactors the dispatch mechanism for the triton-ascend-specific operators `insert_slice`, `extract_slice`, and `get_element` to ensure compatibility with both CANN 8.5 and 9.0. A unified helper function, `_resolve_triton_ascend_op`, has been introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function dynamically resolves these operators by first attempting to import them from the `triton.language.extra.cann.extension` module, which is present in newer CANN versions. If that fails, it falls back to the standard `triton.language` module. This approach centralizes operator dispatch logic, allowing individual Triton kernels to use these functions without being aware of the underlying Triton/CANN version. All call sites have been updated to use these new unified functions. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring of operator implementations and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. Testing Context: - vLLM version: v0.16.0 - vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-03 17:10:30 +08:00
SILONG ZENG	78af0c30a3	[Lint]Style: Convert `vllm-ascend/` to ruff format (Batch #12) (#6177) ### What this PR does / why we need it? Scope of changes (file paths): `vllm_ascend/ops/triton/activation/swiglu_quant.py`, `vllm_ascend/ops/triton/batch_invariant/matmul.py`, `vllm_ascend/ops/triton/batch_invariant/mean.py`, `vllm_ascend/ops/triton/batch_invariant/rmsnorm.py`, `vllm_ascend/ops/triton/fla/chunk.py`, `vllm_ascend/ops/triton/fla/chunk_delta_h.py`, `vllm_ascend/ops/triton/fla/chunk_o.py`, `vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py`, `vllm_ascend/ops/triton/fla/cumsum.py`, `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py`, `vllm_ascend/ops/triton/fla/l2norm.py`, `vllm_ascend/ops/triton/fla/layernorm_guard.py`, `vllm_ascend/ops/triton/fla/sigmoid_gating.py`, `vllm_ascend/ops/triton/fla/solve_tril.py`, `vllm_ascend/ops/triton/fla/utils.py`, `vllm_ascend/ops/triton/fla/wy_fast.py`, `vllm_ascend/ops/triton/fused_gdn_gating.py`, `vllm_ascend/ops/triton/layernorm_gated.py`, `vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py`, `vllm_ascend/ops/triton/mamba/causal_conv1d.py`, `vllm_ascend/ops/triton/reject_sample.py`, `vllm_ascend/ops/triton/rope.py`, `vllm_ascend/ops/triton/spec_decode/utils.py`, `vllm_ascend/ops/triton/triton_utils.py`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-23 14:59:19 +08:00
Aoxuan Chen	8763953f56	[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542) ### What this PR does / why we need it? 1. MagicMTP (paper: "Block Verification Accelerates Speculative Decoding") was introduced to consider the influence among multiple draft tokens, improving the acceptance rate without compromising accuracy. 2. Added Triton and PyTorch implementations, and added E2E test cases. ### Does this PR introduce _any_ user-facing change? MagicMTP will automatically take effect when the parameter `num_speculative_tokens` >= 3. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: chenaoxuan <cax1165@163.com>	2026-01-08 09:15:55 +08:00
daniel	8ffe3f5d78	feat: implement high-performance Triton kernels for rejection sampling: optimization for rejection_random_sample_kernel (#5259) ### What this PR does / why we need it? This PR introduces an optimized Triton implementation of `rejection_random_sample_kernel` that outperforms the existing Triton implementation. The new Triton kernels maintain full functional accuracy while delivering significant performance improvements across various batch sizes and MTP configurations. ### Does this PR introduce _any_ user-facing change? Yes, this PR modifies rejection_sampler.py to use the optimized Triton kernels: `rejection_random_sample_kernel` is modified and optimized. ### How was this patch tested? Performance benchmark results, listed as (batch size, MTP, original implementation in us, optimized version in us): (1, 1, 2.934, 3.64); (8, 1, 4.467, 4); (32, 1, 6.98, 4.54); (64, 1, 11.087, 6.42); (128, 1, 13.414, 7.84); (256, 1, 19.66, 8.487); (512, 1, 39.908, 11.62); (1024, 1, 81.781, 18.16); (2048, 1, 137.923, 32.934); (1, 2, 3.4, 4.02); (8, 2, 3.74, 4.24); (32, 2, 6.373, 7.394); (64, 2, 9.747, 6.46); (128, 2, 12.98, 7.76); (256, 2, 20.834, 9.787); (512, 2, 39.314, 13.56); (1024, 2, 83.135, 22.387); (2048, 2, 157.563, 40.607). - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2026-01-05 16:03:02 +08:00
whx	28b7614322	[Refactor][Triton] Move reject sample triton kernels into ops/triton (#5324) ### What this PR does / why we need it? This PR moves the Triton kernels related to rejection sampling into `ops/triton`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing tests. - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-12-29 16:15:41 +08:00
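
The boundary check described in commit 048c8d1afe can be illustrated with a plain-Python emulation of a blocked loop. This is only a sketch under stated assumptions, not the kernel from the PR: `masked_block_copy`, `block_size`, `num_iters`, and `start` are hypothetical names, and only `global_pos` and `vec_len` come from the commit message. In a Triton kernel the same guard would typically be passed as the `mask` argument of `tl.load` and `tl.store`.

```python
def masked_block_copy(src, dst, vec_len, block_size, num_iters, start=0):
    """Copy src into dst block by block, skipping positions at or past vec_len.

    Hypothetical helper: emulates a kernel loop in which each iteration
    handles block_size lanes.
    """
    for it in range(num_iters):
        for lane in range(block_size):
            global_pos = start + it * block_size + lane
            # The guard: lanes whose global position falls outside the
            # vector neither read nor write.
            if global_pos < vec_len:
                dst[global_pos] = src[global_pos]
    return dst


src = [1, 2, 3, 4, 5]
dst = [0] * 8  # three trailing slots that must stay untouched
# Two iterations of four lanes cover positions 0..7, but only 0..4 are valid.
masked_block_copy(src, dst, vec_len=len(src), block_size=4, num_iters=2)
```

Without the `global_pos < vec_len` test, the second iteration would index past the end of `src`; in Python that raises an `IndexError`, whereas a device kernel may silently write out of bounds.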

7 Commits
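
The dispatch approach described in commit 700423156f (try the extension module first, then fall back to the standard module) can be sketched generically as follows. This is a minimal illustration, not the actual `_resolve_triton_ascend_op`: the helper name `resolve_op` and its signature are assumptions, and standard-library modules stand in for `triton.language.extra.cann.extension` and `triton.language` so that the example runs without Triton installed.

```python
import importlib


def resolve_op(name, module_names):
    """Return attribute `name` from the first listed module that provides it.

    Hypothetical helper: modules are tried in order, and a module that
    cannot be imported is skipped.
    """
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module absent in this environment; try the next one
        op = getattr(module, name, None)
        if op is not None:
            return op
    raise ImportError(f"{name!r} not found in any of {list(module_names)}")


# Stand-in demonstration: the first candidate does not exist, so the
# lookup falls back to the second one.
sqrt = resolve_op("sqrt", ["module_that_does_not_exist", "math"])
```

Resolving the operator once at import time keeps the call sites free of version checks, which matches the intent stated in the commit message.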