xc-llm-ascend

Author	SHA1	Message	Date
songjianquan	43c8da3574	[Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740 ) ### What this PR does / why we need it? This pull request optimizes the fused_qkvzba_split_reshape_cat Triton kernel for Qwen3-Next GatedDeltaNet model and removes the previous conditional restrictions in the forward pass. Key changes: 1. Refactored Triton kernel implementation: The fused_qkvzba_split_reshape_cat_kernel has been optimized with a new loop-based approach that supports arbitrary num_v_heads / num_k_heads ratios and batch sizes. The kernel now uses configurable ROWS_PER_ITER for better memory utilization . 2. The optimized kernel now handles all scenarios directly without requiring a fallback path using fix_query_key_value_ordering and torch.cat. ### Does this PR introduce _any_ user-facing change? No. This is an internal optimization of the Triton kernel implementation and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:41:38 +08:00
linfeng-yuan	700423156f	[Triton] Centralize Ascend extension op dispatch in triton_utils (#6937 ) ### What this PR does / why we need it? This pull request refactors the dispatch mechanism for the triton-ascend-specific operators `insert_slice`, `extract_slice`, and `get_element` to ensure compatibility with both CANN 8.5 and 9.0. A unified helper function, `_resolve_triton_ascend_op`, has been introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function dynamically resolves these operators by first attempting to import them from the `triton.language.extra.cann.extension` module, which is present in newer CANN versions. If that fails, it falls back to the standard `triton.language` module. This approach centralizes operator dispatch logic, allowing individual Triton kernels to use these functions without being aware of the underlying Triton/CANN version. All call sites have been updated to use these new unified functions. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring of operator implementations and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. Testing Context: - vLLM version: v0.16.0 - vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-03 17:10:30 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Parallel Context Parallelism) and DCP (Dynamic/Data Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
SILONG ZENG	78af0c30a3	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #12 ) (#6177 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/ops/triton/activation/swiglu_quant.py` \| \| `vllm_ascend/ops/triton/batch_invariant/matmul.py` \| \| `vllm_ascend/ops/triton/batch_invariant/mean.py` \| \| `vllm_ascend/ops/triton/batch_invariant/rmsnorm.py` \| \| `vllm_ascend/ops/triton/fla/chunk.py` \| \| `vllm_ascend/ops/triton/fla/chunk_delta_h.py` \| \| `vllm_ascend/ops/triton/fla/chunk_o.py` \| \| `vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py` \| \| `vllm_ascend/ops/triton/fla/cumsum.py` \| \| `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py` \| \| `vllm_ascend/ops/triton/fla/l2norm.py` \| \| `vllm_ascend/ops/triton/fla/layernorm_guard.py` \| \| `vllm_ascend/ops/triton/fla/sigmoid_gating.py` \| \| `vllm_ascend/ops/triton/fla/solve_tril.py` \| \| `vllm_ascend/ops/triton/fla/utils.py` \| \| `vllm_ascend/ops/triton/fla/wy_fast.py` \| \| `vllm_ascend/ops/triton/fused_gdn_gating.py` \| \| `vllm_ascend/ops/triton/layernorm_gated.py` \| \| `vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py` \| \| `vllm_ascend/ops/triton/mamba/causal_conv1d.py` \| \| `vllm_ascend/ops/triton/reject_sample.py` \| \| `vllm_ascend/ops/triton/rope.py` \| \| `vllm_ascend/ops/triton/spec_decode/utils.py` \| \| `vllm_ascend/ops/triton/triton_utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-23 14:59:19 +08:00
meihanc	16b1bee804	[bugfix] fix test_camem failed with triton-ascend (#5492 ) ### What this PR does / why we need it? This fixes a bug that occurred when running `test_camem.py` in the triton-ascend environment `NPU function error: aclrtGetMemInfo(ACL_HBM_MEM, &device_free, &device_total)` - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-05 20:10:54 +08:00
XiaoxinWang	320877d488	move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274 ) ### What this PR does / why we need it? The contiguous() operation temporarily increases memory usage, leading to higher peak GPU memory, which necessitates reducing gpu_memory_utilization. However, making tensors contiguous in modelrunnerv1 significantly enhances operator performance, resulting in greater end-to-end model benefits despite the memory overhead. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-26 09:19:47 +08:00
Ascendyh	a90482803d	[Kernel] add l2norm triton kernel (#4595 ) ### What this PR does / why we need it? This pull request introduces an L2 normalization kernel implemented in Triton, specifically optimized for Ascend NPUs. ### Does this PR introduce _any_ user-facing change? No, this PR does not introduce any user-facing changes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-25 06:06:18 +08:00
XiaoxinWang	0cc3fc357f	[pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818 ) ### What this PR does / why we need it? qwen3_next add fused_sigmoid_gating_delta_rule_update op which fused fused_gdn_gating+fused_recurrent_gated_delta_rule - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-19 16:34:11 +08:00
ZT-AIA	39fb9e7c83	qwen3_next add triton ops : fused_qkvzba_split_reshape (#4788 ) ### What this PR does / why we need it? add triton ops fused_qkvzba_split_reshape_cat for qwen3_next GatedDeltaNet ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2025-12-18 11:31:04 +08:00
shiyuan680	1c4a0468ee	【OPS】qwen3-next support triton chunk_gated_delta_rule ops (#4070 ) ### What this PR does / why we need it? qwen3-next suppot triton chunk_gated_delta_rule ops ### co-owners @OsirisDuan - vLLM version: v0.11.2 Signed-off-by: shiyuan680 <917935075@qq.com>	2025-11-28 20:55:43 +08:00
wangxiyuan	bc69d7cfe1	upgrade to vllm 0.11.2 (#4400 ) Bump vLLM version to v0.11.2 What's broken and changed by vLLM: 1. structured_output is broken by https://github.com/vllm-project/vllm/pull/26866 2. get_mrope_input_positions is broken by https://github.com/vllm-project/vllm/pull/28399 3. graph mode is broken by https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to 2.8 to fix the problem later 4. embedding is broken by https://github.com/vllm-project/vllm/pull/27583 5. `get_attn_backend_cls` and attention backend is broken are broken by https://github.com/vllm-project/vllm/pull/28534 6. spec decode is broken by https://github.com/vllm-project/vllm/pull/28771 7. sp feature is broken by https://github.com/vllm-project/vllm/pull/27126 8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922 9. lora is broken by https://github.com/vllm-project/vllm/pull/21068 10. execute_model is broken by https://github.com/vllm-project/vllm/pull/26866 11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by https://github.com/vllm-project/vllm/pull/28159 12. kv cahe is broken by https://github.com/vllm-project/vllm/pull/27753 13. dp is broken by https://github.com/vllm-project/vllm/pull/25110 What's broken and changed by ourself: 1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455 We'll remove model files in the future to avoid this kind of error 2. Engine core is broken by https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch file in the future. 3. Ascend scheduler is broken by https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend scheudler later. 4. qwen3-next is broken by https://github.com/vllm-project/vllm/pull/28083 We'll remove model files in the future to avoid this kind of error 5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764. We'll remove model files in the future Known issue: 1. ray doesn't work 2. the accuracy of qwen3-next is not correct 3. qwen3-vl is broken 4. prefix cache+ ascend scheduler + deepseek v2 lite is broken. Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: shen-shanshan <467638484@qq.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-11-26 11:48:58 +08:00
shiyuan680	d5f77f14d0	mkdir triton package and move triton files (#4420 ) ### What this PR does / why we need it? mkdir triton package and move triton files - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: shiyuan680 <917935075@qq.com>	2025-11-26 11:06:12 +08:00

12 Commits