xc-llm-ascend

Author	SHA1	Message	Date
zxr2333	239683c7a6	[P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022 ) ### What this PR does / why we need it? Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-10 23:59:20 +08:00
ZT-AIA	ee5347e824	[qwen3 next ]add ascend c casual_conv1d_fn (#6661 ) ### What this PR does / why we need it? add ascend c casual_conv1d_fn - vLLM version: v0.15.0 - vLLM main: `13397841ab` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-09 23:29:49 +08:00
LeeWenquan	65eae6de7b	Add Ascend Ops recurrent_gated_delta_rule (#6725 ) ### What this PR does / why we need it? Change recurrent_gated_delta_rule ops from triton to ascend C version for better performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-03-09 14:14:14 +08:00
ZhaoJiangJiang	a51d6366b9	[Bugfix] Qwen3Next support FlashComm1 (#6830 ) ### What this PR does / why we need it? Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence Parallel (SP) and resolve precision problems in shared_out when both FlashComm1 is enabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com> Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>	2026-03-06 17:14:08 +08:00
songjianquan	43c8da3574	[Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740 ) ### What this PR does / why we need it? This pull request optimizes the fused_qkvzba_split_reshape_cat Triton kernel for Qwen3-Next GatedDeltaNet model and removes the previous conditional restrictions in the forward pass. Key changes: 1. Refactored Triton kernel implementation: The fused_qkvzba_split_reshape_cat_kernel has been optimized with a new loop-based approach that supports arbitrary num_v_heads / num_k_heads ratios and batch sizes. The kernel now uses configurable ROWS_PER_ITER for better memory utilization . 2. The optimized kernel now handles all scenarios directly without requiring a fallback path using fix_query_key_value_ordering and torch.cat. ### Does this PR introduce _any_ user-facing change? No. This is an internal optimization of the Triton kernel implementation and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with existing tests. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: songjianquan <songjianquan1@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:41:38 +08:00
SILONG ZENG	19b5d44ea8	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #10 ) (#6173 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/ops/layer_shard_linear.py`\| \|`vllm_ascend/ops/linear.py`\| \|`vllm_ascend/ops/linear_op.py`\| \|`vllm_ascend/worker/worker.py`\| \| ` vllm_ascend/patch/worker/patch_bert.py` \| \| ` vllm_ascend/patch/worker/patch_deepseek.py` \| \| ` vllm_ascend/patch/worker/patch_distributed.py` \| \| ` vllm_ascend/patch/worker/patch_module.py` \| \| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` \| \| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` \| \| ` vllm_ascend/patch/worker/patch_rope.py` \| \| ` vllm_ascend/patch/worker/patch_triton.py` \| \| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` \| \| ` vllm_ascend/patch/worker/patch_v2_egale.py` \| \|` vllm_ascend/worker/npu_input_batch.py`\| \|` vllm_ascend/worker/v2/aclgraph_utils.py`\| \|` vllm_ascend/worker/v2/attn_utils.py`\| \|` vllm_ascend/worker/v2/model_runner.py`\| \|` vllm_ascend/worker/v2/sample/gumbel.py`\| \|` vllm_ascend/worker/v2/sample/penalties.py`\| \|` vllm_ascend/worker/v2/sample/sampler.py`\| \|` vllm_ascend/worker/v2/spec_decode/__init__.py`\| \|` vllm_ascend/worker/v2/spec_decode/eagle.py`\| \|` vllm_ascend/worker/v2/states.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 15:35:06 +08:00
LeeWenquan	b1de6cbb31	[Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047 ) ### What this PR does / why we need it? Fix a bug in the repo and add a test case for MTP + Full Decode Only + Qwen3Next. The _build_dummy_attn_metadata function in NPUModelRunner seems losed a query_star_loc.copy_to_gpu operation, which will lead to difference between query_start_loc and query_start_loc_cpu, and they are required to be same in MTP + Full Decode Only + Qwen3Next case. Before this pr: `self.query_start_loc = [0, 0, 0, 0, ... , 0] self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-02-03 14:26:21 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are caused by https://github.com/vllm-project/vllm/pull/32148 where vllm is also failed https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reachs https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
Ascendyh	b2c121637f	[task] Add fused gdn gating triton kernel (#4304 ) ### What this PR does / why we need it? This commit introduces a Triton-based fused GDN gating kernel for Ascend NPU, aimed at improving performance in the Gated Delta Net workflow. ### Does this PR introduce _any_ user-facing change? It only adds and refactors internal Triton kernels and wrappers for Ascend. These are backend implementation details. There are no new APIs, flags, CLI options, or behavior changes visible to end users. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ascendyh <hw7osiris@outlook.com>	2025-12-22 14:09:19 +08:00
XiaoxinWang	0cc3fc357f	[pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818 ) ### What this PR does / why we need it? qwen3_next add fused_sigmoid_gating_delta_rule_update op which fused fused_gdn_gating+fused_recurrent_gated_delta_rule - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-12-19 16:34:11 +08:00
ZT-AIA	39fb9e7c83	qwen3_next add triton ops : fused_qkvzba_split_reshape (#4788 ) ### What this PR does / why we need it? add triton ops fused_qkvzba_split_reshape_cat for qwen3_next GatedDeltaNet ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2025-12-18 11:31:04 +08:00

12 Commits