Commit Graph

2518 Commits

Author SHA1 Message Date
chenxi-hh
737dfcf638 [MOE] commit GMM custom operator (#7010)
### What this PR does / why we need it?
Optimizes the GMM custom operator for small-batch scenarios.

### How was this patch tested?
Submits the GMM custom operator for subsequent integration into the MoE
process.


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
2026-03-09 09:56:31 +08:00
lilinsiman
01d3515dcf [eagle][cp][bugfix] Fix the bug in eagle and cp enabled (#6981)
### What this PR does / why we need it?
When eagle and cp are enabled at the same time, an error occurs in
pcp_allgather because of the hidden_states handling. This PR fixes this issue.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-06 20:49:49 +08:00
aipaes
1c0ecf806a [bugfix] fix pass bug: pass really rope dim for npu_rotary_embedding (#6880)
### What this PR does / why we need it?
Pass the real rope dim to npu_rotary_embedding.
**before:**
    q_rope, k_rope = torch.ops.vllm.npu_rotary_embedding(
        positions, q_flat, k_flat, cos_sin_cache, self.head_dim,
        **self.head_dim,** True
    )
**after:**
    q_rope, k_rope = torch.ops.vllm.npu_rotary_embedding(
        positions, q_flat, k_flat, cos_sin_cache, self.head_dim,
        **self.rope_dim,** True
    )
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-06 19:35:17 +08:00
tanhaoan333
094eb0eff9 [bugfix]Qwen-Omni quantization bugfix (#7042)
### What this PR does / why we need it?
Fixes the Qwen-Omni quantization weight mapping to float weights.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-06 17:24:22 +08:00
ZhaoJiangJiang
a51d6366b9 [Bugfix] Qwen3Next support FlashComm1 (#6830)
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP) and resolve precision problems in shared_out when both
FlashComm1 and SP are enabled.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
2026-03-06 17:14:08 +08:00
Zetong Li
a2696006d1 [Refactor][EAGLE] 8/N delete mtp_proposer (re-pull) (#7033)
### What this PR does / why we need it?
**NOTE: This PR is a re-pull of #7016, since CI mistakenly marked the
unfinished PR as having passed.**

This PR deletes mtp_proposer. With a bug fixed in both dsv32 and glm5, it is
now safe to remove mtp_proposer. The bug was an unnecessary slicing of
`slot_mapping`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-06 17:11:22 +08:00
Fager10086
c5dfa8d645 [OPS]add split_qkv_rmsnorm_mrope ops (#6730)
### What this PR does / why we need it?
This PR adds a split_qkv_rmsnorm_mrope kernel with interleaved mrope support
for qwen3.5 and qwen3-vl to improve performance.

### Does this PR introduce _any_ user-facing change?
Does not.

### How to use?
```python
real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope(
    qkv=qkv,
    q_weight=q_weight,
    k_weight=k_weight,
    cos_sin=cos_sin,
    num_q_heads=num_q_heads,
    num_kv_heads=num_kv_heads,
    head_size=head_size,
    eps=eps,
    mrope_section=mrope_section,
    is_interleaved=is_interleaved,
    rope_dim=rope_dim,
    has_gate=has_gate,
)
```
### How was this patch tested?
- vLLM version: v0.16.0
- Accuracy test script:
```shell
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py
```

---------

Signed-off-by: Fager <865071616@qq.com>
Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com>
Signed-off-by: fager <865071616@qq.com>
2026-03-06 16:18:37 +08:00
xiaocongtou6
bc0fd7ca72 [Feat]Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. (#6940)
### What this PR does / why we need it?
Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for
DeepSeek v3.2.

### How was this patch tested?
Test output:

{"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":"
the head of state and head of government of the United States,
indirectly elected to a four-year term by the American people through
the Electoral College. The officeholder leads the executive branch of
the federal government and is the commander-in-chief of the United
States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":"
Paris. This is the largest city in France and its main political,
cultural and commercial center. The modern location of the city is the
north of the central part of the country, on the banks of the Seine
River Seine River Seine in
3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":"
now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and
artificial intelligence (AI) is at the forefront of this transformation.
From self-driving cars to virtual assistants, AI is already making a
significant impact on our daily
lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":"
a 3rd year student at the University of Lincoln studying Media
Production. This blog is about my work throughout my final year on the
course.\n\n## Tuesday 3 May 2016\n### Final Major Project -
Evaluation\n\nFor my final project
I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null}

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: xiaocongtou6 <2066962956@qq.com>
Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>
2026-03-06 16:10:24 +08:00
Shanshan Shen
a813eadd2d [MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017)
### What this PR does / why we need it?
Currently, we are using
e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)
for convolution computation, which is used in patch embedding for VL
models.

After profiling, we find that this linear method takes about **6.87
ms**, which is much slower than just using `F.conv3d()`.
`F.conv3d()` calls aclnn `BatchMatMulV2` with optimization on
Ascend NPU, which takes only about **2.50 ms** and is **2.7x faster**
than the linear method.
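
A minimal sketch of the replacement, assuming a Conv3d-style patch-embedding weight of shape `(out_ch, in_ch, kT, kH, kW)` and an illustrative stride; the actual layer wiring in vllm-ascend may differ:

```python
from typing import Optional, Tuple

import torch
import torch.nn.functional as F


def patch_embed_conv3d(x: torch.Tensor,
                       weight: torch.Tensor,
                       bias: Optional[torch.Tensor] = None,
                       stride: Tuple[int, int, int] = (2, 14, 14)) -> torch.Tensor:
    # x: (N, C, T, H, W) pixel values; weight: (out_ch, C, kT, kH, kW).
    # On Ascend NPU this single conv3d call hits the optimized aclnn path
    # (BatchMatMulV2 per the profiling above) instead of the slower
    # unfold-then-linear route.
    return F.conv3d(x, weight, bias, stride=stride)
```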

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-03-06 14:26:37 +08:00
wanghengkang
c49ce18ea5 [Test] Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p (#6977)
### What this PR does / why we need it?
Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: gcw_61wqY8cy <wanghengkang1@huawei.com>
2026-03-06 14:25:10 +08:00
aipaes
620076b76a [bugs] fix install FIA sh (#6989)
### What this PR does / why we need it?
Update the replacement shell script for the FIA operator FD feature in
CANN 8.5.1

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-06 11:42:32 +08:00
wangxiyuan
16c3b0b822 Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030)
Reverts vllm-project/vllm-ascend#7016
It breaks the E2E test.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
2026-03-06 11:24:05 +08:00
panchao-hub
8c2c82f3e1 [Bugfix] Fix the moe_forward error when setting enable_static_kernel … (#6964)
### What this PR does / why we need it?
Fix the moe_forward error when setting enable_static_kernel to true.
When static kernels are enabled, the forward pass runs twice
(compilation + capture), causing moe_layer_index to overflow. Wrap the
index to prevent out-of-bounds errors.
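
A minimal sketch of the wrap-around guard, with `num_moe_layers` and `moe_layer_index` as assumed names for the counter described above:

```python
class MoeLayerCounter:
    """Tracks which MoE layer is being executed across forward passes.

    When static kernels trigger an extra forward pass (compilation + capture),
    the raw counter would run past the last layer, so it is wrapped modulo the
    number of MoE layers instead of being allowed to overflow.
    """

    def __init__(self, num_moe_layers: int):
        self.num_moe_layers = num_moe_layers
        self.moe_layer_index = 0

    def next_index(self) -> int:
        idx = self.moe_layer_index % self.num_moe_layers  # wrap to stay in bounds
        self.moe_layer_index += 1
        return idx
```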

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2026-03-06 10:36:10 +08:00
pz1116
a7820d20f4 [Doc][KV Pool]Update Memcache local service config example: increase default world size to 256 and update description (#7025)
### What this PR does / why we need it?
Update Memcache local service config example: increase default world
size to 256 and update the description for better clarity.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-03-06 10:23:55 +08:00
MengLong Chen
a838a89630 [v0.16.0][P/D][Bugfix] Support ALL D-Nodes in fullgraph when running MTP in PD (#6948)
### What this PR does / why we need it?
Fix the bug for v0.16.0 recompute_scheduler, the same way as
https://github.com/vllm-project/vllm-ascend/pull/5472.

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2026-03-06 10:01:33 +08:00
LI SHENGYONG
ccd00798f3 [EPLB] Display the expert hotness comparison before and after eplb. (#6877)
### What this PR does / why we need it?
To intuitively show the effect of the eplb algorithm, we print the
expert hotness before and after eplb.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![Snipaste_2026-02-28_17-23-42](https://github.com/user-attachments/assets/db1dadd1-cf96-44da-af34-57d41ccf412f)


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-06 09:53:29 +08:00
frank
18b52afe2b [Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827)
### What this PR does / why we need it?

This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a
new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the
prefill stage (i.e., large batch sizes). The implementation now
dynamically selects between the existing decode kernel and the new
prefill kernel based on the batch size, which improves performance for
large batch scenarios.

Additionally, the RoPE implementation is updated to support partial
rotation dimensions (`rope_dim`), making the operator more flexible.

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization and is not expected to introduce
any user-facing changes.

### How was this patch tested?

CI should pass with existing tests. The new prefill path is triggered
when the batch size is larger than the number of available vector cores.
The partial RoPE feature can be tested by passing the `rope_dim`
argument.
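
A sketch of that dispatch, with placeholder functions standing in for the real Triton kernels (the names and the way the vector-core count is obtained are assumptions):

```python
import torch


def _decode_kernel(qkv: torch.Tensor, **kwargs) -> torch.Tensor:
    # Placeholder standing in for the existing decode Triton kernel.
    return qkv


def _prefill_kernel(qkv: torch.Tensor, **kwargs) -> torch.Tensor:
    # Placeholder standing in for split_qkv_rmsnorm_rope_prefill_kernel.
    return qkv


def split_qkv_rmsnorm_rope(qkv: torch.Tensor, batch_size: int,
                           num_vector_cores: int, **kwargs) -> torch.Tensor:
    # Large (prefill) batches go to the new kernel; small (decode) batches
    # keep the existing one, as described in the PR.
    if batch_size > num_vector_cores:
        return _prefill_kernel(qkv, **kwargs)
    return _decode_kernel(qkv, **kwargs)
```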
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: guzhiyong <guzhiyong5@h-partners.com>
Signed-off-by: frank <2547457096@qq.com>
Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>
2026-03-06 09:30:31 +08:00
Zetong Li
a60e179c7f [Refactor][EAGLE] 8/N delete mtp_proposer (#7016)
### What this PR does / why we need it?
This PR deletes mtp_proposer. With a bug fixed in both dsv32 and glm5, it is
now safe to remove mtp_proposer. The bug was an unnecessary slicing of
`slot_mapping`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-06 09:10:57 +08:00
SILONG ZENG
bd571cf6d6 [Main2Main] Upgrade vLLM to 0303 (#6944)
### What this PR does / why we need it?
Breaking changes:
- https://github.com/vllm-project/vllm/pull/34102 
Disable_full param replaced with valid_modes/invalid_modes API
- https://github.com/vllm-project/vllm/pull/35503
Now must return float compilation_time
- https://github.com/vllm-project/vllm/pull/35564
New sequence_lengths param added
- https://github.com/vllm-project/vllm/pull/33807
A check was performed (if runner_backend != "auto")
- https://github.com/vllm-project/vllm/pull/34861
`BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to
check process group state
- https://github.com/vllm-project/vllm/pull/35274

**Important change:**
- https://github.com/vllm-project/vllm/pull/28672

`matcher_utils` directly accesses `torch.ops._C.*` during the import
phase. In the Ascend environment, some unregistered ops trigger
`AttributeError`, causing e2e initialization failure.

https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323

https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29

This PR adds temporary compatibility placeholders (rms_norm,
fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant,
silu_and_mul) to
`vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to
ensure no crashes during the import phase. Upstream repairs will be
considered later.
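
An illustrative sketch of such import-time placeholders using `torch.library`; the schema shown is an assumption, not the real kernel signature, and the actual patch file may register the ops differently:

```python
import torch

# Extend (do not redefine) vLLM's "_C" op namespace.
_lib = torch.library.Library("_C", "FRAGMENT")


def _maybe_register_placeholder(name: str, schema: str) -> None:
    # Register a placeholder schema only when the real kernel is absent, so
    # matcher_utils can reference torch.ops._C.<name> at import time on Ascend
    # without raising AttributeError.
    if not hasattr(torch.ops._C, name):
        _lib.define(f"{name}{schema}")


_maybe_register_placeholder(
    "rms_norm", "(Tensor input, Tensor weight, float eps) -> Tensor")
```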

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
2026-03-06 09:08:52 +08:00
liuchen2026fly
640ecd1b77 [BugFix] Fix muls_add fusion not working for GLM5 models (#6928)
### What this PR does / why we need it?
fix: support model-specific routed_scaling_factor in muls_add fusion
Previously, MulsAddFusionPass used a hardcoded scale=1.0, which failed
to match the x * routed_scaling_factor + y pattern in models like GLM5
that use routed_scaling_factor=2.5. This caused the muls_add fusion to
be skipped, leaving unoptimized mul+add operations.

This fix reads routed_scaling_factor from model config (defaulting to
1.0
for backward compatibility) and uses it as the pattern scale, enabling
correct fusion for GLM5 and other models with custom scaling factors.

Fixes: Unoptimized mul+add in GLM5 attention blocks
Tested: GLM5-W8A8 with routed_scaling_factor=2.5
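
A minimal sketch of the idea, assuming an `hf_config`-style object; the real pass builds a torch.fx pattern with this scale rather than calling a function directly:

```python
import torch


def get_pattern_scale(hf_config) -> float:
    # GLM5-style configs carry routed_scaling_factor (e.g. 2.5); default to
    # 1.0 so models without the attribute keep the previous behaviour.
    return float(getattr(hf_config, "routed_scaling_factor", 1.0))


def muls_add_pattern(x: torch.Tensor, y: torch.Tensor, scale: float) -> torch.Tensor:
    # The pattern the fusion pass is meant to match: x * scale + y.
    return x * scale + y
```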
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: liuchenbing <chenliumail@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
2026-03-05 22:35:54 +08:00
fems14
ae394767d4 【main】ADXL/HIXL supports FabricMem Mode (#6806)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-03-05 21:04:11 +08:00
Cao Yi
50441e4650 [BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835)
## Problem
When MTP is enabled, prefill requests with `prompt_tokens ==
num_spec_tokens + 1` are incorrectly classified as decode requests,
causing accuracy issues.

## Root Cause
The `uniform_decode` condition only checked:
- `max_num_scheduled_tokens == uniform_decode_query_len`
- `num_tokens == max_num_scheduled_tokens * num_reqs`

This is insufficient because a prefill request with specific prompt
length satisfies these conditions as well.

## Fix
Add `is_all_decode` check to ensure all requests have
`num_computed_tokens > 0` before classifying as uniform decode, since
decode requests must have computed at least one token.
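
A sketch of the tightened condition, with `requests` standing in for the scheduled request metadata; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScheduledRequest:
    num_computed_tokens: int
    num_scheduled_tokens: int


def is_uniform_decode(requests: List[ScheduledRequest],
                      uniform_decode_query_len: int) -> bool:
    max_tokens = max(r.num_scheduled_tokens for r in requests)
    num_tokens = sum(r.num_scheduled_tokens for r in requests)
    # A decode request must have computed at least one token already; a
    # prefill whose prompt length equals num_spec_tokens + 1 would otherwise
    # slip through the two size checks below.
    is_all_decode = all(r.num_computed_tokens > 0 for r in requests)
    return (max_tokens == uniform_decode_query_len
            and num_tokens == max_tokens * len(requests)
            and is_all_decode)
```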
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-05 17:33:10 +08:00
dsxsteven
91c39ebae6 [BugFix] [dcp] Fix GQA Model Error when Enable both DP and DCP (#7012)
### What this PR does / why we need it?
For GQA models, when both DP and DCP are enabled (with PCP disabled),
key-value pairs were not being captured correctly; this PR fixes that.


Signed-off-by: dsxsteven <dsxsteven@sina.com>
2026-03-05 16:51:08 +08:00
zhangxinyuehfad
1e4017e3fa [CI] support nightly ci for per pr by labels (#6483)
### What this PR does / why we need it?

This PR refactors the nightly CI workflows (A2 and A3) to support
running tests against a specific PR's code, in addition to the existing
scheduled/dispatch runs using pre-built images.

#### Motivation:
Previously, nightly tests could only be triggered by schedule or
workflow_dispatch, always using the pre-built nightly image. This change
allows developers to trigger nightly tests against their own PR's source
code, enabling early validation without waiting for a nightly build.

#### Changes
Trigger logic (parse-trigger job)

A new parse-trigger job is introduced in both
schedule_nightly_test_a2.yaml and schedule_nightly_test_a3.yaml to
centralize trigger evaluation:

`schedule / workflow_dispatch`: runs all tests with the pre-built image
(existing behavior preserved)
`pull_request (labeled + synchronize)`: runs only when the PR has the
nightly-test label and a `/nightly [test-names]` comment exists (the latest
one wins):

1. /nightly or /nightly all — runs all tests
2. /nightly test1 test2 — runs only named tests (comma-wrapped for exact
matching)

#### How to trigger
1. Add the nightly-test label to your PR
2. Comment /nightly (all tests) or /nightly test1 test2 (specific tests)
3. Re-triggering: add another /nightly comment and push a new commit
(synchronize event)

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:46:37 +08:00
zhangxinyuehfad
a6745b8577 [CI] fix test_qwen3_moe_external_launcher_ep_tp2 (#6951)
### What this PR does / why we need it?
Fixes test_qwen3_moe_external_launcher_ep_tp2 by using
wait_until_npu_memory_free.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:43:45 +08:00
tanhaoan333
1f2a083597 [bugfix]Qwen-Omni quantization model_type bugfix (#7007)
### What this PR does / why we need it?
[bugfix]Qwen-Omni quantization model_type bugfix
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-05 16:34:34 +08:00
realliujiaxu
1a7f845696 [Feat][Worker] NPUWorker Profiler profile_prefix full adaptation (RFC #6954) (#6968)
## What this PR does / why we need it?

Implements [RFC
#6954](https://github.com/vllm-project/vllm-ascend/issues/6954):
NPUWorker Profiler profile_prefix full adaptation for API parity with
upstream vLLM.

### Changes
- **Lazy profiler init**: Defer profiler creation until first
`profile(is_start=True)` call
- **profile_prefix param**: Add `profile_prefix` to `profile()`; compute
`trace_name` from prefix + `get_worker_rank_suffix()`
- **Refactor `_init_profiler` → `_create_profiler(trace_name)`**: Pass
`worker_name` to `tensorboard_trace_handler` for unique trace files per
worker
- Unique trace files per worker; no collision in multi-worker setups
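
A rough sketch of the lazy-init flow, reusing the names above (`_create_profiler`, `get_worker_rank_suffix`) but stubbing their bodies; the real implementation creates the actual profiler and passes the trace name to `tensorboard_trace_handler`:

```python
class NPUWorkerProfilerSketch:
    """Sketch: defer profiler creation until the first profile(is_start=True)."""

    def __init__(self):
        self.profiler = None  # no longer created in __init__

    def get_worker_rank_suffix(self) -> str:
        return "_rank0"  # stub; the real suffix encodes the worker rank

    def _create_profiler(self, trace_name: str):
        # Stub; the real method forwards trace_name as worker_name to
        # tensorboard_trace_handler so each worker gets a unique trace file.
        return object()

    def profile(self, is_start: bool, profile_prefix: str = "profile") -> None:
        if is_start and self.profiler is None:
            trace_name = profile_prefix + self.get_worker_rank_suffix()
            self.profiler = self._create_profiler(trace_name)
        # start/stop logic on self.profiler would follow here
```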

### Testing
- Unit tests updated/added in `tests/ut/worker/test_worker_v1.py`
- `pytest tests/ut/worker/test_worker_v1.py::TestNPUWorker` passed

## Does this PR introduce _any_ user-facing change?
Yes. Trace file naming may differ (more descriptive with worker rank
suffix). `profile(is_start=True, profile_prefix="warmup")` now
supported.

## How was this patch tested?
- Unit tests:`pytest tests/ut/worker/test_worker_v1.py::TestNPUWorker`
- Manual: vLLM serve with profiler config, start/stop profile, verified
trace files

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-05 16:18:34 +08:00
LeeWenquan
3047b724b3 Add GemmaRmsNorm ACLGraph Support (#6473)
### What this PR does / why we need it?
1. New Custom NPU Operation: Introduced npu_gemma_rms_norm in
csrc/torch_binding.cpp to provide optimized Gemma RMS Normalization
support for Ascend NPUs. This function includes logic to handle dynamic
shapes for the gamma tensor.
2. PyTorch Operator Registration: The new npu_gemma_rms_norm operation
has been registered with the PyTorch custom operator library, making it
accessible from Python.
3. Meta-Implementation for ACLGraph: A corresponding meta-implementation,
npu_gemma_rms_norm_meta, was added in csrc/torch_binding_meta.cpp. This
is crucial for symbolic tracing and allows the custom kernel to be
captured and optimized by ACLGraph.
4. Python Frontend Integration: The vllm_ascend/ops/layernorm.py file
was updated to utilize the newly added
torch.ops._C_ascend.npu_gemma_rms_norm for Gemma RMS Normalization,
replacing the generic torch_npu.npu_rms_norm.
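
A hedged sketch of what a meta-implementation for such an op does (shape/dtype propagation only, no computation) so graph capture can trace through it; the real registration lives in csrc/torch_binding_meta.cpp, and the second output is an assumption mirroring npu_rms_norm's convention:

```python
import torch


def npu_gemma_rms_norm_meta(x: torch.Tensor, gamma: torch.Tensor,
                            epsilon: float = 1e-6):
    # A meta implementation never touches data: it only reports output shapes
    # and dtypes so ACLGraph / the compiler can allocate placeholders.
    out = torch.empty_like(x)
    rstd = torch.empty(x.shape[:-1] + (1,), dtype=torch.float32, device=x.device)
    return out, rstd
```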
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
Signed-off-by: LeeWenquan <83354342+SunnyLee151064@users.noreply.github.com>
2026-03-05 16:15:07 +08:00
LI SHENGYONG
5a3744c542 [EPLB] The profiling can collect the time required for adjusting the eplb. (#7001)
### What this PR does / why we need it?
To analyze the overhead of the dynamic eplb adjustment framework in
detail, we added the time consumption of the adjustment to the print
information in profiling mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


![Snipaste_2026-03-05_11-42-28](https://github.com/user-attachments/assets/41c2b82a-5dfa-4e39-8b50-f4649deed30c)

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-05 16:10:57 +08:00
songjianquan
43c8da3574 [Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740)
### What this PR does / why we need it?

This pull request optimizes the fused_qkvzba_split_reshape_cat Triton
kernel for the Qwen3-Next GatedDeltaNet model and removes the previous
conditional restrictions in the forward pass.
Key changes:
1. Refactored Triton kernel implementation: The
fused_qkvzba_split_reshape_cat_kernel has been optimized with a new
loop-based approach that supports arbitrary num_v_heads / num_k_heads
ratios and batch sizes. The kernel now uses a configurable ROWS_PER_ITER
for better memory utilization.
2. The optimized kernel now handles all scenarios directly without
requiring a fallback path using fix_query_key_value_ordering and
torch.cat.

### Does this PR introduce _any_ user-facing change?
No. This is an internal optimization of the Triton kernel implementation
and does not introduce any user-facing changes.

### How was this patch tested?
CI is expected to pass with existing tests.

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:41:38 +08:00
wangxiyuan
13777bf3f0 [Spec Decode]clean up spec decode interface (#6947)
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.

This is the first step. In the future we should remove the class
register and just add a few Ascend-specific methods once the arch in vLLM
is ready.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:30:10 +08:00
rjg-lyh
2bd9c35788 [perf][refactor] Refactor and optimize sfa_v1.py for dsv3.2/glm5 (#6874)
### What this PR does / why we need it?
This PR refactors sfa_v1.py to improve code readability and usability,
fixes a code bug, and enhances performance through the replacement of
certain operators.

### changes
- **improve code readability**: Optimizes parts of the code structure in
sfa_v1.py, adds supplementary comments for key code blocks, removes some
unused variables, and improves the naming of certain functions and
variables.

- **resolved a duplicated double write to k_cache**: Fixed redundant
double writes of k_cache in the indexer_select module (in both the
`forward` function and `indexer_select_post_process`), improving
performance to some extent.

- **replace `scatter` ops with `reshape_and_cache`**: This optimization
replaces two separate cache storage operations on `k_nope` and `k_pe`
with a single call to the `reshape_and_cache` operator, improving
performance. The original `scatter` operator involves reordering
slot_mapping for generality, introducing significant scalar
computations. In contrast, the `reshape_and_cache` operator eliminates
this redundant reordering step, thus reducing unnecessary computation
time and enhancing the operator's performance.

### performance comparison
4*A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K
origin:
TTFT: **28s**, TPOT: 26ms, TPS: **820 token/s**

fixed redundant double writes of k_cache:
TTFT: **24s**, TPOT: 26ms, TPS: **840 token/s**

replace scatter ops with reshape_and_cache:
TTFT: **24s**, TPOT: 26ms, TPS: **850 token/s**

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-05 14:27:11 +08:00
Ronald
77e009d9fc [Feature] Add docs of batch invariance and make some extra operators patch (#6910)
### What this PR does / why we need it?

This PR adds documentation for batch invariance and patches some extra
operators according to the validation results.
Please see https://github.com/vllm-project/vllm-ascend/issues/5487 to
track progress.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-05 09:12:40 +08:00
tanhaoan333
f8315f5717 [bugfix]Qwen2.5VL accurate question (#6975)
### What this PR does / why we need it?
The attention mechanism in the ViT model architecture of Qwen2.5VL
consists of two parts and does not support using cache to pass sequence
lengths.
### Does this PR introduce _any_ user-facing change?
remove seq_lens_cache
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-04 22:02:29 +08:00
zhangxinyuehfad
566c367a10 [CI] Add DeepSeek-V3.2 large EP nightly ci (#6378)
### What this PR does / why we need it?

Add DeepSeek-V3.2 nightly ci

Fix PD routing to exclude headless nodes when collecting
prefiller/decoder IPs

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-04 16:15:56 +08:00
Zhujiyang2
c3c265648f [Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (#6939)
### What this PR does / why we need it?
When using a draft model (e.g., in MTP speculative decoding) with shared
expert data parallelism (enabled via flashcomm), a shape mismatch error
occurs in the rotary embedding calculation for models like GLM-4.7. This
is because the positions tensor has an incorrect shape for this specific
configuration.

This PR fixes the issue by adding a check in
AscendRotaryEmbedding.forward_oot. If the model is a draft model and
shared expert DP is enabled, it processes the positions tensor using
torch.ops.vllm.maybe_all_gather_and_maybe_unpad to ensure its shape is
correct before applying the rotary embedding. This resolves the shape
mismatch error.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>
2026-03-04 16:02:08 +08:00
SILONG ZENG
95b44d7b73 [bugfix]fix file not found error in nightly of single-node (#6976)
### What this PR does / why we need it?
1. The **main image build** takes approximately **two hours**. The main
image build time needs to be moved forward to **21:00 (UTC+8)** to ensure
that the nightly image build can use the latest main image.
``` bash
schedule:
   # UTC+8: 8am, 12pm, 16pm, 22pm
   - cron: '0 0,4,8,14 * * *'
```
--->
``` bash
schedule:
   # UTC+8: 8am, 12pm, 16pm, 21pm
   - cron: '0 0,4,8,13 * * *'
```
Link:
https://github.com/vllm-project/vllm-ascend/actions/runs/22632712302/job/65641055135#step:8:26

2. The nightly test is encountering the following error: 
``` bash
ImportError: ascend_transport.so: cannot open shared object file: No such file or directory.
```
The path needs to be added:
``` bash
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
```
Link:
https://github.com/vllm-project/vllm-ascend/actions/runs/22632712302/job/65641054911#step:7:529
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-04 11:47:26 +08:00
zhaomingyu13
52d9086f64 [Bugfix] Fix the acceptance rates drop issue when applying eagle3 to QuaRot model (#6914)
### What this PR does / why we need it?
When using the target model after rotational quantization, the
acceptance rate decreases because the fc weight of the draft model has
not undergone rotational quantization (issue: #6445). We fix this by
performing the same rotational quantization on the fc weight of the
draft model as on the main model when loading the draft model.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-04 11:29:49 +08:00
Li Wang
d431d7d526 [CI] Enable auto upgrade e2e estimated time for auto-partition suites (#6840)
### What this PR does / why we need it?
This patch adds a schedule-triggered workflow to auto-upgrade the e2e
estimated time for better load balance.
1. The workflow will run the full e2e test to get the duration of each
test.
2. The script `update_estimated_time.py` will upgrade the
[config.json](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/scripts/config.yaml)
according to the latest time.
3. The workflow will submit a pull request that includes changes to
`config.json` automatically.
<img width="2484" height="764" alt="image"
src="https://github.com/user-attachments/assets/02f3459c-bb3b-4f8e-9966-8bb2e5c1bbea"
/>


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-04 10:38:34 +08:00
NJX
c7fd7a25f7 [Doc][Misc] Fix msprobe_guide.md documentation issues (#6965)
## What this PR does / why we need it?

Fixes several documentation issues in the msprobe debugging guide as
reported in #6065:

1. **Remove unnecessary `cat` heredoc wrapper**: The example
configuration section used a `cat <<'JSON'` bash wrapper around the JSON
config. Simplified to a plain JSON code block.
2. **Fix duplicate chapter numbering**: Two sections were both numbered
'2'. Renumbered sections sequentially (0-6).
3. **Fix msprobe command**: Changed `msprobe graph_visualize` to
`msprobe -f pytorch graph` in section 5.2 Visualization.
4. **Remove backward-related content**: Since vllm is inference-only (no
training), removed all backward pass references including backward
tensor examples, parameter gradient examples, and backward descriptions
from dump.json explanations.

## Does this PR introduce _any_ user-facing change?

Documentation improvement only. No code changes.

## How was this patch tested?

Manual review of the markdown file to verify all 4 issues from #6065 are
addressed.

Closes #6065
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-04 10:28:31 +08:00
SILONG ZENG
859f2c25b9 [Nightly][Refactor]Migrate nightly single-node model tests from .py to .yaml (#6503)
### What this PR does / why we need it?
This PR refactors the nightly single-node model test by migrating test
configurations from Python scripts to a more maintainable `YAML-based`
format.

| Original PR | Python (`.py`) | YAML (`.yaml`) |
| :--- | :--- | :--- |
| [#3568](https://github.com/vllm-project/vllm-ascend/pull/3568) | `test_deepseek_r1_0528_w8a8_eplb.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#3631](https://github.com/vllm-project/vllm-ascend/pull/3631) | `test_deepseek_r1_0528_w8a8.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#5874](https://github.com/vllm-project/vllm-ascend/pull/5874) | `test_deepseek_r1_w8a8_hbm.py` | `DeepSeek-R1-W8A8-HBM.yaml` |
| [#3908](https://github.com/vllm-project/vllm-ascend/pull/3908) | `test_deepseek_v3_2_w8a8.py` | `DeepSeek-V3.2-W8A8.yaml` |
| [#5682](https://github.com/vllm-project/vllm-ascend/pull/5682) | `test_kimi_k2_thinking.py` | `Kimi-K2-Thinking.yaml` |
| [#4111](https://github.com/vllm-project/vllm-ascend/pull/4111) | `test_mtpx_deepseek_r1_0528_w8a8.py` | `MTPX-DeepSeek-R1-0528-W8A8.yaml` |
| [#3733](https://github.com/vllm-project/vllm-ascend/pull/3733) | `test_prefix_cache_deepseek_r1_0528_w8a8.py` | `Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml` |
| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) | `test_qwen3_235b_w8a8.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#6543](https://github.com/vllm-project/vllm-ascend/pull/6543) | `test_qwen3_235b_a22b_w8a8_eplb.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#3973](https://github.com/vllm-project/vllm-ascend/pull/3973) | `test_qwen3_30b_w8a8.py` | `Qwen3-30B-A3B-W8A8.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) | `test_qwen3_32b_int8.py` | `Qwen3-32B-Int8.yaml` |
| [#3757](https://github.com/vllm-project/vllm-ascend/pull/3757) | `test_qwq_32b.py` | `QwQ-32B.yaml` |
| [#5616](https://github.com/vllm-project/vllm-ascend/pull/5616) | `test_qwen3_next_w8a8.py` | `Qwen3-Next-80B-A3B-Instruct-W8A8.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) | `test_qwen2_5_vl_7b.py` | `Qwen2.5-VL-7B-Instruct.yaml` |
| [#5301](https://github.com/vllm-project/vllm-ascend/pull/5301) | `test_qwen2_5_vl_7b_epd.py` | `Qwen2.5-VL-7B-Instruct-EPD.yaml` |
| [#3707](https://github.com/vllm-project/vllm-ascend/pull/3707) | `test_qwen2_5_vl_32b.py` | `Qwen2.5-VL-32B-Instruct.yaml` |
| [#3676](https://github.com/vllm-project/vllm-ascend/pull/3676) | `test_qwen3_32b_int8_a3_feature_stack3.py` | `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` |
| [#3709](https://github.com/vllm-project/vllm-ascend/pull/3709) | `test_prefix_cache_qwen3_32b_int8.py` | `Prefix-Cache-Qwen3-32B-Int8.yaml` |
| [#5395](https://github.com/vllm-project/vllm-ascend/pull/5395) | `test_qwen3_next.py` | `Qwen3-Next-80B-A3B-Instruct-A2.yaml` |
| [#3474](https://github.com/vllm-project/vllm-ascend/pull/3474) | `test_qwen3_32b.py` | `Qwen3-32B.yaml` |
| [#3541](https://github.com/vllm-project/vllm-ascend/pull/3541) | `test_qwen3_32b_int8.py` | `Qwen3-32B-Int8-A2.yaml` |
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-03 20:13:43 +08:00
Cao Yi
a0a904a3d4 [BugFix] Improve GDN layer detection for multimodal models (#6941)
## Summary
- Enhanced `check_gdn_layer()` function to properly detect GDN layers in
multimodal models
- Added support for checking `text_config.layer_types` in addition to
root-level `layer_types`
- Fixed potential None reference errors when `layer_types` attribute is
missing

## Changes
- Modified `vllm_ascend/utils.py`:
  - Replaced `hasattr()` check with safer `getattr()` approach
  - Added fallback to empty list when `layer_types` is None
- Added secondary check for `text_config.layer_types` to support models
like Qwen-Omni

## Motivation
Previous implementation only checked `layer_types` at the root config
level, which failed to detect GDN layers in multimodal models where this
information is nested under `text_config`. Additionally, it could raise
errors when `layer_types` was None.
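
A minimal sketch of the safer lookup, assuming a HuggingFace-style config object; the string checked here ("linear_attention") and the helper's exact shape in `vllm_ascend/utils.py` are assumptions:

```python
def check_gdn_layer(hf_config) -> bool:
    # Root-level layer_types covers text-only models; fall back to the nested
    # text_config for multimodal models such as Qwen-Omni.  getattr with a
    # default avoids None / missing-attribute errors.
    layer_types = getattr(hf_config, "layer_types", None) or []
    if "linear_attention" in layer_types:
        return True
    text_config = getattr(hf_config, "text_config", None)
    text_layer_types = getattr(text_config, "layer_types", None) or []
    return "linear_attention" in text_layer_types
```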

---

Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
2026-03-03 20:08:39 +08:00
weiguihua2
5b05b3a090 [feat]ds3.2 pcp support mtp and chunkprefill (#6917)
### What this PR does / why we need it?
ds3.2 pcp supports the combination of MTP and chunkprefill features.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-03-03 19:03:50 +08:00
Frank Chen
b771ca9a47 [CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (#6945)
### What this PR does / why we need it?

This PR introduces global CPU slicing for Ascend NPUs to ensure
non-overlapping CPU partitions, addresses IRQ binding logical errors on
A3, and enhances the logic for determining total NPUs in CPU allocation.
These changes are necessary to optimize CPU resource management and
improve system stability.

- **Global CPU Slicing**: Introduced a global CPU slicing mechanism for
Ascend NPUs to ensure non-overlapping CPU partitions across multiple
processes or data parallel groups, preventing resource contention.
- **Improved IRQ Binding for A3 Devices**: Refined the IRQ binding logic
specifically for Ascend A3 devices, correctly mapping logical NPU IDs to
physical card and chip IDs for accurate npu-smi queries and preventing
multi-process overwrite of IRQ settings.
- **Enhanced NPU Count Determination**: Improved the logic for
determining the total number of logical NPUs, prioritizing NPU mapping
information to ensure more accurate CPU allocation.
- **Minimum CPU Requirement**: Established a minimum requirement of 5
CPUs per NPU for binding, reserving specific cores for IRQ, main, ACL,
and release operations to ensure stable operation.
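
A simplified sketch of non-overlapping global slicing, where `total_cpus`, `total_npus`, and `npu_index` are assumed inputs and each NPU must receive at least 5 CPUs as stated above:

```python
from typing import List

# IRQ + main + ACL + release cores, plus at least one worker core.
MIN_CPUS_PER_NPU = 5


def slice_cpus_for_npu(total_cpus: int, total_npus: int, npu_index: int) -> List[int]:
    """Return the CPU ids assigned to one NPU, with no overlap across NPUs."""
    cpus_per_npu = total_cpus // total_npus
    if cpus_per_npu < MIN_CPUS_PER_NPU:
        raise ValueError(
            f"need at least {MIN_CPUS_PER_NPU} CPUs per NPU, got {cpus_per_npu}")
    start = npu_index * cpus_per_npu
    return list(range(start, start + cpus_per_npu))
```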

### Does this PR introduce _any_ user-facing change?

No user-facing changes are introduced.

### How was this patch tested?

CI passed with new added/existing tests.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: c00818886 <chenchuwei@huawei.com>
2026-03-03 17:20:52 +08:00
linfeng-yuan
700423156f [Triton] Centralize Ascend extension op dispatch in triton_utils (#6937)
### What this PR does / why we need it?

This pull request refactors the dispatch mechanism for the
**triton-ascend-specific operators** `insert_slice`, `extract_slice`,
and `get_element` to ensure compatibility with both CANN 8.5 and 9.0.

A unified helper function, `_resolve_triton_ascend_op`, has been
introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function
dynamically resolves these operators by first attempting to import them
from the `triton.language.extra.cann.extension` module, which is present
in newer CANN versions. If that fails, it falls back to the standard
`triton.language` module.

This approach centralizes operator dispatch logic, allowing individual
Triton kernels to use these functions without being aware of the
underlying Triton/CANN version. All call sites have been updated to use
these new unified functions.
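
A sketch of the resolver's shape based on the description above (module paths as named there; the exact error handling in triton_utils may differ):

```python
import importlib


def _resolve_triton_ascend_op(name: str):
    # Newer CANN ships these ops under triton.language.extra.cann.extension;
    # older builds expose them directly on triton.language.  Resolve once so
    # individual kernels stay version-agnostic.
    for module_path in ("triton.language.extra.cann.extension", "triton.language"):
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue
        op = getattr(module, name, None)
        if op is not None:
            return op
    raise AttributeError(f"triton-ascend op {name!r} not found in any known module")


# Usage (per the PR, for the three affected ops):
# insert_slice = _resolve_triton_ascend_op("insert_slice")
```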

### Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring of operator implementations and does
not introduce any user-facing changes.

### How was this patch tested?

CI is expected to pass with existing tests.

**Testing Context:**
- vLLM version: v0.16.0
- vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7`

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-03 17:10:30 +08:00
linfeng-yuan
cb893bcdb0 [csrc][bugfix] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 (#6936)
### What this PR does / why we need it?
Remove hardcoded ASCEND910_95 usage in csrc custom-op host/tiling code
and
select the SoC target at CMake configure time.

- Probe CANN headers with check_cxx_source_compiles:
prefer platform_ascendc::SocVersion::ASCEND950, fallback to
ASCEND910_95.
- Export the selected enum/config string via shared compile definitions
(VLLM_ASCEND_950_SOC_ENUM / VLLM_ASCEND_950_SOC_CONFIG).
- Apply the shared macros to affected paths (moe_gating_top_k,
add_rms_norm_bias) to avoid per-file hardcoding.
- Keep behavior unchanged; this is an internal build-compatibility fix
for CANN 8.5 and 9.x.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-03 17:08:22 +08:00
Shaoxu Cheng
2064afe380 [300I][Bugfix] fix unquant model weight nd2nz error (#6851)
### What this PR does / why we need it?
- This PR fixes an issue with weight format conversion for unquantized
models running on Ascend 310P devices.

- The changes refactor the logic for converting weights to the
FRACTAL_NZ format. Previously, this was handled in a 310P-specific
linear layer implementation (`AscendUnquantizedLinearMethod310`). This
implementation has been removed, and the logic is now centralized in the
`maybe_trans_nz` utility function. This function now checks if the
device is a 310P and applies the NZ format cast accordingly for
`float16`/`bfloat16` weights.

- This refactoring simplifies the code by removing platform-specific
duplication and ensures correct weight handling for unquantized models
on 310P.
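
A hedged sketch of the centralized helper, assuming `torch_npu.npu_format_cast` and a FRACTAL_NZ format id of 29 (mirroring vllm_ascend's utils); the real function and the 310P detection may differ:

```python
import torch

ACL_FORMAT_FRACTAL_NZ = 29  # assumed constant, mirroring vllm_ascend.utils


def maybe_trans_nz(weight: torch.Tensor, is_310p: bool) -> torch.Tensor:
    # Only 310P devices need the NZ cast, and only for fp16/bf16 weights;
    # other devices and dtypes are returned unchanged.
    if is_310p and weight.dtype in (torch.float16, torch.bfloat16):
        import torch_npu  # local import: only available in the Ascend environment
        return torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight
```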

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
ut and local test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-03 15:57:26 +08:00
zzzzwwjj
f19f7b1fe2 [doc] fix supported_models (#6930)
### What this PR does / why we need it?

Add experimental supported models/features to supported_models.md.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-03-03 09:47:50 +08:00
starmountain1997
248d07566f [CI] nightly test timeout (#6912)
### What this PR does / why we need it?

The nightly test is currently failing due to a
[timeout](https://github.com/vllm-project/vllm-ascend/actions/runs/22547280169/job/65326335134).

As noted in #6778, this issue can be resolved by applying this fix.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run nightly test.

Co-authored-by: guozr <guozr1997@hotmail.com>
2026-03-03 09:31:46 +08:00
Xiaoshuang Wang
f7a8befc20 [CI] Upgrade CANN to 8.5.1 (#6897)
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-03-03 09:02:42 +08:00