xc-llm-ascend

Author	SHA1	Message	Date
GoCHug	80e5812b39	[BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (#6581 ) ### What this PR does / why we need it? Issue: If a model such as Ling-1T adopts partial rotary position embedding (partial RoPE), but config.json uses the rotary_dim parameter instead of partial_rotary_factor, it will trigger a RuntimeError: The expanded size of the tensor (128) must match the existing size (64) at non-singleton dimension 3. This PR addresses an issue where models using partial rotary position embedding (partial RoPE) with the `rotary_dim` parameter in `config.json` (instead of `partial_rotary_factor`) would encounter a `RuntimeError`. This change adds support for the `rotary_dim` parameter in `vllm_ascend/ops/rotary_embedding.py` to correctly calculate the `rope_dim`, resolving the tensor size mismatch error. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The patch was tested successfully with the Ling-1T model, which previously triggered the error. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: GoCHug <93277779+GoCHug@users.noreply.github.com>	2026-02-09 17:17:52 +08:00
lhp-deep	d060c797ed	[fix bug] fix tensor mismatch bug in sigmoid operate test case (#6619 ) ### What this PR does / why we need it? This PR fixes a bug in the `test_triton_fusion_ops` test case. The test compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with a split implementation. Both paths use a recurrent state tensor. The bug was that the state tensor was being modified in-place by the fused kernel call, and this modified tensor was then reused for the split implementation path. This led to an incorrect comparison and test failure. This fix ensures that each path starts with an identical, clean initial state by creating separate tensors. It also changes the state initialization from `torch.randn` to `torch.ones` to make the test deterministic. ### Does this PR introduce _any_ user-facing change? No, this change only affects a test case and has no user-facing impact. ### How was this patch tested? The fix is applied directly to the test case. The CI passing for `test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is working as expected. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-02-09 16:43:27 +08:00
xulei	8325528368	[Kernel]: Optimize DispatchFFNCombine performance (#6468 ) ### What this PR does / why we need it? This PR focuses on performance optimization for the DispatchFFNCombine operator. The key optimizations include: 1. Improving communication efficiency by merging the transmission of tokens and scales; 2. Decoupling multi-core dependencies and reducing waiting bubbles in the combine process through tile-granularity communication; 3. Optimizing the full-card synchronization overhead before the unpermute operation. These optimizations aim to reduce the overall execution latency of the DispatchFFNCombine operator and enhance the runtime performance of the model inference process on Ascend devices. ### Does this PR introduce _any_ user-facing change? No. This PR only involves internal performance optimization of the DispatchFFNCombine operator and does not introduce any changes to user-facing APIs, interfaces, or behaviors. ### How was this patch tested? 1. Enable the DispatchFFNCombine operator by setting the environment variable: ``` export VLLM_ASCEND_ENABLE_FUSED_MC2=1 ``` 2. Run the standard model inference test suite with the above environment variable enabled; 3. Verify the correctness of model outputs (ensuring no functional regression) and measure the performance improvement of the DispatchFFNCombine operator (reduced latency and improved throughput). - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: xulei_ict <xulei292@huawei.com> Co-authored-by: xulei_ict <xulei292@huawei.com>	2026-02-09 16:30:34 +08:00
wangxiyuan	9c6d031797	[MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618 ) Clean up useless env `USE_OPTIMIZED_MODEL` - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-09 15:38:58 +08:00
Canlin Guo	b7aa511daa	[Patch] Remove the patch of MiniCPM (#5975 ) ### What this PR does / why we need it? Part of #5304. Now that https://github.com/vllm-project/vllm/pull/32523 has been merged, we can remove the patch of `MiniCPMAttention`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested locally. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-02-09 14:07:44 +08:00
liziyu	e5f0e0eaf7	[P/D] layerwise connector support recompute scheduler (#5900 ) ### What this PR does / why we need it? Adds recompute scheduler support to the layerwise connector. NOTE: Triggering recompute will invoke the tokenizer again, which may lead to precision fluctuations. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-02-07 15:24:42 +08:00
wangxiyuan	d266fd7b47	[CI] Add workflow support for lint image build (#6489 ) Support specifying a commit hash in the lint image build workflow - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:32:01 +08:00
Zetong Li	4fa7cf6f50	[Bugfix] Fix problematic dummy_run & improper input_batch_size in eagle (#6517 ) ### What this PR does / why we need it? This PR aims to fix problematic dummy_run that will cause excessive npu memory and to fix improper input_batch_size that will degrade running performance. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: Zetong Li <slippersss@126.com> Signed-off-by: lilinsiman <lilinsiman@gmail.com> Co-authored-by: lilinsiman <lilinsiman@gmail.com>	2026-02-07 09:30:10 +08:00
pu-zhe	1cc225711d	[Refactor]310p_e2e test case update (#6539 ) ### What this PR does / why we need it? This pull request significantly enhances the test suite by adding new end-to-end test cases for Qwen3 models on the 310P hardware platform. The primary goal is to ensure the stability and correctness of these models under diverse operational conditions, including various parallelism strategies, data types, and quantization methods. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:28:37 +08:00
lty	c3db1aca2f	[Refactor]refactor p2p connector (#6551 ) ### What this PR does / why we need it? Redundant code is removed, and repeated logic is combined through the p2p connector refactor, making the code easy to extend. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? P (prefill) node: ``` vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host 0.0.0.0 \ --port 8002 \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --seed 1024 \ --served-model-name model \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --gpu-memory-utilization 0.92 \ --quantization ascend \ --async-scheduling \ --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \ --kv-transfer-config \ '{ "kv_connector": "MultiConnector", "kv_role": "kv_producer", "kv_connector_extra_config": { "use_layerwise": false, "connectors": [ { "kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30000", "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 4, "tp_size": 4 } } }, { "kv_connector": "AscendStoreConnector", "kv_role": "kv_producer", "kv_connector_extra_config": { "backend": "mooncake", "mooncake_rpc_port":"0" } } ] } }' ``` D (decode) node: ``` vllm serve /mnt/share/DeepSeek-V3.2-Exp-W8A8 \ --host 0.0.0.0 \ --port 8003 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --enable-expert-parallel \ --seed 1024 \ --served-model-name model \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --gpu-memory-utilization 0.92 \ --quantization ascend \ --async-scheduling \ --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \ --kv-transfer-config \ '{ "kv_connector": "MultiConnector", "kv_role": "kv_consumer", "kv_connector_extra_config": { "use_layerwise": false, "connectors": [ { "kv_connector": "MooncakeConnectorV1", "kv_role": "kv_consumer", "kv_port": "30100", "kv_connector_extra_config": { "use_ascend_direct": true, "prefill": { "dp_size": 2, "tp_size": 8 }, "decode": { "dp_size": 4, "tp_size": 4 } } },{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_consumer", "kv_connector_extra_config": { "backend": "mooncake", "mooncake_rpc_port":"1" } } ] } }' ``` - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-07 09:27:15 +08:00
pu-zhe	4f33e25046	[Refactor]refactor 310p attention impl and add ut (#6579 ) ### What this PR does / why we need it? This pull request significantly refactors the attention mechanism for the Ascend 310P hardware, enhancing its architecture by separating mask generation concerns from the core attention implementation. It introduces a dedicated mask builder class capable of handling various mask types, including causal, splitfuse, and sliding window attention masks, all optimized for the NPU's fractal data format. This change not only cleans up the codebase but also lays the groundwork for more robust and feature-rich attention operations on Ascend devices, backed by new, extensive unit tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3 and qwen3-moe - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:26:26 +08:00
pu-zhe	23524f2ca4	[Refactor]refactor 310p ops and add ut (#6591 ) ### What this PR does / why we need it? This pull request focuses on a significant refactoring effort within the vllm-ascend project, specifically targeting operations optimized for the Ascend 310P hardware. The changes aim to streamline the implementation of core components like quantization and multi-head attention, making the codebase more maintainable and robust. Concurrently, new unit tests have been introduced to ensure the correctness and reliability of these refactored modules. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3-32b w8a8 - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:25:17 +08:00
wangxiyuan	6c49f95da2	[Ops][Refactor] Remove custom rotary_embedding operator (#6523 ) ### What this PR does / why we need it? This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring and does not introduce any user-facing changes. ### How was this patch tested? The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:24:05 +08:00
SILONG ZENG	06aa6036f6	[Lint]Style: Convert `vllm-ascend/` to ruff format(new Batch #8 ) (#6604 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-07 09:16:07 +08:00
wangyu	c63b7a1188	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301 ) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test in CI. - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2026-02-06 17:30:17 +08:00
wangxiyuan	06c0aed124	[CI] Fix broken CI (#6599 ) Revert `4fb3d5e1b2`, which breaks the E2E test - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd`	2026-02-06 17:23:58 +08:00
SILONG ZENG	19b5d44ea8	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #10 ) (#6173 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/ops/layer_shard_linear.py`\| \|`vllm_ascend/ops/linear.py`\| \|`vllm_ascend/ops/linear_op.py`\| \|`vllm_ascend/worker/worker.py`\| \| ` vllm_ascend/patch/worker/patch_bert.py` \| \| ` vllm_ascend/patch/worker/patch_deepseek.py` \| \| ` vllm_ascend/patch/worker/patch_distributed.py` \| \| ` vllm_ascend/patch/worker/patch_module.py` \| \| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next.py` \| \| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` \| \| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` \| \| ` vllm_ascend/patch/worker/patch_rope.py` \| \| ` vllm_ascend/patch/worker/patch_triton.py` \| \| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` \| \| ` vllm_ascend/patch/worker/patch_v2_egale.py` \| \|` vllm_ascend/worker/npu_input_batch.py`\| \|` vllm_ascend/worker/v2/aclgraph_utils.py`\| \|` vllm_ascend/worker/v2/attn_utils.py`\| \|` vllm_ascend/worker/v2/model_runner.py`\| \|` vllm_ascend/worker/v2/sample/gumbel.py`\| \|` vllm_ascend/worker/v2/sample/penalties.py`\| \|` vllm_ascend/worker/v2/sample/sampler.py`\| \|` vllm_ascend/worker/v2/spec_decode/__init__.py`\| \|` vllm_ascend/worker/v2/spec_decode/eagle.py`\| \|` vllm_ascend/worker/v2/states.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 15:35:06 +08:00
SILONG ZENG	65b7f716e6	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #11 ) (#6176 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/ops/fused_moe/comm_utils.py` \| \| `vllm_ascend/ops/fused_moe/experts_selector.py` \| \| `vllm_ascend/ops/fused_moe/fused_moe.py` \| \| `vllm_ascend/ops/fused_moe/moe_comm_method.py` \| \| `vllm_ascend/ops/fused_moe/moe_mlp.py` \| \| `vllm_ascend/ops/fused_moe/prepare_finalize.py` \| \| `vllm_ascend/ops/fused_moe/token_dispatcher.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-02-06 15:28:49 +08:00
SILONG ZENG	4fb3d5e1b2	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #8 ) (#6129 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-02-06 15:25:08 +08:00
SILONG ZENG	99aedaff63	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #7 ) (#6023 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|` vllm_ascend/quantization/compressed_tensors/compressed_tensors.py`\| \|` vllm_ascend/quantization/quant_config.py`\| \|` vllm_ascend/quantization/utils.py`\| \|` vllm_ascend/quantization/w4a16.py`\| \|` vllm_ascend/quantization/w4a4_flatquant_dynamic.py`\| \|` vllm_ascend/quantization/w4a8_dynamic.py`\| \|` vllm_ascend/quantization/w8a16.py`\| \|` vllm_ascend/quantization/w8a8.py`\| \|` vllm_ascend/quantization/w8a8_dynamic.py`\| \|` vllm_ascend/quantization/w8a8_pdmix.py`\| \|` vllm_ascend/quantization/w8a8mxfp8.py`\| \|` vllm_ascend/sample/rejection_sampler.py`\| \|` vllm_ascend/sample/sampler.py`\| \|` vllm_ascend/worker/block_table.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-06 14:56:53 +08:00
wangxiyuan	d0bc16859c	[CI][Misc] Some improvement for github action (#6587 ) ### What this PR does / why we need it? - This PR removes several self-hosted runner labels from the `actionlint.yaml` configuration file. These runners are likely no longer in use, so this change cleans up the configuration and ensures `actionlint` has an accurate list of available runners. - Move all Action dockerfiles to one folder - remove useless `runner` input for e2e test. - update workflow option version ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a configuration change for the CI linter. The correctness will be verified by `actionlint` running in CI on subsequent pull requests. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 14:06:27 +08:00
Li Wang	d018aeb5fa	[Image] Bump mooncake version to v0.3.8.post1 (#6428 ) ### What this PR does / why we need it? This patch bumps the mooncake version to the latest [release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested locally: >>> from mooncake.engine import TransferEngine - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-02-06 10:54:03 +08:00
pu-zhe	85e33941e8	[Feat.]: 310p support MOE models (#6530 ) ### What this PR does / why we need it? This pull request integrates comprehensive support for Mixture of Experts (MoE) models on the Ascend 310P device within the vllm-ascend framework. It achieves this by introducing specialized modules for expert selection, fused MoE layers, and optimized all-gather communication. The changes also refine existing NPU operations, making them more consistent and efficient for 310P, ultimately enhancing the performance and compatibility of MoE models on this hardware. Highlights 310P MoE Support: Introduces dedicated implementations for Mixture of Experts (MoE) models on Ascend 310P devices, including new modules for expert selection, fused MoE layers, and communication. All-Gather Communication: Enforces the use of ALLGATHER communication for MoE operations on 310P, optimizing data transfer and leveraging NPU-specific token dispatching. Simplified NPU Operations: Removes conditional type casting for npu_swiglu and enables custom rotary embedding kernels unconditionally, suggesting improved native support for 310P. New MoE Classes Registered: Registers AscendFusedMoE310 and AscendSharedFusedMoE310 to integrate 310P-specific MoE layers into the system's custom operation registry. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? offline test and server test, with qwen3-30b-a3b,tp/ep 4 on 310p - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-06 10:30:56 +08:00
wangxiyuan	c38166eefa	[Doc] backport 0.13.0 release note (#6584 ) ### What this PR does / why we need it? Backport 0.13.0 release note to main branch and update related doc link ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? by doc CI - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-06 10:29:15 +08:00
Nengjun Ma	11339eb48a	[CI] Update UT CANN version to 8.5.0 for main branch (#6564 ) ### What this PR does / why we need it? Update UT CANN version to 8.5.0 ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-06 10:28:42 +08:00
zhangxinyuehfad	81f3c09d6d	[CI] Change A2 runner (#6557 ) ### What this PR does / why we need it? This PR updates the CI runner from `linux-aarch64-a2-` to `linux-aarch64-a2b3-` in various test configuration files. This change is necessary to adapt to updates in the CI infrastructure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The changes are configuration updates for CI tests. The correctness will be verified by the CI pipeline. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 23:43:57 +08:00
Ruowei Zheng	8e66299bf1	[Bugfix] Fix the incorrect use of the output parameter in _forward_fia_slidingwindow (#6469 ) ### What this PR does / why we need it? Fix the incorrect use of the `output` parameter in `_forward_fia_slidingwindow`: ``` # Original (incorrect) output, _ = torch_npu.npu_fused_infer_attention_score(...) output = output.view(batch_size, self.num_heads, self.head_size) ``` In the original code, the `output` parameter was directly assigned a new value, which is inconsistent with the interface definition, so the caller's `output` buffer could not be updated by the call. ``` # Fixed attn_output, _ = torch_npu.npu_fused_infer_attention_score(...) attn_output = attn_output.view(batch_size, self.num_heads, self.head_size) output[:batch_size] = attn_output[:batch_size] ``` ### Does this PR introduce _any_ user-facing change? No change. Co-authored-by: GoCHug<gch59135228@163.com> ### How was this patch tested? vLLM ascend version: v0.13.0rc1 Signed-off-by: acat-rw <892882856@qq.com>	2026-02-05 20:58:54 +08:00
meihanc	922e5c163b	[main2main] upgrade vllm main 0202 (#6560 ) ### What this PR does / why we need it? 1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to https://github.com/vllm-project/vllm/pull/32567 2. Fix ` TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to https://github.com/vllm-project/vllm/pull/33035 3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and AttributeError: 'bool' object has no attribute 'process_weights_after_loading' due to https://github.com/vllm-project/vllm/pull/33284 4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'`due to https://github.com/vllm-project/vllm/pull/32790 5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to https://github.com/vllm-project/vllm/pull/32005 6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'` due to https://github.com/vllm-project/vllm/pull/27492 7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to https://github.com/vllm-project/vllm/pull/33184 8. Add patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to https://github.com/vllm-project/vllm/pull/32863 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 19:31:17 +08:00
ChenCangtao	2c1608265b	[CI][npugraph_ex]Fix npugraph ex e2e test (#6553 ) ### What this PR does / why we need it? When running the Qwen3-0.6B model using the npugraph_ex backend, the last few characters of the generated results changed. We have modified the relevant test cases to ensure the CI runs smoothly. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-02-05 14:03:10 +08:00
lty	33b8ca4e96	[Feature]KV pool supports sparse attention (#6339 ) ### What this PR does / why we need it? The kv pooling feature is adapted to Sparse Attention to support models such as Deepseek V3.2. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? ``` vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host $local_ip \ --port 8002 \ --served-model-name model \ --data-parallel-size 1 \ --tensor-parallel-size 8 \ --prefill-context-parallel-size 2 \ --decode-context-parallel-size 1 \ --cp-kv-cache-interleave-size 128 \ --block-size 128 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --no-enable-chunked-prefill \ --max-num-seqs 4 \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.95 \ --trust-remote-code \ --enforce-eager \ --quantization ascend \ --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \ --kv-transfer-config \ '{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "backend": "mooncake", "lookup_rpc_port":"0", "use_layerwise": false } }' ``` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-05 10:36:52 +08:00
Wang Kunpeng	13c4a9c78b	[bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (#6491 ) ### What this PR does / why we need it? This PR fixes an accuracy issue that occurs when using Prefill/Decode Context Parallelism (PCP/DCP) in conjunction with speculative decoding (MTP). The issue is caused by an irregular attention mask shape when both features are enabled. The fix involves flattening the `block_table` for speculative decoding requests under PCP/DCP to ensure a regular attention mask. This PR also introduces a `use_cp` property for cleaner code and updates dummy runs to handle this scenario correctly. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix that improves accuracy and should not have user-facing API changes. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-02-05 10:06:14 +08:00
Zhijun Chen	0ead5e8681	perf: adaptive block size selection in linear_persistent kernel (#6537 ) ### What this PR does / why we need it? Optimization: Replaces fixed block sizes (128x128x128) in `linear_persistent_kernel` with adaptive selection logic that considers: - Matrix dimensions (M, N, K) - Device NPU vector core count - Data type (float32 vs others) Why: Fixed block sizes lead to suboptimal hardware utilization across different matrix shapes. Adaptive sizing maximizes occupancy and memory efficiency for varied workload patterns, improving throughput for batch-invariant linear operations in LLM inference. Details: - Small matrices (M < 256): Size-proportional allocation - Medium matrices (256 ≤ M < 1024): Balanced distribution based on grid capacity - Large matrices (M ≥ 1024): Optimized for dominant dimension ### Does this PR introduce _any_ user-facing change? No. This is a performance optimization. The API and numerical results remain unchanged; only kernel execution efficiency improves. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: DDCHY <843049740@qq.com> Signed-off-by: zjchenn <zjchenn@gmail.com> Co-authored-by: DDCHY <843049740@qq.com>	2026-02-04 21:36:26 +08:00
Yizhou	2ee4f23f28	[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6475 ) ### What this PR does / why we need it? This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit `5b0a6bcfe9`)" and fixes a check in `model_runner_v1`. A key change is that we remove the strict assertion in the latest commit, as it turns out MLA + PIECEWISE will slice during computing, which makes our assertion uncalled for and only causes false alarms. This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, and copies the updated buffer to the device, which prevents kernel mismatches or failures and ensures correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn't seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-02-04 21:11:08 +08:00
DreamerLeader	2dac18afea	[Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (#6126 ) ### What this PR does / why we need it? Fix of Pooling Code and Update of Pooling Usage Guide ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? PR [[Bugfix]Fixed precision issues caused by pooled request pooling](https://github.com/vllm-project/vllm-ascend/pull/6049) is ready for review. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Signed-off-by: fangjianwei <f30058701@china.huawei.com> Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-02-04 16:35:41 +08:00
Zhang-Bryan	804a9ec4e6	[Fusion] Add rmsnorm dynamic quant fusion pass (#6274 ) ### What this PR does / why we need it? This PR introduces four new patterns to support the fusion of RMSNorm and DynamicQuant operators. After replacing the fusion operators, the execution time has been reduced from 22.8us to 16.9us. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d7de043d55` Signed-off-by: Bryan <250470359+Zhang-Bryan@users.noreply.github.com>	2026-02-04 15:53:53 +08:00
IWantFight	e7a13beedb	[Bugfix] Synchronize only the current stream to avoid device sync (#6432 ) ### What this PR does / why we need it? Following [PR #4233](https://github.com/vllm-project/vllm-ascend/pull/4233), a synchronization mechanism was introduced between steps in asynchronous scheduling with ACL Graph to address a hanging issue. However, full device-level synchronization is unnecessary—only the operations on the current stream need to be synchronized. Otherwise, if other background operations (such as send and recv) are running concurrently, they may negatively impact inference performance for the instance. hang problem ![c4bbfac9a9088acec0ad335b4c2af437](https://github.com/user-attachments/assets/b7c8c612-4d45-48ec-9465-954869f9643d) Synchronizing only the current stream can also resolve the hang issue. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: For_YL <zhangtangwei@huawei.com> Co-authored-by: For_YL <zhangtangwei@huawei.com>	2026-02-04 10:59:45 +08:00
starmountain1997	bfcc372f75	[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6499 ) ### What this PR does / why we need it? This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test by adding both short and long prompt test cases: - Short test: Validates basic functionality with minimal input ("Hello ") - Long test: Validates the model can handle prompts near its maximum context length (~163K tokens, approaching the max_position_embeddings limit of 163,840) Additionally, explicitly sets max_model_len=163840 to ensure the test properly exercises the model's full context window capability. ### Does this PR introduce _any_ user-facing change? No. This change only affects internal E2E testing infrastructure. ### How was this patch tested? The modified test case will be executed as part of the E2E test suite and has been validated [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499). - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-04 09:10:50 +08:00
Nengjun Ma	78fad4e348	[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442 ) ### What this PR does / why we need it? Refactor MLP weight prefetch for consistency with the MoE model's prefetching in terms of code and usage. The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is now as follows: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-04 09:08:18 +08:00
ChenCangtao	fa56abea9f	[bugfix][npugraph_ex]duplicate pattern issue (#6513 ) ### What this PR does / why we need it? When the draft model also uses vllmbackend for graph compilation, the fusion pass registration occurs again, resulting in errors due to duplicate patterns. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-02-04 08:49:13 +08:00
ChenCangtao	7b3921c498	[bugfix][npugraph_ex]add the extra check for allreduce rmsnorm fusion pass (#6430 ) ### What this PR does / why we need it? Allreduce rmsnorm fusion pass has an additional check condition, which requires fusion of the Fx graph only when the start of compile_range is greater than 512. We previously overlooked this check. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-02-04 08:48:28 +08:00
dsxsteven	a80e524fbc	[Quant] GLM4.7-Flash Support W8A8 (#6492 ) ### What this PR does / why we need it? support W8A8 quant for model GLM4.7-flash ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: dsxsteven <dsxsteven@sina.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2026-02-03 19:49:58 +08:00
whx	4d6444d5fd	[Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (#6505 ) ### What this PR does / why we need it? Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has not been tested on bf16 scenario. Results show that this is not currently supported. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-03 18:26:51 +08:00
SILONG ZENG	b804eb12f6	[CI]Nightly test use `main` (#6502 ) ### What this PR does / why we need it? Nightly test use `main` - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-03 15:40:59 +08:00
zhangyiming	41d48cb974	[CI] Update doctest from 0.9.1 to 0.13.0, and copy doc test workflow to nightly CI for better monitor. (#6452 ) ### What this PR does / why we need it? [CI] Update doctest from 0.9.1 to 0.13.0, and copy the doc test workflow to nightly CI for better monitoring. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: menogrey <1299267905@qq.com>	2026-02-03 15:19:03 +08:00
Feng Liu	03a18ad6fd	[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149 ) ### What this PR does / why we need it? Add E2E for Prefix Caching cp & Chunked Prefill cp ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-02-03 15:04:14 +08:00
zhangguinan	be5b66de6d	[Doc] Contributing a Benchmark Tutorial for Suffix Speculative Decoding (#6323 ) ### What this PR does / why we need it? Suffix Decoding is a CPU-based speculative decoding optimization that accelerates inference by pattern matching and frequency-based prediction from both prompts and generated content. This document provides a step-by-step guide for deploying and evaluating Suffix Speculative Decoding on the Ascend platform. By analyzing performance gains across diverse datasets, it demonstrates the significant advantages of this technology in inference acceleration. Our goal is to empower developers to achieve high-efficiency model optimization using Ascend hardware. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: zhangmuzhibangde <1037640609@qq.com>	2026-02-03 14:52:38 +08:00
LeeWenquan	b1de6cbb31	[Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047 ) ### What this PR does / why we need it? Fix a bug in the repo and add a test case for MTP + Full Decode Only + Qwen3Next. The _build_dummy_attn_metadata function in NPUModelRunner appears to be missing a query_start_loc.copy_to_gpu operation, which leads to a difference between query_start_loc and query_start_loc_cpu; the two are required to be the same in the MTP + Full Decode Only + Qwen3Next case. Before this PR: `self.query_start_loc = [0, 0, 0, 0, ... , 0] self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-02-03 14:26:21 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00
lidenghui1110	79803932e2	[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 ) ### What this PR does / why we need it? As #2947 describes, we need to transpose the KV cache layout after GQA KV transfer when the prefill and decode tensor parallel sizes are heterogeneous. The previous implementation used `npu_paged_cache_load` + `transpose` + `_npu_reshape_and_cache` for this. That is inefficient: the ops must be called for each layer, which introduces 3 * layer_num kernel launches and 6 * layer_num data movements between L1 Cache and HBM for one request on the decode node. The decode node usually uses graph mode, so these op kernels, launched by an async thread in the Mooncake connector, run between decode forward passes; they may last for several decode forward passes, and TTFT increases by 3~4 decode forward times. In this PR, we implement an AscendC fused op `transpose_kv_cache_by_block` that does this with a single kernel launch and moves data between L1 Cache and HBM only once. With this fused op, the time cost of transposing the KV cache layout decreases from 7ms to 0.24ms in UT on 910C, and in the PD disaggregation scenario, TTFT decreases by about 90 ~ 110 ms on qwen3-235B. \| request_num \| original \| fused_op\| \|:----------------------:\|:---------------:\|:-------------------:\| \| 1 \| 643 ms \| 578 ms \| \| 128 \| 1480 ms \| 1368 ms \| ### Does this PR introduce _any_ user-facing change? The fused op is used by default. In case the op has a bug in some scenario, an environment variable provides a fallback to disable it. Disable the fused op by setting `export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0` ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-02-03 14:10:01 +08:00
SILONG ZENG	f4a72f0d16	[CI]Disable early exit to complete all tests (#6482 ) ### What this PR does / why we need it? 1. Disable the feature to exit early upon encountering an error in order to complete all tests. 2. Within each partition, tests are re-sorted by `estimated_time` in ascending order. This allows the CI to cover as many test cases as possible in the early stages. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-03 11:25:51 +08:00

2357 Commits