xc-llm-ascend

Author	SHA1	Message	Date
zhangxinyuehfad	81f3c09d6d	[CI] Change A2 runner (#6557 ) ### What this PR does / why we need it? This PR updates the CI runner from `linux-aarch64-a2-` to `linux-aarch64-a2b3-` in various test configuration files. This change is necessary to adapt to updates in the CI infrastructure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The changes are configuration updates for CI tests. The correctness will be verified by the CI pipeline. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 23:43:57 +08:00
meihanc	922e5c163b	[main2main] upgrade vllm main 0202 (#6560 ) ### What this PR does / why we need it? 1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to https://github.com/vllm-project/vllm/pull/32567 2. Fix ` TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to https://github.com/vllm-project/vllm/pull/33035 3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and AttributeError: 'bool' object has no attribute 'process_weights_after_loading' due to https://github.com/vllm-project/vllm/pull/33284 4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'`due to https://github.com/vllm-project/vllm/pull/32790 5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to https://github.com/vllm-project/vllm/pull/32005 6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'` due to https://github.com/vllm-project/vllm/pull/27492 7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to https://github.com/vllm-project/vllm/pull/33184 8. Add patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to https://github.com/vllm-project/vllm/pull/32863 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 19:31:17 +08:00
ChenCangtao	2c1608265b	[CI][npugraph_ex]Fix npugraph ex e2e test (#6553 ) ### What this PR does / why we need it? When running the Qwen3-0.6B model using the npugraph_ex backend, the last few characters of the generated results changed. We have modified the relevant test cases to ensure the CI runs smoothly. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-02-05 14:03:10 +08:00
Yizhou	2ee4f23f28	[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6475 ) ### What this PR does / why we need it? This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit `5b0a6bcfe9`)" and fixes a check in `model_runner_v1`. A key change is that we remove the strict assertion in the latest commit: it turns out MLA + PIECEWISE slices during computation, so the assertion is unwarranted and only causes false alarms. This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, and copies the updated buffer to the device, which prevents kernel mismatches or failures and ensures correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-02-04 21:11:08 +08:00
starmountain1997	bfcc372f75	[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6499 ) ### What this PR does / why we need it? This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test by adding both short and long prompt test cases: - Short test: Validates basic functionality with minimal input ("Hello ") - Long test: Validates the model can handle prompts near its maximum context length (~163K tokens, approaching the max_position_embeddings limit of 163,840) Additionally, explicitly sets max_model_len=163840 to ensure the test properly exercises the model's full context window capability. ### Does this PR introduce _any_ user-facing change? No. This change only affects internal E2E testing infrastructure. ### How was this patch tested? The modified test case will be executed as part of the E2E test suite and has been validated [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499). - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-04 09:10:50 +08:00
Nengjun Ma	78fad4e348	[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442 ) ### What this PR does / why we need it? Refactor MLP weight prefetch to be consistent with the MoE model's prefetching in terms of code and usage. The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed. Usage is as follows: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-04 09:08:18 +08:00
whx	4d6444d5fd	[Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (#6505 ) ### What this PR does / why we need it? Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has not been tested on bf16 scenario. Results show that this is not currently supported. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-03 18:26:51 +08:00
Feng Liu	03a18ad6fd	[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149 ) ### What this PR does / why we need it? Add E2E for Prefix Caching cp & Chunked Prefill cp ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-02-03 15:04:14 +08:00
LeeWenquan	b1de6cbb31	[Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047 ) ### What this PR does / why we need it? Fix a bug in the repo and add a test case for MTP + Full Decode Only + Qwen3Next. The _build_dummy_attn_metadata function in NPUModelRunner appears to be missing a query_start_loc.copy_to_gpu operation, which leads to a difference between query_start_loc and query_start_loc_cpu; the two are required to be the same in the MTP + Full Decode Only + Qwen3Next case. Before this PR: `self.query_start_loc = [0, 0, 0, 0, ... , 0] self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-02-03 14:26:21 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00
lidenghui1110	79803932e2	[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 ) ### What this PR does / why we need it? As #2947 describes, we need to transpose the kv cache layout after GQA kv transfer when the prefill and decode tensor parallel sizes are heterogeneous. In the previous implementation, we used `npu_paged_cache_load` + `transpose` + `_npu_reshape_and_cache` to do this work. This is inefficient: the ops above need to be called for each layer, which introduces 3 * layer_num kernel launches and 6 * layer_num data movements between L1 Cache and HBM for one request on the decode node. The decode node usually uses graph mode, so these op kernels are called between decode forwards launched by an async thread in the mooncake connector; these kernels may last for several decode forwards, and TTFT increases by 3~4 decode forward times. In this PR, we implement an AscendC fused op `transpose_kv_cache_by_block` that does this with a single kernel launch and moves data between L1 Cache and HBM only once. With this fused op, the time cost of transposing the kv cache layout decreases from 7ms to 0.24ms in UT on 910C, and in the PD disaggregation scenario, TTFT decreases by about 90 ~ 110 ms on qwen3-235B. \| request_num \| original \| fused_op\| \|:----------------------:\|:---------------:\|:-------------------:\| \| 1 \| 643 ms \| 578 ms \| \| 128 \| 1480 ms \| 1368 ms \| ### Does this PR introduce _any_ user-facing change? The fused op is used by default. In case the op has a bug in some scenario, an environment variable is provided as a fallback to disable it. Disable the fused op by setting `export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0` ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-02-03 14:10:01 +08:00
guanguan0308	dffac6db73	[Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402 ) ### What this PR does / why we need it? Add New Output for Expert Token Count An additional output tensor expert_token_nums is added to both operators to meet the requirement of tracking token distribution among experts: Tensor Name: expert_token_nums Dimension: 1D tensor Shape: (local_expert_num,) Data Type: int32 Semantics: Represents the number of tokens actually received by each expert on the current card. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: guanguan0308 <1546542263@qq.com> Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>	2026-02-03 10:41:06 +08:00
starmountain1997	b6256e8bc9	Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241 )" (#6497 ) # What this PR does / why we need it? This PR reverts commit `8134146ab6`, which modified the DeepSeek V3.2 (W8A8) single-node nightly test configuration, as there is no constraint between tp_size and MTP. # Does this PR introduce any user-facing change? No. This PR only affects CI/CD test configurations and does not introduce any user-facing changes. # How was this patch tested? N/A for a revert PR. The changes restore the previously known working configuration. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-03 08:42:58 +08:00
meihanc	c08364f761	[Bugfix] Fix intermittent kv_port conflict with AscendDirectTransport (#6455 ) ### What this PR does / why we need it? When using Mooncake on Ascend NPU, AscendDirectTransport randomly allocates ports within the range `[20000, 20000 + npu_per_node × 1000)`. Reference: [ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475) If `kv_port` overlaps with this range, users may encounter intermittent startup failures: ```bash zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012') RuntimeError: KV Cache sending/receiving thread failed to start. ``` This PR fixes the intermittent kv_port conflict with AscendDirectTransport in `Qwen3-235B-W8A8-EPLB.yaml`, and adds a `kv_port Configuration Guide` section to `pd_disaggregation_mooncake_multi_node.md`. Test results (tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml): https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-02-02 17:31:21 +08:00
LHXuuu	45a573cff1	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	2026-02-02 16:39:32 +08:00
wangxiyuan	eeedf7c503	[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 ) ### What this PR does / why we need it? This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves: - Updating the `VLLM_TAG` in all `Dockerfile`. - Updating the vLLM version in `docs/source/conf.py`. - Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance. - Fix `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to https://github.com/vllm-project/vllm/pull/31972. - Fix `_shared_experts: 'NoneType' object is not callable` due to https://github.com/vllm-project/vllm/pull/32082 by https://github.com/vllm-project/vllm-ascend/pull/6335. - Fix `ReshapeAndCacheOperation setup failed!` due to https://github.com/vllm-project/vllm/pull/25954 by overriding attention metadata slots. This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project. ### Does this PR introduce _any_ user-facing change? No, this is an internal dependency update and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` co-authored-by: shen-shanshan <467638484@qq.com> --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-02 15:57:55 +08:00
starmountain1997	8134146ab6	[CI] fix DS3.2 single node cudagraph_sizes config (#6241 ) # What this PR does / why we need it? This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8) model to ensure CI stability. The changes include: 1. Simplified nightly test matrix (nightly_test_a3.yaml): - Temporarily reduced to only run deepseek3_2-w8a8 test case for debugging - Changed trigger from schedule/workflow_dispatch to support push/pull_request for faster iteration 2. Updated DeepSeek V3.2 test configuration (test_deepseek_v3_2_w8a8.py): - Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32] for better performance - Increased max-num-seqs from 4 to 8 - Increased gpu-memory-utilization from 0.92 to 0.98 - Increased num_speculative_tokens from 2 to 3 3. Added PR checkout step (_e2e_nightly_single_node.yaml): - Added ability to checkout a specific PR (#6241) for testing # Does this PR introduce any user-facing change? No. This PR only affects CI/CD test configurations and does not introduce any user-facing changes. # How was this patch tested? Mock nightly test has passed, see [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241). <img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b" src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc" /> - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-02 11:47:32 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
Li Wang	5b0a6bcfe9	[ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint" (#6459 ) This reverts commit `56f5d3bd49`. ### What this PR does / why we need it? The patch https://github.com/vllm-project/vllm-ascend/pull/6357 breaks functionality in the spec_decode scenario. Let's revert it and make CI happy first. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-31 16:33:34 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
Yizhou	56f5d3bd49	[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357 ) ### What this PR does / why we need it? This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-30 16:41:44 +08:00
ChenCangtao	f2990f7741	[e2e Test][npugraph_ex]add static kernel e2e test case (#6320 ) ### What this PR does / why we need it? Added an E2E test case for the scenario of enabling a static kernel for npugraph_ex, monitoring its compilation and unloading process. Also fixed the previously existing spelling errors - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-30 16:24:48 +08:00
CodeCat	b2857de43f	[ST]Add e2e test for Npugraphex_pass (#6388 ) ### What this PR does / why we need it? We found the custom passes of NPUGraphEX have implemented fusion operator features, which still require E2E test case validation and guard. This PR implements E2E test cases for the AddRMSNormQuant and SplitQKVNormRope operator fusions under NPUGraphEX that are already in the codebase. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-30 09:14:07 +08:00
wjunLu	4970de4242	[CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195 ) ### What this PR does / why we need it? Enable the tests that were skipped due to an outdated driver version: - tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py - tests/e2e/multicard/4-cards/long_sequence/test_basic.py - tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py and some cases in - tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py - tests/e2e/multicard/2-cards/test_external_launcher.py - tests/e2e/multicard/2-cards/test_offline_weight_load.py - tests/e2e/multicard/2-cards/test_quantization.py - tests/e2e/multicard/4-cards/test_data_parallel_tp2.py TODO: - tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py - tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-29 22:41:41 +08:00
Qiu	50e0e87646	[bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344 ) ### What this PR does / why we need it? PR #5672 attempted to remove the -1 padding for duplicate tokens in the decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler slicing approach. However, in the single-ops logic and mixed PD batches, the decode slot_mapping did not eliminate the -1 and also shared the slicing method, resulting in incorrect slot_mapping. This PR resolves this issue, and the logic will be further consolidated in subsequent refactoring PRs. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 16:48:37 +08:00
InSec	86b6ecac4c	[CI][BugFix] Import error fix. (#6293 ) ### What this PR does / why we need it? Fix the import error of qwen3-next nightly test. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: InSec <1790766300@qq.com>	2026-01-28 22:07:47 +08:00
linfeng-yuan	e25ee65729	[Misc][Test] add e2e test for apply_top_k_top_p_custom kernel (#6348 ) ### What this PR does / why we need it? Add an e2e test case for the apply_top_k_top_p_custom kernel and remove Chinese comments. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 17:25:57 +08:00
dsxsteven	325cb16e3f	[BugFix][CI]Fix DeepSeek-R1-W8A8-longseq nightly CI (#6297 ) ### What this PR does / why we need it? The precision issue arose because the kv cache of the p-node had not been fetched for an extended period(>6min) and was forcibly freed. To avoid this problem, the batch size was reduced and the timeout period has also been extended. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: dsxsteven <dsxsteven@sina.com>	2026-01-28 16:36:24 +08:00
wangxiyuan	f8e76a49fa	[CI] Upgrade transformers version (#6307 ) Upgrade transformers to >=4.56.4 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-28 14:06:39 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. 
Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
InSec	595b57c4d4	[CI][BugFix] Qwen3-Next nightly test fix. (#6247 ) ### What this PR does / why we need it? Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the full graph mode. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: InSec <1790766300@qq.com>	2026-01-26 19:53:53 +08:00
huangning1995	ce11fd49f3	[Feature] Batch invariant torch.compile (#6107 ) ### What this PR does / why we need it? Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to enable batch-invariant in vllm-ascend, we observed that the performance of BI in eager mode remains suboptimal. This PR further integrates batch-invariant with torch.compile, which improves output token throughput by roughly 3.5x (429.16 vs. 120.85 tok/s) when tested with Qwen3-0.6B. ### Does this PR introduce _any_ user-facing change? Previously, enabling both aclgraph and Batch-Invariant would cause a "ub overflow" error. This occurred because transposed input tensors could produce incorrect stride() values. To fix this, we now call .contiguous() on the input tensors before passing them to Triton kernels. This ensures a contiguous memory layout and prevents transposed tensors from causing incorrect stride calculations. ### Test Plan pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py ### Test Result ``` ============================================================================ slowest durations ============================================================================ 87.37s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle 77.39s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN 74.04s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail 73.59s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation (8 durations < 0.005s hidden. Use -vv to show these durations.) ================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================ ``` ### Performance export VLLM_BATCH_INVARIANT=1 vllm serve /home/Qwen3-0.6B \ --served-model-name qwen \ --port 8000 \ --max-num-seqs 256 \ --tensor-parallel-size 1 \ --max-model-len 5500 \ --max-num-batched-tokens 5500 \ --reasoning-parser qwen3 \ --gpu-memory-utilization 0.9 \ --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,32]}' \ --additional-config '{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}' vllm bench serve --served-model-name qwen --trust-remote-code --backend vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name random --random-input-len 512 --random-output-len 256 --num-prompts 800 --max-concurrency 8 torch.compile batch invariant performance: ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 477.21 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 1.68 Output token throughput (tok/s): 429.16 Peak output token throughput (tok/s): 472.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 1287.48 ---------------Time to First Token---------------- Mean TTFT (ms): 285.53 Median TTFT (ms): 312.70 P99 TTFT (ms): 324.22 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 17.59 Median TPOT (ms): 17.50 P99 TPOT (ms): 18.44 ---------------Inter-token Latency---------------- Mean ITL (ms): 17.59 Median ITL (ms): 17.45 P99 ITL (ms): 18.76 ================================================== ``` Eager ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 1694.70 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 0.47 Output token throughput (tok/s): 120.85 Peak output token throughput (tok/s): 136.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 362.54 ---------------Time to First Token---------------- Mean TTFT (ms): 164.29 Median TTFT (ms): 129.71 P99 TTFT (ms): 1961.66 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 65.81 Median TPOT (ms): 65.15 P99 TPOT (ms): 72.27 ---------------Inter-token Latency---------------- Mean ITL (ms): 65.81 Median ITL (ms): 64.64 P99 ITL (ms): 75.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: huangning1995 <huangning12@huawei.com>	2026-01-26 09:15:06 +08:00
Li Wang	c38c838d03	[CI] Decrease Qwen3 dense model output throughput baseline to make ci happy (#6233 ) ### What this PR does / why we need it? As https://github.com/vllm-project/vllm-ascend/actions/runs/21327913593/job/61388195448 shows, I encountered two CI failures. Both results consistently pointed to a reduced output throughput, 1600 -> 1514. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:04:13 +08:00
Li Wang	63adbedb7a	[Worker] Implement update max_model_len interface for NPUWorker (#6193 ) ### What this PR does / why we need it? This patch adds the `update_max_model_len` interface. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:03:33 +08:00
Li Wang	ca297eb57f	[CI] Migrate e2e test runner to hk (#5344 ) ### What this PR does / why we need it? This patch adds new runner labels for the HK region and migrates e2e single-card testing to this runner. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:00:51 +08:00
Angazenn	5b746f3e83	[Inductor]change pass to adapt to new addrmsnormBias operator (#6094 ) ### What this PR does / why we need it? #5790 makes addrmsnormBias the default operator when custom ops are enabled. This PR modifies the AddRmsNormQuant pass to align with addrmsnormBias. --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 20:16:44 +08:00
yjmyl	e90b14140b	[feature] add_rms_norm support bias (#5790 ) ### What this PR does / why we need it? This PR replaces addRmsNorm and Add with addRmsNormBias, which is more efficient. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Full Test Pass - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com> Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>	2026-01-23 21:09:54 +08:00
starmountain1997	6c73b88dd6	[CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115 ) ### What this PR does / why we need it? This PR enables FLASHCOMM1 communication optimization with layer sharding for DeepSeek-V3.2 W8A8 model testing to validate PR #5702. The changes include: 1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1 to improve performance for distributed inference 2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"] 3. Update baselines: Adjust performance baselines to reflect the improvements from FLASHCOMM1 and layer sharding ### Does this PR introduce _any_ user-facing change? No. This is a CI/test-only change that enables new communication optimization features for testing purposes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-23 19:48:37 +08:00
jiangyunfan1	bdf65e6bd3	[TEST]Add mooncake common method for tests (#6194 ) ### What this PR does / why we need it? This PR adds a common mooncake method to conftest; we need it to add more test cases later. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running a test - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2026-01-23 17:14:15 +08:00
wjunLu	a3079cd253	[Tests] Skip unstable eagle cases to keep CI success (#6180 ) ### What this PR does / why we need it? The test case `tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance` fails occasionally; its result does not appear to be stable with the `eagle` method, for example: [tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance](https://github.com/vllm-project/vllm-ascend/actions/runs/21249578476/job/61147453980?pr=6151) This PR skips the `eagle` tests to keep CI passing. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-23 15:33:53 +08:00
zhangxinyuehfad	193acc2c19	[CI] Add nightly ci test for deepseek v3.1 (#5386 ) ### What this PR does / why we need it? Add nightly ci test for deepseek v3.1 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 14:36:49 +08:00
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the kv-cache across different cards. With the parameter cp-kv-cache-interleave-size, the first `size` tokens are cached on Card 0, and so on. However, if there are too few tokens, some cards will not store any key-value pairs, resulting in values of 0, corrupted values, and precision issues. Currently, additional operations are introduced to avoid this precision problem. Now that the FIA operator is integrated in mla_cp._forward_decode and CANN is updated to 8.5.0, these additional operations can be removed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Passed all CI with CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
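The interleaved kv-cache placement described in the commit above (#6013) can be illustrated with a small sketch. It assumes that blocks of `interleave_size` tokens are assigned to cards in order and wrap around round-robin; the commit only states that the first `size` tokens go to Card 0 "and so on", so the wrap-around rule and the function names `card_for_token` and `tokens_per_card` are hypothetical and not taken from vllm-ascend.

```python
def card_for_token(token_idx: int, interleave_size: int, num_cards: int) -> int:
    """Return the card index holding a token's KV entry.

    Assumption: consecutive blocks of `interleave_size` tokens are placed
    on cards 0, 1, 2, ... and wrap around after the last card.
    """
    return (token_idx // interleave_size) % num_cards


def tokens_per_card(num_tokens: int, interleave_size: int, num_cards: int) -> list[int]:
    """Count how many tokens each card stores for a sequence of `num_tokens`."""
    counts = [0] * num_cards
    for token_idx in range(num_tokens):
        counts[card_for_token(token_idx, interleave_size, num_cards)] += 1
    return counts


# A short sequence leaves some cards with no KV entries at all,
# which is the situation the commit message describes.
short = tokens_per_card(num_tokens=5, interleave_size=4, num_cards=4)
# A longer sequence fills every card evenly.
full = tokens_per_card(num_tokens=32, interleave_size=4, num_cards=4)
print(short, full)
```

With 5 tokens, an interleave size of 4, and 4 cards, only the first two cards hold any entries under this assumed mapping; the empty cards are the ones that previously needed the extra handling the commit removes.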
wjunLu	f4a361fcc3	[CI] Re-open skipped cases due to PTA upgrading and update the golden results (#6144 ) ### What this PR does / why we need it? Re-open `tests/e2e/singlecard/test_aclgraph_accuracy.py` and update its golden results to match PTA 2.9.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-23 10:46:31 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
meihanc	e54d294df3	[CI]Install clang in Dockerfile for triton ascend (#4409 ) ### What this PR does / why we need it? Install clang in Dockerfile for triton ascend - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-22 19:01:28 +08:00
wjunLu	a7d781f135	[Main] Upgrade PTA to 2.9.0 (#6112 ) ### What this PR does / why we need it? Upgrade PTA to 2.9.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-22 17:59:06 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch aims to optimize the lint check. The main idea is to reduce unnecessary installation time. 1. Installing vllm is not required; appending the path of the vllm source to `PYTHONPATH` is sufficient 2. Installing `requirements-dev.txt` is not required; we have a pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the requirements installed in advance. NOTE: the conditions for triggering image builds are: 1) Daily scheduled build; 2) Build when requirements are modified; 3) Manual build. This ensures that the dependencies in our image are up-to-date to the greatest extent possible. 3. The `mypy` check was separated from the `pre-commit` hook for performance reasons; we found that integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce the CPU core consumption from 16 -> 8 ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was reduced from 20 min per PR to 8 min per PR ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
zhaomingyu13	34fb628248	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#6097 ) According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, actual debugging showed that the tensor parallel size (tp) of the Eagle model was the same as that of the target model. The tp setting for the draft model did not take effect as expected. Note: This feature has not yet been combined and tested with `sp` and `dp`. It will be adapted later. No ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Fixes vllm-project/vllm#31345 ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-22 11:36:23 +08:00
maxmgrdv	ef9d8367f5	[Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (#5143 ) Introduce W4A4 LAOS Quantization for better model compression and inference efficiency on Ascend devices. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-22 10:34:58 +08:00
wangxiyuan	69740039b7	[CI] Upgrade CANN to 8.5.0 (#6070 ) ### What this PR does / why we need it? 1. Upgrade CANN to 8.5.0 2. Move triton-ascend 3.2.0 to requirements. Note: we skipped the two failing e2e tests; see https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail. We'll fix them soon. ### How was this patch tested? Closes: https://github.com/vllm-project/vllm-ascend/issues/5494 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-22 09:29:50 +08:00

510 Commits