xc-llm-ascend

Author	SHA1	Message	Date
ChenCangtao	f2990f7741	[e2e Test][npugraph_ex]add static kernel e2e test case (#6320 ) ### What this PR does / why we need it? Added an E2E test case for the scenario of enabling a static kernel for npugraph_ex, monitoring its compilation and unloading process. Also fixed the previously existing spelling errors - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-30 16:24:48 +08:00
Li Wang	8969b94a14	[Nightly] Correct nightly image build ref (#6420 ) ### What this PR does / why we need it? The underlying tags for nightly image builds have been corrected, and some useless and confusing workflow fields have been removed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-30 15:55:58 +08:00
liziyu	d252e4f5ec	[P/D] Using the cache load operator to replace the index select operator. (#6295 ) ### What this PR does / why we need it? Using the cache load operator to replace the index select operator. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-30 14:27:53 +08:00
Wang Kunpeng	70cc5f7969	[bugfix]fix rope_forward_triton error (#6404 ) ### What this PR does / why we need it? The rope_forward_triton method reports an error. For example: ``` (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True) (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/triton/rope.py", line 155, in rope_forward_triton (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] cos = cos.view(num_tokens, -1) (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] RuntimeError: shape '[14, -1]' is invalid for input of size 768 ``` This is because an incorrect num_tokens_padded was passed in. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-01-30 14:09:00 +08:00
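The `RuntimeError` in the log above comes down to divisibility: `view(num_tokens, -1)` only works when the element count is a multiple of `num_tokens`. The following is a minimal sketch of that check in plain Python; `inferred_dim` is a hypothetical helper, not part of vllm-ascend, and the token count 12 is an illustrative value rather than the real batch size.

```python
def inferred_dim(numel: int, num_tokens: int) -> int:
    """Return the size that -1 would resolve to in view(num_tokens, -1).

    Hypothetical helper: it mirrors the divisibility rule only and does not
    touch any real tensor.
    """
    if num_tokens <= 0 or numel % num_tokens != 0:
        raise ValueError(
            f"shape '[{num_tokens}, -1]' is invalid for input of size {numel}"
        )
    return numel // num_tokens


# 768 elements divide evenly by an illustrative 12 tokens ...
print(inferred_dim(768, 12))

# ... but not by the padded count of 14 from the log (768 % 14 == 12).
try:
    inferred_dim(768, 14)
except ValueError as exc:
    print(exc)
```

This is why passing a padded token count where the unpadded one is expected breaks the reshape even though both values look plausible.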
ChenCangtao	46cee945b3	[doc][npugraph_ex]add npugraph_ex introduction doc (#6306 ) ### What this PR does / why we need it? As part of the preparation work for the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214), this PR adds documentation for npugraph_ex that introduces its usage and FX graph optimization. The FX graph optimization section also explains the default passes, how to implement custom fusion passes, and how to capture the FX graph during optimization by configuring environment variables. --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-30 11:21:37 +08:00
zhangxinyuehfad	1d661bb279	[Bugfix] Specify tensorflow version in accuracy test to avoid segmentation fault (#6292 ) ### What this PR does / why we need it? Specify tensorflow version in accuracy test to avoid segmentation fault - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-30 09:28:24 +08:00
CodeCat	b2857de43f	[ST]Add e2e test for Npugraphex_pass (#6388 ) ### What this PR does / why we need it? We found the custom passes of NPUGraphEX have implemented fusion operator features, which still require E2E test case validation and guard. This PR implements E2E test cases for the AddRMSNormQuant and SplitQKVNormRope operator fusions under NPUGraphEX that are already in the codebase. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-30 09:14:07 +08:00
wjunLu	4970de4242	[CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195 ) ### What this PR does / why we need it? Enable the tests that were skipped due to an outdated driver version: - tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py - tests/e2e/multicard/4-cards/long_sequence/test_basic.py - tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py and some cases in - tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py - tests/e2e/multicard/2-cards/test_external_launcher.py - tests/e2e/multicard/2-cards/test_offline_weight_load.py - tests/e2e/multicard/2-cards/test_quantization.py - tests/e2e/multicard/4-cards/test_data_parallel_tp2.py TODO: - tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py - tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-29 22:41:41 +08:00
Li Wang	e35f304419	[CI] Auto partition for test cases (#6379 ) ### What this PR does / why we need it? This patch add auto-partition feat for tests, for example, before this pr, we are running e2e single card test for 2h40min, after the auto partition, test case is automatically allocated into the required n parts based on its test duration (greedy strategy) and run in parallel. The advantage of doing this is that our overall test duration will become 1/n of the original. ### Does this PR introduce _any_ user-facing change? Before: e2e single card test spend 2h40min After: e2e single card test spend 1h13min ### How was this patch tested? ```shell python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 0 args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=0, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600) +----------------+--------------------+ \| Suite \| Partition \| \|----------------+--------------------\| \| e2e-singlecard \| 1/2 (0-based id=0) \| +----------------+--------------------+ ✅ Enabled 13 test(s) (est total 4020.0s): - tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py (est_time=1800) - tests/e2e/singlecard/test_aclgraph_accuracy.py (est_time=480) - tests/e2e/singlecard/test_guided_decoding.py (est_time=354) - tests/e2e/singlecard/test_batch_invariant.py (est_time=320) - tests/e2e/singlecard/pooling/test_embedding.py (est_time=270) - tests/e2e/singlecard/test_quantization.py (est_time=200) - tests/e2e/singlecard/test_llama32_lora.py (est_time=162) - tests/e2e/singlecard/test_cpu_offloading.py (est_time=132) - tests/e2e/singlecard/pooling/test_classification.py (est_time=120) - tests/e2e/singlecard/test_camem.py (est_time=77) - tests/e2e/singlecard/compile/test_norm_quant_fusion.py (est_time=70) - tests/e2e/singlecard/test_auto_fit_max_mode_len.py (est_time=25) - 
tests/e2e/singlecard/test_profile_execute_duration.py (est_time=10) (base) wangli@Mac-mini vllm-ascend % python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 1 args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=1, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600) +----------------+--------------------+ \| Suite \| Partition \| \|----------------+--------------------\| \| e2e-singlecard \| 2/2 (0-based id=1) \| +----------------+--------------------+ ✅ Enabled 13 test(s) (est total 4025.0s): - tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py (est_time=1500) - tests/e2e/singlecard/pooling/test_scoring.py (est_time=500) - tests/e2e/singlecard/test_aclgraph_batch_invariant.py (est_time=410) - tests/e2e/singlecard/test_vlm.py (est_time=354) - tests/e2e/singlecard/test_models.py (est_time=300) - tests/e2e/singlecard/test_multistream_overlap_shared_expert.py (est_time=200) - tests/e2e/singlecard/test_sampler.py (est_time=200) - tests/e2e/singlecard/test_async_scheduling.py (est_time=150) - tests/e2e/singlecard/test_aclgraph_mem.py (est_time=130) - tests/e2e/singlecard/test_ilama_lora.py (est_time=95) - tests/e2e/singlecard/test_completion_with_prompt_embeds.py (est_time=76) - tests/e2e/singlecard/test_qwen3_multi_loras.py (est_time=65) - tests/e2e/singlecard/test_xlite.py (est_time=45) ``` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-29 20:28:10 +08:00
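The greedy allocation described in the commit above can be sketched as follows. This is an assumption about how such a partitioner might work (longest test first, always into the currently lightest partition), not the code in `run_suite.py`; the function name `auto_partition` and the sample durations are hypothetical.

```python
import heapq


def auto_partition(tests: dict[str, float], size: int) -> list[list[str]]:
    """Split tests into `size` groups with roughly equal estimated time.

    Greedy rule (assumed): take tests from longest to shortest and put each
    one into the partition with the smallest accumulated time so far.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    parts: list[list[str]] = [[] for _ in range(size)]
    # Min-heap of (accumulated_time, partition_index).
    heap = [(0.0, i) for i in range(size)]
    for name, est in sorted(tests.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        parts[idx].append(name)
        heapq.heappush(heap, (load + est, idx))
    return parts


# Hypothetical durations in seconds.
sample = {"a.py": 5, "b.py": 4, "c.py": 3, "d.py": 2}
print(auto_partition(sample, 2))
```

With the sample input both partitions end up with 7 seconds of estimated work, which is the balance the partition totals in the log (4020.0s and 4025.0s) suggest the real script aims for.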
zxr2333	14bd55f30c	[P/D][BugFix] Fix layerwise P/D request_id error (#6360 ) ### What this PR does / why we need it? Fix layerwise Connector P/D request_id error, due to vllm pr: https://github.com/vllm-project/vllm/pull/27987, which will add a random suffix to request_id in EngineCore. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-01-29 20:19:05 +08:00
Qiu	feab047084	[bugfix](pcp,gqa) set kv_inverse_idx_for_chunk and cp_kv_recover_idx_for_chunk to None when dcp only (#6317 ) ### What this PR does / why we need it? We only do restore and recover for pcp, so we should set `kv_inverse_idx_for_chunk` and `cp_kv_recover_idx_for_chunk` to `None` when only using dcp. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 19:35:52 +08:00
Qiu	50e0e87646	[bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344 ) ### What this PR does / why we need it? PR #5672 attempted to remove the -1 padding for duplicate tokens in the decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler slicing approach. However, in the single-ops logic and mixed PD batches, the decode slot_mapping did not eliminate the -1 and also shared the slicing method, resulting in incorrect slot_mapping. This PR resolves this issue, and the logic will be further consolidated in subsequent refactoring PRs. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 16:48:37 +08:00
Sergey-Zlobin	6a7b3bc29c	Qwen3-VL-MoE EAGLE support for vLLM-Ascend (#6327 ) ### What this PR does / why we need it? Qwen3-VL-MoE EAGLE support for vLLM-Ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The patch was tested with the Qwen3-VL-30B-A3B-Instruct model - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Sergey_Zlobin <sirg_zlobin@mail.ru>	2026-01-29 16:44:30 +08:00
JiangWeixiang	41a52beb26	[bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325 ) ### What this PR does / why we need it? This PR fixes a critical bug in the PD-separated inference pipeline where KV cache on the Prefill (P) side was not being properly released. The issue arises when multiple clients use the same x-request-id: to avoid request ID collisions, both Prefill and Decode nodes append a random suffix to the incoming x-request-id. A previous PR ensured consistency by having the P-side pass its final request_id as remote_request_id to the D-side via kv_transfer_param. However, during KV cache cleanup, the D-side incorrectly used the local req_id (instead of remote_request_id) to select the target P-side rank. This mismatch caused the P-side KV cache to remain unreleased on certain ranks, leading to memory leaks. This PR corrects the logic to use remote_request_id consistently when determining the P-side rank. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The fix was validated by running multiple concurrent benchmark instances - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: ghphotoframe <854746559@qq.com>	2026-01-29 16:05:56 +08:00
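The mismatch described in the commit above can be illustrated with a small sketch. The commit does not say how a request ID maps to a P-side rank, so the CRC-based mapping, the function names, and the sample IDs below are all assumptions made for illustration only.

```python
import zlib


def select_p_rank(request_id: str, num_p_ranks: int) -> int:
    """Map a request ID to a P-side rank (assumed mapping: CRC32 modulo ranks)."""
    return zlib.crc32(request_id.encode("utf-8")) % num_p_ranks


def rank_to_release(local_req_id: str, remote_request_id: str, num_p_ranks: int) -> int:
    """Pick the P-side rank whose KV cache should be released.

    The fix described above: use the ID the P-side knows (remote_request_id),
    not the D-side's local ID, which carries a different random suffix.
    """
    del local_req_id  # deliberately unused; using it was the bug
    return select_p_rank(remote_request_id, num_p_ranks)


# Hypothetical IDs: same x-request-id, different random suffixes per side.
p_side_id = "req-42-a1b2"
d_side_id = "req-42-9f8e"
print(rank_to_release(d_side_id, p_side_id, 8) == select_p_rank(p_side_id, 8))
```

Whatever the real mapping is, it only agrees on both sides if both sides feed it the same ID, which is the point of using `remote_request_id` consistently.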
Nengjun Ma	597091be9f	[Doc] Reranker guide remove deprecated task option (#6385 ) ### What this PR does / why we need it? Reranker guide remove deprecated task option. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-29 16:00:26 +08:00
wangxiyuan	7a5b345dc4	[Misc] Drop deepseek patch (#6288 ) We previously patched deepseek because transformers raised an AssertionError. After the transformers upgrade the patch is no longer needed, so remove it. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-29 14:45:50 +08:00
whx	39f8af9d96	[Main2Main][BugFix] Add shared_experts check for AscendSharedFusedMoE (#6335 ) ### What this PR does / why we need it? PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes Qwen3-Moe models also go into `SharedFusedMoE`, while current implementation of our `AscendSharedFusedMoE` assumes shared_experts always exist. This PR adds checking to `multistream_overlap_shared_expert` and `multistream_overlap_gate` in order to only enable these features when shared experts exist. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All ci passed - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-01-29 08:47:20 +08:00
Li Wang	f0ff2cc22d	[CI] hot fix for nightly image build tag (#6367 ) ### What this PR does / why we need it? The base image of `releases/v0.13.0` should be tagged as `releases/v0.13.0-**` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-28 23:29:50 +08:00
InSec	86b6ecac4c	[CI][BugFix] Import error fix. (#6293 ) ### What this PR does / why we need it? Fix the import error of qwen3-next nightly test. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: InSec <1790766300@qq.com>	2026-01-28 22:07:47 +08:00
hucong	df588ed488	[BugFix] Disable enable_shared_expert_dp by default if tensor_parallel_size=1 (#6361 ) ### What this PR does / why we need it? Disable enable_shared_expert_dp by default if tensor_parallel_size=1 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: underfituu <hzhucong@163.com>	2026-01-28 22:01:01 +08:00
Li Wang	8b0a7b6d80	[CI] Nightly tests use `releases/v0.13.0` (#6355 ) ### What this PR does / why we need it? The prerequisite PR is https://github.com/vllm-project/vllm-ascend/pull/6353. This patch moves the nightly tests to `releases/v0.13.0`; all that is needed is to use the image built from that branch for the nightly runs. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-28 21:46:13 +08:00
Li Wang	501bb395b1	[CI] Fix image build (#6333 ) Try to fix schedule image build CI - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-28 21:36:44 +08:00
linfeng-yuan	245c1ca241	[0.14.1][bugfix][sched] fix incompatibility of RecomputeScheduler with vllm v0.14.1 (#6286 ) ### What this PR does / why we need it? This PR rebases RecomputeScheduler codebase to vllm tags/v0.14.1 in order to fix the incompatibility with vllm's original Scheduler and AsyncScheduler. Main changes focus on multimodal model and speculative decoding parts. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We tested this PR with 2P1D E2E serving test case. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 20:16:58 +08:00
linfeng-yuan	e25ee65729	[Misc][Test] add e2e test for apply_top_k_top_p_custom kernel (#6348 ) ### What this PR does / why we need it? Add e2e test case for apply_top_k_top_p_custom kernel and eliminate chinese comments. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 17:25:57 +08:00
Shaoxu Cheng	857c533e27	[CI]: add production safeguards for 300I (#6343 ) Update 310p files tracker to enable 310p e2e test per PR. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-28 16:43:48 +08:00
Shaoxu Cheng	9fadc8df4f	[Fixbugs]: fix 310p chunked prefill error caused by refactor (#6340 ) Adapt to the model runner refactor so that 310p works again - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-28 16:41:32 +08:00
dsxsteven	325cb16e3f	[BugFix][CI]Fix DeepSeek-R1-W8A8-longseq nightly CI (#6297 ) ### What this PR does / why we need it? The precision issue arose because the KV cache of the P node had not been fetched for an extended period (>6 min) and was forcibly freed. To avoid this problem, the batch size was reduced and the timeout was extended. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: dsxsteven <dsxsteven@sina.com>	2026-01-28 16:36:24 +08:00
Yizhou	ac963f1519	[Fix] Adds CUDA graph stats to execution state (#6331 ) ### What this PR does / why we need it? Adds a CUDA graph profiling stats field to the execution state and updates the NPU model runner to set, unpack, and forward those stats during execution. This preserves CUDA graph metrics across state transitions, improving observability for later use and diagnostics. ### Does this PR introduce _any_ user-facing change? Enable this by set ```python llm = LLM( ... disable_log_stats=False, cudagraph_metrics=True, ... ) ``` or `--cudagraph-metrics` and make sure do not disable log stats. After this, you should be able to see something like this, which is really helpful for some light debugging: ``` [loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0% [cuda_graph.py:117] CUDAGraph Config Settings: [cuda_graph.py:117] [cuda_graph.py:117] - Mode: FULL_DECODE_ONLY [cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32] [cuda_graph.py:117] [cuda_graph.py:117] CUDAGraph Stats: [cuda_graph.py:117] [cuda_graph.py:117] \| Unpadded Tokens \| Padded Tokens \| Num Paddings \| Runtime Mode \| Count \| [cuda_graph.py:117] \|-----------------\|---------------\|--------------\|--------------\|-------\| [cuda_graph.py:117] \| 4 \| 4 \| 0 \| FULL \| 18 \| [cuda_graph.py:117] \| 5 \| 5 \| 0 \| NONE \| 1 \| [cuda_graph.py:117] \| 1 \| 1 \| 0 \| FULL \| 1 \| [cuda_graph.py:117] \| 18 \| 18 \| 0 \| NONE \| 1 \| ``` ### How was this patch tested? None. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-28 16:34:20 +08:00
LICO67373	379ce599d0	[Bugfix] Add missing draft_attn_metadatas parameter to fix MTP test (#6232 ) ### What this PR does / why we need it? Fix the MTP test failure caused by accessing non-existent attribute `forward_context.draft_attn_metadatas`. Root cause: In `AscendAttentionBackendImpl.update_graph_params`, the code incorrectly accessed `forward_context.draft_attn_metadatas`, but `ForwardContext` class doesn't have this attribute. The original code passed this value via function parameter. Fix: Add `draft_attn_metadatas` parameter to the entire call chain: - `update_full_graph_params` function in `acl_graph.py` - All `update_graph_params` methods in attention backends - Pass the parameter correctly in `eagle_proposer.py` Also applied Gemini's suggestion to make `vllm_config=None` in `AscendAttentionCPImpl.update_graph_params` for API consistency. Related to item 9 in #5463 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This fixes the CI test failure: `test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-28 14:41:18 +08:00
wangxiyuan	f8e76a49fa	[CI] Upgrade transformers version (#6307 ) Upgrade transformers to >=4.56.4 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-28 14:06:39 +08:00
Wang Kunpeng	c498cea22d	[refactor] refactor execute_model and _dummy_run method (#6043 ) ### What this PR does / why we need it? The structure of the `execute_model` and `_dummy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. Achieve alignment with GPUModelRunner: Split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`. Modify `_generate_process_reqs_hidden_states` to `_model_forward`. Align the implementation of the `postprocess` phase. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-27 22:27:01 +08:00
TMC	41eb71d665	[Refactor] profiler config optimze (#6141 ) ### What this PR does / why we need it? This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include: Enable Data Simplification: Explicitly sets data_simplification=True in _ExperimentalConfig. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead. Use Lightweight Stack Tracing: Replaces with_stack with with_modules when torch_profiler_with_stack is enabled. In torch_npu, with_stack introduces heavy latency. with_modules provides equivalent semantic information with much lower overhead. Code Simplification: Removes redundant parameter configurations in _ExperimentalConfig by utilizing default values, making the codebase cleaner and easier to maintain. Test setup: max length = 50, profiler + stack enabled Before optimization: Profiler data size: 651 MB Generate time: 3 seconds After optimization: Profiler data size: 156 MB (≈76% reduction) Generate time: <1 second ### Does this PR introduce _any_ user-facing change? No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled. ### How was this patch tested? Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly containing necessary stack/module info, while showing the reported reduction in size and time. - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: mengchengTang <745274877@qq.com>	2026-01-27 22:09:50 +08:00
CodeCat	54e8389f8e	[Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (#6006 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present—ensuring the fusion pattern is correctly registered and matched into the computation graph. This time, we performed the operator fusion of MatmulAllReduceAddRMSNorm and added corresponding ST test cases for regression monitoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-27 16:41:48 +08:00
pu-zhe	21b6779a33	[UT]: refactoring 310p ops ut (#6296 ) ### What this PR does / why we need it? Refactor swiglu and rms_norm unittest case for 310P and 910B. Apply attention_v1 get_kv_cache_shape and build metadata on all of platforms ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? CI UT test - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-01-27 16:31:51 +08:00
pu-zhe	57fd6e4bd9	[Refact.]: refactoring 310p-kv cache allocator, align with main branch (#6270 ) ### What this PR does / why we need it? refactoring 310p-kv cache allocator, align with main branch vLLM version: v0.14.0 vLLM main: https://github.com/vllm-project/vllm-ascend/pull/6270 Qwen2.5-7B E2E Test --------- Signed-off-by: pu-zhe <puzhe1@h-partners.com> Signed-off-by: pu-zhe <zpuaa@outlook.com> Co-authored-by: pu-zhe <puzhe1@h-partners.com>	2026-01-27 16:26:48 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. Earlier versions added this function to change the default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which are included in vLLM's default `cudagraph_capture_sizes`). Now that FIA is the default attention op (and does not show this performance degradation), the override is no longer needed. Keeping it could also cause conflicts when a small `cudagraph_capture_sizes` (< 20) is set. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. 
Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
Icey	9780a995e1	[BugFix] Fix wheel package build workflow (#6276 ) ### What this PR does / why we need it? Fixes https://github.com/vllm-project/vllm-ascend/actions/runs/21348357385/job/61440051717 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-26 20:42:17 +08:00
InSec	595b57c4d4	[CI][BugFix] Qwen3-Next nightly test fix. (#6247 ) ### What this PR does / why we need it? Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the full graph mode. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: InSec <1790766300@qq.com>	2026-01-26 19:53:53 +08:00
wangxiyuan	d9979f4d13	[Doc] quick fix for vllm-ascend version (#6278 ) Correct vllm-ascend version name in doc - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-26 19:33:18 +08:00
wangxiyuan	cb553f8eee	[Community] Nominate whx-sjtu as maintainer (#6268 ) Now that the first releases of 2026, v0.13.0rc2 and v0.14.0rc1, are out, we are refreshing the maintainer team. I nominate whx-sjtu as a new maintainer. - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-26 19:22:26 +08:00
Li Wang	43be004379	[Lint] Fix mypy issue to make CI happy (#6272 ) ### What this PR does / why we need it? The variables `self.prefiller_heap` `self.decoder_heap` are used as `List[tuple[float, int, ServerState]]` but defined as `List[tuple[int, int, ServerState]]`, which leads to the failed of mypy, see https://github.com/vllm-project/vllm-ascend/actions/runs/21351411010/job/61448739554?pr=6265 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 17:54:00 +08:00
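The annotation mismatch in the commit above can be shown with a short sketch. `ServerState` here is a hypothetical stand-in with made-up fields, and `Proxy.add` is an illustrative method; only the heap element type `tuple[float, int, ServerState]` comes from the commit message.

```python
import heapq
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical stand-in for the real ServerState class."""
    host: str
    port: int


class Proxy:
    def __init__(self) -> None:
        # The first element is a float priority, so the annotation must say
        # float; annotating it as int is what mypy would reject.
        self.prefiller_heap: list[tuple[float, int, ServerState]] = []

    def add(self, priority: float, index: int, server: ServerState) -> None:
        # The unique index breaks ties so ServerState is never compared.
        heapq.heappush(self.prefiller_heap, (priority, index, server))


proxy = Proxy()
proxy.add(1.5, 0, ServerState("127.0.0.1", 8100))
proxy.add(0.5, 1, ServerState("127.0.0.1", 8101))
priority, index, server = heapq.heappop(proxy.prefiller_heap)
print(priority, index, server.port)
```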
Mercykid-bash	29fb27d3bb	BugFix: Fix moe_load accumulation error in ACL graph mode (#6182 ) This PR fixes the numerical error in moe_load accumulation under ACL graph mode on NPU: using += for NPU tensors in graph mode does not throw errors but leads to incorrect values, so we replace it with the in-place add_() method to ensure accurate calculation. Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>	2026-01-26 17:18:46 +08:00
Canlin Guo	2d3b8a51f9	[Patch] Remove the patch of ECExampleConnector (#5976 ) ### What this PR does / why we need it? Part of #5304. https://github.com/vllm-project/vllm/pull/30225 has been merged now. We don't need this patch anymore. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-26 17:10:03 +08:00
Jingchun Gao	b390e0ef78	[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (#5416 ) - Fixed the computing of final hidden_states when enabling pipeline parallel and prefill context parallel at the same time. Only in the last PP rank, hidden_states are required and have right tensor type. - Fixed the shape of intermediate_tensors in the dummy_run when enabling pipeline parallel and flashcomm1. The intermediate_tensors should be divided by tp_size. Otherwise, the moe will raise issues. - Fixed the shape of self.intermediate_tensors for sufficient slice space - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>	2026-01-26 16:53:07 +08:00
yuxinshan	7d119df2a9	[Feat] proxy delay to remove instances (#5934 ) ### What this PR does / why we need it? Normally, the proxy should remove instances only when it is not processing requests. Sometimes, however, faulty nodes need to be isolated while a large number of requests are coming in. This PR supports that by lowering the priority of the faulty nodes to isolate them, then deleting them once the proxy is no longer processing requests. ### Does this PR introduce _any_ user-facing change? For `examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`, when the `/instances/remove` API is used to delete a node from the proxy server: ```txt curl -X POST http://localhost:9000/instances/remove \ -H "Content-Type: application/json" \ -d '{ "type": "decode", "instances": "127.0.0.1:8201" }' ``` there are two cases: * 【New】When the proxy is processing requests, the nodes are isolated and then removed once the proxy is free. ```txt {"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']} ``` * When the proxy is free, the nodes are removed directly. ```txt {"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']} ``` ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: yuxinshan <syx_ctyg@126.com>	2026-01-26 16:29:45 +08:00
Li Wang	de095c5fed	[CI] Add workflow_dispatch for nightly image build (#6269 ) ### What this PR does / why we need it? Currently, the nightly image is built at 20:00 and 23:00 UTC+8. To meet timeliness requirements, this PR adds a new trigger method for nightly image builds. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 15:56:38 +08:00
ChenCangtao	1645546661	[bugfix][npugraph_ex]fix static kernel uninstall issue (#6128 ) ### What this PR does / why we need it? The static kernel in torch_npu is uninstalled through Python's atexit mechanism. However, in vllm-ascend, the worker process is terminated when inference ends or the service stops. Terminating the process this way does not trigger the atexit mechanism, so the static kernel is never unloaded. When the npugraph_ex backend is used with the static kernel enabled, we register a signal handler to unload the static kernel explicitly. When there are many static kernels, unloading usually takes some time, whereas vllm kills the process directly after sending a terminate event. Therefore, we handle the unloading by starting a new process. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-26 15:03:18 +08:00
Nengjun Ma	f910cebe04	[Doc] 310P Documents update (#6246 ) ### What this PR does / why we need it? Updates the 310P support guides, as 310P is now supported in the main branch. --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-26 14:33:21 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling with the mooncake connector, different nodes must be able to use different tp sizes. To that end, prefill node information such as the tp size is transferred through the request's kv_transfer_params. The decode nodes read the prefill tp size from the request's kv_transfer_params instead of from the mooncake connector configuration. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
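
The type mismatch described in commit 43be004379 can be illustrated with a minimal, standalone sketch. Everything here is an assumption for illustration: `ServerState` is a hypothetical stand-in rather than the class used by the proxy, and the priority values are made up. The point is only that a heap whose priorities are fractional needs `float` in the first tuple position of its annotation.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ServerState:
    # Hypothetical stand-in for the proxy's per-instance state.
    host: str
    port: int


# The priority is fractional, so the first tuple element is annotated as
# float. Annotating it as int while pushing floats is the kind of
# mismatch a static type checker reports.
prefiller_heap: List[Tuple[float, int, ServerState]] = []

servers = [ServerState("127.0.0.1", 8100), ServerState("127.0.0.1", 8101)]
loads = [1.5, 0.25]  # made-up fractional priorities

for index, (load, server) in enumerate(zip(loads, servers)):
    # The unique index breaks ties, so ServerState itself is never compared.
    heapq.heappush(prefiller_heap, (load, index, server))

# Pop the entry with the smallest priority.
priority, index, chosen = heapq.heappop(prefiller_heap)
print(priority, index, chosen.port)
```

The integer in the second position is a common tie-breaker for heaps of tuples: it keeps ordering well defined even when two priorities are equal and the payload type defines no ordering.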

2281 Commits