xc-llm-ascend

Author	SHA1	Message	Date
whx	39f8af9d96	[Main2Main][BugFix] Add shared_experts check for AscendSharedFusedMoE (#6335 ) ### What this PR does / why we need it? PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes Qwen3-Moe models also go into `SharedFusedMoE`, while current implementation of our `AscendSharedFusedMoE` assumes shared_experts always exist. This PR adds checking to `multistream_overlap_shared_expert` and `multistream_overlap_gate` in order to only enable these features when shared experts exist. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All ci passed - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-01-29 08:47:20 +08:00
hucong	df588ed488	[BugFix] Disable enable_shared_expert_dp by default if tensor_parallel_size=1 (#6361 ) ### What this PR does / why we need it? Disable enable_shared_expert_dp by default if tensor_parallel_size=1 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: underfituu <hzhucong@163.com>	2026-01-28 22:01:01 +08:00
linfeng-yuan	245c1ca241	[0.14.1][bugfix][sched] fix incompatibility of RecomputeScheduler with vllm v0.14.1 (#6286 ) ### What this PR does / why we need it? This PR rebases RecomputeScheduler codebase to vllm tags/v0.14.1 in order to fix the incompatibility with vllm's original Scheduler and AsyncScheduler. Main changes focus on multimodal model and speculative decoding parts. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We tested this PR with 2P1D E2E serving test case. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 20:16:58 +08:00
Shaoxu Cheng	9fadc8df4f	[Fixbugs]: fix refactor cause to 310p chunkprefill error (#6340 ) Adapt 310p to the model runner refactor so that chunked prefill works again. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-28 16:41:32 +08:00
Yizhou	ac963f1519	[Fix] Adds CUDA graph stats to execution state (#6331 ) ### What this PR does / why we need it? Adds a CUDA graph profiling stats field to the execution state and updates the NPU model runner to set, unpack, and forward those stats during execution. This preserves CUDA graph metrics across state transitions, improving observability for later use and diagnostics. ### Does this PR introduce _any_ user-facing change? Enable this by setting ```python llm = LLM( ... disable_log_stats=False, cudagraph_metrics=True, ... ) ``` or by passing `--cudagraph-metrics`, and make sure log stats are not disabled. After this, you should see output like the following, which is helpful for light debugging: ``` [loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0% [cuda_graph.py:117] CUDAGraph Config Settings: [cuda_graph.py:117] [cuda_graph.py:117] - Mode: FULL_DECODE_ONLY [cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32] [cuda_graph.py:117] [cuda_graph.py:117] CUDAGraph Stats: [cuda_graph.py:117] [cuda_graph.py:117] \| Unpadded Tokens \| Padded Tokens \| Num Paddings \| Runtime Mode \| Count \| [cuda_graph.py:117] \|-----------------\|---------------\|--------------\|--------------\|-------\| [cuda_graph.py:117] \| 4 \| 4 \| 0 \| FULL \| 18 \| [cuda_graph.py:117] \| 5 \| 5 \| 0 \| NONE \| 1 \| [cuda_graph.py:117] \| 1 \| 1 \| 0 \| FULL \| 1 \| [cuda_graph.py:117] \| 18 \| 18 \| 0 \| NONE \| 1 \| ``` ### How was this patch tested? None. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-28 16:34:20 +08:00
LICO67373	379ce599d0	[Bugfix] Add missing draft_attn_metadatas parameter to fix MTP test (#6232 ) ### What this PR does / why we need it? Fix the MTP test failure caused by accessing non-existent attribute `forward_context.draft_attn_metadatas`. Root cause: In `AscendAttentionBackendImpl.update_graph_params`, the code incorrectly accessed `forward_context.draft_attn_metadatas`, but `ForwardContext` class doesn't have this attribute. The original code passed this value via function parameter. Fix: Add `draft_attn_metadatas` parameter to the entire call chain: - `update_full_graph_params` function in `acl_graph.py` - All `update_graph_params` methods in attention backends - Pass the parameter correctly in `eagle_proposer.py` Also applied Gemini's suggestion to make `vllm_config=None` in `AscendAttentionCPImpl.update_graph_params` for API consistency. Related to item 9 in #5463 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This fixes the CI test failure: `test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-28 14:41:18 +08:00
Wang Kunpeng	c498cea22d	[refactor] refactor excute_model and _dymmy_run method (#6043 ) ### What this PR does / why we need it? The structure of the `execute_model` and `_dummy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. This PR aligns them with GPUModelRunner: split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`; rename `_generate_process_reqs_hidden_states` to `_model_forward`; and align the implementation of the `postprocess` phase. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-27 22:27:01 +08:00
TMC	41eb71d665	[Refactor] profiler config optimze (#6141 ) ### What this PR does / why we need it? This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include: Enable Data Simplification: Explicitly sets data_simplification=True in _ExperimentalConfig. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead. Use Lightweight Stack Tracing: Replaces with_stack with with_modules when torch_profiler_with_stack is enabled. In torch_npu, with_stack introduces heavy latency. with_modules provides equivalent semantic information with much lower overhead. Code Simplification: Removes redundant parameter configurations in _ExperimentalConfig by utilizing default values, making the codebase cleaner and easier to maintain. Test setup: max length = 50, profiler + stack enabled Before optimization: Profiler data size: 651 MB Generate time: 3 seconds After optimization: Profiler data size: 156 MB (≈76% reduction) Generate time: <1 second ### Does this PR introduce _any_ user-facing change? No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled. ### How was this patch tested? Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly containing necessary stack/module info, while showing the reported reduction in size and time. - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: mengchengTang <745274877@qq.com>	2026-01-27 22:09:50 +08:00
CodeCat	54e8389f8e	[Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (#6006 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present—ensuring the fusion pattern is correctly registered and matched into the computation graph. This time, we performed the operator fusion of MatmulAllReduceAddRMSNorm and added corresponding ST test cases for regression monitoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-27 16:41:48 +08:00
pu-zhe	57fd6e4bd9	[Refact.]: refactoring 310p-kv cache allocator, align with main branch (#6270 ) ### What this PR does / why we need it? refactoring 310p-kv cache allocator, align with main branch vLLM version: v0.14.0 vLLM main: https://github.com/vllm-project/vllm-ascend/pull/6270 Qwen2.5-7B E2E Test --------- Signed-off-by: pu-zhe <puzhe1@h-partners.com> Signed-off-by: pu-zhe <zpuaa@outlook.com> Co-authored-by: pu-zhe <puzhe1@h-partners.com>	2026-01-27 16:26:48 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we added this function to change the default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which are included in vLLM's default `cudagraph_capture_sizes`). Now that FIA is the default attention op (and does not show this performance degradation), the override is no longer needed. Keeping it could also cause conflicts when `cudagraph_capture_sizes` is set to small values (< 20). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. 
Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
Mercykid-bash	29fb27d3bb	BugFix: Fix moe_load accumulation error in ACL graph mode (#6182 ) This PR fixes the numerical error in moe_load accumulation under ACL graph mode on NPU: using += for NPU tensors in graph mode does not throw errors but leads to incorrect values, so we replace it with the in-place add_() method to ensure accurate calculation. Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>	2026-01-26 17:18:46 +08:00
Canlin Guo	2d3b8a51f9	[Patch] Remove the patch of ECExampleConnector (#5976 ) ### What this PR does / why we need it? Part of #5304. https://github.com/vllm-project/vllm/pull/30225 has been merged now. We don't need this patch anymore. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-26 17:10:03 +08:00
Jingchun Gao	b390e0ef78	[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (#5416 ) - Fixed the computing of final hidden_states when enabling pipeline parallel and prefill context parallel at the same time. Only in the last PP rank, hidden_states are required and have right tensor type. - Fixed the shape of intermediate_tensors in the dummy_run when enabling pipeline parallel and flashcomm1. The intermediate_tensors should be divided by tp_size. Otherwise, the moe will raise issues. - Fixed the shape of self.intermediate_tensors for sufficient slice space - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>	2026-01-26 16:53:07 +08:00
ChenCangtao	1645546661	[bugfix][npugraph_ex]fix static kernel uninstall issue (#6128 ) ### What this PR does / why we need it? The static kernel in torch_npu is uninstalled through Python's atexit mechanism. However, in vllm-ascend, the worker process is terminated when inference ends or the service stops, and terminating the process this way does not trigger atexit, so the static kernel is never unloaded. When the npugraph_ex backend is used with the static kernel enabled, we register a signal handler to explicitly unload the static kernel. When there are many static kernels, unloading takes some time, whereas vLLM kills the process directly after sending a terminate event. Therefore, we perform the unload in a newly started process. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-26 15:03:18 +08:00
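The shutdown approach described in #6128 can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the vllm-ascend implementation: `spawn_cleanup`, `install_sigterm_cleanup`, and `CLEANUP_CMD` are hypothetical names, the cleanup command is a placeholder for the real torch_npu static-kernel unload (which the commit message does not show), and the sketch targets POSIX systems.

```python
import signal
import subprocess
import sys

# Placeholder command (assumption): stands in for the real unload logic.
CLEANUP_CMD = [sys.executable, "-c", "print('static kernels unloaded')"]


def spawn_cleanup(cmd=CLEANUP_CMD):
    """Start the cleanup in a separate process so it can outlive the worker."""
    # start_new_session puts the child in its own session and process group,
    # so a kill aimed at the worker's process group does not reach it.
    return subprocess.Popen(cmd, start_new_session=True)


def _on_sigterm(signum, frame):
    # atexit hooks do not run when a process dies from an unhandled SIGTERM,
    # so the cleanup is triggered explicitly from the signal handler.
    spawn_cleanup()
    sys.exit(0)


def install_sigterm_cleanup():
    """Register the SIGTERM handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)
    return _on_sigterm
```

Handing the slow work to a detached process keeps the handler itself short, which matters when the parent follows the terminate event with a hard kill.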
Nengjun Ma	f910cebe04	[Doc] 310P Documents update (#6246 ) ### What this PR does / why we need it? Update the 310P support guides, since 310P is now supported on the main branch. --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-26 14:33:21 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling with the Mooncake connector, different nodes must be able to use different tp sizes. To that end, we transfer prefill node information, such as the tp size, through the request's kv_transfer_params. Decode nodes read the prefill tp size from the request's kv_transfer_params instead of from the Mooncake connector configuration. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
LI SHENGYONG	611e223b7d	[EPLB][Bugfix] EPLB support fp/bf16 (#5531 ) ### What this PR does / why we need it? EPLB support dtype of fp/bf16. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? w8a8_dynamic Baseline: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| w8a8_dynamic eplb: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| The fp16 conversation is normal. The fp16 test is in progress. Baseline fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-26 14:28:16 +08:00
Li Wang	c26ad78f86	[CI][lint] Add rule `codespell` back (#6236 ) ### What this PR does / why we need it? After codespell had been removed for a while, we found that typo failed to recognize certain misspelled words, so I suggest adding it back. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 14:12:33 +08:00
Canlin Guo	65289676b4	[Refactor] Separate `_prepare_inputs` to `_prepare_inputs` and `_preprocess` (#6191 ) ### What this PR does / why we need it? Align with upstream vLLM. This PR helps downstream vLLM-Omni reduce the cost of maintaining `_prepare_inputs`. It also makes the vLLM-Ascend code more readable and lets us follow vLLM more closely in the future. The `preprocess` logic is the same as in GPUModelRunner, so we no longer need to maintain it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-26 14:05:23 +08:00
Shanshan Shen	76ac688388	[MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204 ) ### What this PR does / why we need it? Currently, we pad the last dim of qkv to 128 before flash attention (in `AscendMMEncoderAttention`) to get better performance on Ascend NPU. However, the qkv padding is executed serially, which may lead to more overhead when launching `aclnnConstantPadNd` (launch 3 times). Since the three operations are mutually independent, we stack qkv first and then pad them in one kernel launch. With this optimization, TTFT has been reduced by 3.15%, peak throughput has been increased by 4.20%. --- ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch the server: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 ``` Run benchmark: ```bash vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 1000 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Benchmark duration (s): 122.33 Total input tokens: 66638 Total generated tokens: 122845 Request throughput (req/s): 8.17 Output token throughput (tok/s): 1004.18 Peak output token throughput (tok/s): 3073.00 Peak concurrent requests: 1000.00 Total token throughput (tok/s): 1548.90 ---------------Time to First Token---------------- Mean TTFT (ms): 51757.16 Median TTFT (ms): 44853.42 P99 TTFT (ms): 110700.14 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 226.06 Median TPOT (ms): 206.85 P99 TPOT (ms): 935.31 ---------------Inter-token Latency---------------- Mean ITL (ms): 208.82 Median ITL (ms): 96.37 P99 ITL (ms): 2183.13 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Benchmark duration (s): 121.47 Total input tokens: 66638 Total generated tokens: 122860 Request throughput (req/s): 8.23 Output token throughput (tok/s): 1011.47 Peak output token throughput (tok/s): 3202.00 Peak concurrent requests: 1000.00 Total token throughput (tok/s): 1560.08 ---------------Time to First Token---------------- Mean TTFT (ms): 50125.08 Median TTFT (ms): 46270.85 P99 TTFT (ms): 108107.12 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 227.11 Median TPOT (ms): 205.13 P99 TPOT (ms): 816.08 ---------------Inter-token Latency---------------- Mean ITL (ms): 204.60 Median ITL (ms): 92.66 P99 ITL (ms): 2219.02 ================================================== ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-26 10:20:24 +08:00
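The "stack first, then pad once" idea from #6204 can be illustrated without any NPU dependency. The sketch below is a toy model on nested Python lists, not the `AscendMMEncoderAttention` code: `pad_last_dim`, `pad_qkv_serial`, and `pad_qkv_stacked` are hypothetical helpers, and each call to `pad_last_dim` merely stands in for one padding-kernel launch such as `aclnnConstantPadNd`.

```python
def pad_last_dim(tensor, target, stats):
    """Zero-pad the innermost lists of a nested list to length `target`.

    Each call stands in for one padding-kernel launch and is counted in `stats`.
    """
    stats["launches"] += 1

    def _pad(node):
        # Recurse until the innermost (1-D) lists are reached.
        if node and isinstance(node[0], list):
            return [_pad(child) for child in node]
        return node + [0.0] * (target - len(node))

    return _pad(tensor)


def pad_qkv_serial(q, k, v, target, stats):
    # Three independent pads: three launches.
    return tuple(pad_last_dim(t, target, stats) for t in (q, k, v))


def pad_qkv_stacked(q, k, v, target, stats):
    # "Stack" q, k, v along a new leading axis, pad once, then unpack.
    q_p, k_p, v_p = pad_last_dim([q, k, v], target, stats)
    return q_p, k_p, v_p


# Toy shapes (assumption): 2 tokens, head dim 80, padded to 128.
q = [[1.0] * 80 for _ in range(2)]
k = [[2.0] * 80 for _ in range(2)]
v = [[3.0] * 80 for _ in range(2)]

serial_stats = {"launches": 0}
stacked_stats = {"launches": 0}
serial = pad_qkv_serial(q, k, v, 128, serial_stats)
stacked = pad_qkv_stacked(q, k, v, 128, stacked_stats)
```

Both paths produce the same padded tensors; only the launch count differs (three versus one), which is the overhead the PR targets.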
huangning1995	ce11fd49f3	[Feature] Batch invariant torch.compile (#6107 ) ### What this PR does / why we need it? Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to enable batch-invariant in vllm-ascend, we observed that the performance of BI in eager mode remains suboptimal. This PR further integrates batch-invariant with torch.compile, which improves inference performance by 350% when tested with Qwen3-0.6B. ### Does this PR introduce _any_ user-facing change? Previously, enabling both aclgraph and Batch-Invariant would cause an "ub overflow" error. This occurred because transposed input tensors could produce incorrect stride() values. To fix this, we now call .contiguous() on the input tensors before passing them to Triton kernels. This ensures a contiguous memory layout and prevents transposed tensors from causing incorrect stride calculations. ### Test Plan pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py ### Test Result ``` ============================================================================ slowest durations ============================================================================ 87.37s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle 77.39s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN 74.04s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail 73.59s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation (8 durations < 0.005s hidden. Use -vv to show these durations.) 
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================ ``` ### Performance export VLLM_BATCH_INVARIANT=1 vllm serve /home/Qwen3-0.6B \ --served-model-name qwen \ --port 8000 \ --max-num-seqs 256 \ --tensor-parallel-size 1 \ --max-model-len 5500 \ --max-num-batched-tokens 5500 \ --reasoning-parser qwen3 \ --gpu-memory-utilization 0.9 \ --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,32]}' \ --additional-config '{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}' vllm bench serve --served-model-name qwen --trust-remote-code --backend vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name random --random-input-len 512 --random-output-len 256 --num-prompts 800 --max-concurrency 8 torch.compile batch invariant performance: ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 477.21 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 1.68 Output token throughput (tok/s): 429.16 Peak output token throughput (tok/s): 472.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 1287.48 ---------------Time to First Token---------------- Mean TTFT (ms): 285.53 Median TTFT (ms): 312.70 P99 TTFT (ms): 324.22 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 17.59 Median TPOT (ms): 17.50 P99 TPOT (ms): 18.44 ---------------Inter-token Latency---------------- Mean ITL (ms): 17.59 Median ITL (ms): 17.45 P99 ITL (ms): 18.76 ================================================== ``` Eager ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 1694.70 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 0.47 Output token throughput (tok/s): 120.85 Peak output token throughput (tok/s): 136.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 362.54 ---------------Time to First Token---------------- Mean TTFT (ms): 164.29 Median TTFT (ms): 129.71 P99 TTFT (ms): 1961.66 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 65.81 Median TPOT (ms): 65.15 P99 TPOT (ms): 72.27 ---------------Inter-token Latency---------------- Mean ITL (ms): 65.81 Median ITL (ms): 64.64 P99 ITL (ms): 75.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: huangning1995 <huangning12@huawei.com>	2026-01-26 09:15:06 +08:00
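The "350%" figure quoted in #6107 can be cross-checked against the two benchmark reports above. The snippet below is only an arithmetic check on the reported numbers; the variable names are illustrative, and reading "350%" as a roughly 3.5x ratio is an interpretation rather than something the PR states.

```python
# Values copied from the two "Serving Benchmark Result" reports above.
eager_output_tput = 120.85      # Output token throughput (tok/s), eager
compiled_output_tput = 429.16   # Output token throughput (tok/s), torch.compile
eager_mean_tpot = 65.81         # Mean TPOT (ms), eager
compiled_mean_tpot = 17.59      # Mean TPOT (ms), torch.compile

# Ratio of compiled to eager throughput (higher is better).
tput_ratio = compiled_output_tput / eager_output_tput
# Ratio of eager to compiled per-token latency (higher is better).
tpot_ratio = eager_mean_tpot / compiled_mean_tpot
# The same throughput change expressed as a relative increase over eager.
relative_gain_pct = (tput_ratio - 1.0) * 100.0

print(f"throughput ratio: {tput_ratio:.2f}x")
print(f"TPOT ratio: {tpot_ratio:.2f}x")
print(f"relative throughput gain: {relative_gain_pct:.0f}%")
```

On these numbers the compiled run delivers about 3.55x the eager throughput, which corresponds to a relative increase of roughly 255%.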
linfeng-yuan	96309e2b79	[ops] support advanced apply_top_k_top_p without top_k constraint (#6098 ) ### What this PR does / why we need it? Implement `apply_top_k_top_p` via ascendC to eliminate the constraint of k [1,1024]. It enables high performance TopKTopP calculation and avoid D2H synchronization introduced by k validation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E serving with `k=4096` and `p=0.95` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-26 09:08:42 +08:00
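For readers unfamiliar with the operation that #6098 accelerates, the following is a plain-Python reference sketch of top-k followed by top-p filtering on a single row of logits. It is not the AscendC kernel and is not guaranteed to match vLLM's `apply_top_k_top_p` in tie handling or edge cases; the signature and the choice to mask with negative infinity are assumptions made for illustration, and `k` is assumed to be at least 1.

```python
import math

NEG_INF = float("-inf")


def apply_top_k_top_p(logits, k=None, p=None):
    """Mask logits outside the top-k set and outside the top-p nucleus."""
    out = list(logits)

    if k is not None:
        # No upper bound on k: values larger than the vocabulary keep everything.
        k = min(k, len(out))
        kth_largest = sorted(out, reverse=True)[k - 1]
        # Ties with the k-th value are kept, so more than k entries may survive.
        out = [x if x >= kth_largest else NEG_INF for x in out]

    if p is not None:
        # Softmax over the surviving logits (masked entries get probability 0).
        peak = max(out)
        exps = [math.exp(x - peak) for x in out]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Keep the smallest high-probability prefix whose mass reaches p.
        order = sorted(range(len(out)), key=lambda i: probs[i], reverse=True)
        keep, mass = set(), 0.0
        for i in order:
            keep.add(i)
            mass += probs[i]
            if mass >= p:
                break
        out = [x if i in keep else NEG_INF for i, x in enumerate(out)]

    return out
```

For example, `apply_top_k_top_p([2.0, 1.0, 0.5, -1.0], k=2, p=0.5)` keeps only the first logit, because after top-k the leading token already carries more than half of the probability mass.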
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure isn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
Li Wang	63adbedb7a	[Worker] Implement update max_model_len interface for NPUWorker (#6193 ) ### What this PR does / why we need it? This patch adds the `update_max_model_len` interface. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:03:33 +08:00
drslark	384d84c7ef	[Bugfix] Avoided a bug of drafter when `dp` and `sp` are enabled (#6226 ) ### What this PR does / why we need it? Avoided a bug of drafter when `dp` and `sp` are enabled. Specifically, disable `sp` when drafter is dense. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? An aisbench test: ```shell python3 aisbench_test.py --input_len 3500 --output_len 1000 --data_num 100 --concurrency 320 --request_rate 8 ``` The result is okay. ```text [2026-01-24 22:38:20,256] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Calculate global interval offsets time: 0.5922 s 01/24 22:38:20 - AISBench - INFO - Process 0 using precomputed sleep offsets with 100 requests Process-0 pid:220279: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 100/100 [09:40<00:00, 5.81s/it] Pid: 220279 \| Post: 100 \| Received: 100 \| Failed: 0 \| Post Time:12.51s \| Receive Time:580.92s: Encoding output text...: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 100/100 [00:01<00:00, 93.75it/s] 01/24 22:48:02 - AISBench - INFO - Start converting origin data to detailed data ... 
01/24 22:48:02 - AISBench - INFO - Finish converting origin data to detailed data█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 100/100 [00:01<00:00, 95.08it/s] 01/24 22:48:02 - AISBench - INFO - Added 'Actual RPS: After Excluding Anomalies' to group 'Time - RPS: ' in legend explanation table 01/24 22:48:02 - AISBench - INFO - Successfully merged chart into position (1, 1) 01/24 22:48:02 - AISBench - INFO - RPS distribution charts saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html 01/24 22:48:02 - AISBench - INFO - Updated chart with actual RPS saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html [2026-01-24 22:48:02,557] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Start extracting pref datas ... [2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Finish extracting pref datas! [2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dumping detail perf data ... 
Dumping data to h5: 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:00<00:00, 75.31it/s] [2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dump detail perf data cost: 0.02995561994612217(s) [2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Performance task finished, results saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat 01/24 22:48:02 - AISBench - INFO - time elapsed: 586.32s Running tasks: 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [09:55<00:00, 595.91s/it] 01/24 22:48:05 - AISBench - INFO - Performance evaluation tasks completed. 01/24 22:48:05 - AISBench - INFO - Loading detail perf data of model='vllm-api-stream-chat' dataset='gsm8kdataset' ... 01/24 22:48:05 - AISBench - INFO - Starting request timeline processing... 01/24 22:48:05 - AISBench - INFO - Data preprocessing completed in 0.0004s 01/24 22:48:05 - AISBench - INFO - Generating timeline traces for 100 requests... 01/24 22:48:05 - AISBench - INFO - Generated timeline trace chunks in 0.0441s 01/24 22:48:05 - AISBench - INFO - Generating concurrency traces... 01/24 22:48:05 - AISBench - INFO - Generated concurrency trace chunks in 0.0011s 01/24 22:48:05 - AISBench - INFO - Creating figure layout... 01/24 22:48:05 - AISBench - INFO - Figure layout created in 0.0504s 01/24 22:48:05 - AISBench - INFO - Writing to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html... 
01/24 22:48:05 - AISBench - INFO - HTML written in 0.0181s
01/24 22:48:05 - AISBench - INFO - Completed! Total execution time: 0.1148s
01/24 22:48:05 - AISBench - INFO - The gsm8kdataset_plot has been saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html
01/24 22:48:05 - AISBench - INFO - Converting perf results of stage ...
01/24 22:48:05 - AISBench - INFO - Finish Converting!
01/24 22:48:05 - AISBench - INFO - Start calculating metrics ...
01/24 22:48:05 - AISBench - INFO - Start calculating common metrics ...
01/24 22:48:05 - AISBench - INFO - Start calculating add units ...
01/24 22:48:05 - AISBench - INFO - Finish calculating perf data!
01/24 22:48:05 - AISBench - INFO - Summarizing performance results...
01/24 22:48:05 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset:
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │ N   │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 300806.1781 ms │ 189326.0489 ms │ 568345.5121 ms │ 380629.6785 ms │ 384208.3527 ms │ 385363.7709 ms │ 566871.7684 ms │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 107441.2231 ms │ 343.8054 ms    │ 378132.3979 ms │ 188817.4877 ms │ 190985.8451 ms │ 192547.6847 ms │ 378008.356 ms  │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 193.5585 ms    │ 185.1008 ms    │ 197.262 ms     │ 193.8146 ms    │ 195.0803 ms    │ 196.0323 ms    │ 196.9688 ms    │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 194.2067 ms    │ 0.0108 ms      │ 2782.7124 ms   │ 184.9998 ms    │ 194.2631 ms    │ 221.2895 ms    │ 304.363 ms     │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3506.86        │ 3431.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 3.7745 token/s │ 1.7595 token/s │ 5.2819 token/s │ 2.6272 token/s │ 5.1028 token/s │ 5.1502 token/s │ 5.2754 token/s │ 100 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ total   │ 580456.2704 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ total   │ 100              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ total   │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ total   │ 100              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ total   │ 51.8224          │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ total   │ 320              │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ total   │ 0.1723 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ total   │ 350686           │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ total   │ 32.6398 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ total   │ 100000           │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ total   │ 604.1558 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ total   │ 172.2783 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ total   │ 776.434 token/s  │
╘══════════════════════════╧═════════╧══════════════════╛
01/24 22:48:05 - AISBench - INFO - Performance Result files locate in outputs/default/20260124_223809/performances/vllm-api-stream-chat.
```
- vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-25 17:45:29 +08:00
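As a sanity check on the AISBench summary above, the aggregate rows can be recomputed from the raw totals. The sketch below copies the numbers from the tables. The throughput formulas are plain totals divided by the benchmark duration. The relations used for `Concurrency` and `Prefill Token Throughput` are assumptions that happen to reproduce the reported values; they are not taken from AISBench documentation.

```python
# Values copied from the AISBench tables above.
benchmark_duration_ms = 580456.2704
total_requests = 100
total_input_tokens = 350686
total_generated_tokens = 100000
avg_e2el_ms = 300806.1781      # average end-to-end latency per request
avg_ttft_ms = 107441.2231      # average time to first token
avg_input_tokens = 3506.86     # average prompt length

duration_s = benchmark_duration_ms / 1000.0

# Totals divided by wall-clock duration.
request_throughput = total_requests / duration_s
input_token_throughput = total_input_tokens / duration_s
output_token_throughput = total_generated_tokens / duration_s
total_token_throughput = input_token_throughput + output_token_throughput

# Assumption: average concurrency = summed request latency / wall-clock duration.
concurrency = avg_e2el_ms * total_requests / benchmark_duration_ms

# Assumption: prefill throughput = average prompt tokens / average TTFT.
prefill_token_throughput = avg_input_tokens / (avg_ttft_ms / 1000.0)

print(f"Request Throughput:       {request_throughput:.4f} req/s")
print(f"Input Token Throughput:   {input_token_throughput:.4f} token/s")
print(f"Output Token Throughput:  {output_token_throughput:.4f} token/s")
print(f"Total Token Throughput:   {total_token_throughput:.4f} token/s")
print(f"Concurrency:              {concurrency:.4f}")
print(f"Prefill Token Throughput: {prefill_token_throughput:.4f} token/s")
```

The large gap between the average TTFT (about 107 s) and the TPOT (about 0.19 s) suggests that most of the end-to-end latency in this run was spent waiting for prefill rather than decoding.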
Canlin Guo	b45bd92c2b	[Bugfix] Add defensive check for multimodal_config (#6230 ) ### What this PR does / why we need it? In vLLM-Omni, `ModelConfig` can be empty. We need to add a check before accessing the sub-fields of model_config. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Will be checked by CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-25 17:39:19 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
Icey	7799c4ca3b	[Fusion] change fusion env variable (#6201 ) ### What this PR does / why we need it? Since CI has integrated Triton, `fuse_qknorm_rope` is enabled by default. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-24 22:49:33 +08:00
SILONG ZENG	6ccccad102	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #5 ) (#5996 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/utils.py` \| \| ` vllm_ascend/kv_offload/cpu_npu.py` \| \| ` vllm_ascend/kv_offload/npu.py` \| \| ` vllm_ascend/lora/lora_ops.py` \| \| ` vllm_ascend/lora/punica_npu.py` \| \| ` vllm_ascend/lora/utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-01-24 22:45:38 +08:00
SILONG ZENG	7faa6878a6	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #3 ) (#5978 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/attention/mla_v1.py` \| \| `vllm_ascend/attention/sfa_v1.py` \| \| `vllm_ascend/core/recompute_scheduler.py` \| \| `vllm_ascend/core/scheduler_dynamic_batch.py` \| \| `vllm_ascend/distributed/device_communicators/npu_communicator.py` \| \| `vllm_ascend/distributed/device_communicators/pyhccl.py` \| \| `vllm_ascend/distributed/device_communicators/pyhccl_wrapper.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: Soren <user@SorendeMac-mini.local>	2026-01-24 22:10:18 +08:00
SILONG ZENG	4e53c1d900	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6 ) (#6001 ) ### What this PR does / why we need it? \| File Path \| \| :--- \| \| ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` \| \| ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` \| \| ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` \| \| ` vllm_ascend/eplb/core/eplb_utils.py` \| \| ` vllm_ascend/eplb/core/eplb_worker.py` \| \| ` vllm_ascend/eplb/core/policy/policy_abstract.py` \| \| ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` \| \| ` vllm_ascend/eplb/core/policy/policy_factory.py` \| \| ` vllm_ascend/eplb/core/policy/policy_flashlb.py` \| \| ` vllm_ascend/eplb/core/policy/policy_random.py` \| \| ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` \| \| ` vllm_ascend/eplb/eplb_updator.py` \| \| ` vllm_ascend/eplb/utils.py` \| \| ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` \| \| ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` \| \| ` vllm_ascend/model_loader/netloader/interaction/elastic.py` \| \| ` vllm_ascend/model_loader/netloader/load.py` \| \| ` vllm_ascend/model_loader/netloader/netloader.py` \| \| ` vllm_ascend/model_loader/netloader/utils.py` \| \| ` vllm_ascend/patch/platform/__init__.py` \| \| ` vllm_ascend/patch/platform/patch_balance_schedule.py` \| \| ` vllm_ascend/patch/platform/patch_ec_connector.py` \| \| ` vllm_ascend/patch/platform/patch_mamba_config.py` \| \| ` vllm_ascend/patch/platform/patch_multiproc_executor.py` \| \| ` vllm_ascend/patch/platform/patch_sched_yield.py` \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-24 22:08:33 +08:00
SILONG ZENG	153da1a669	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #4 ) (#6200 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/distributed/kv_transfer/__init__.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-24 20:40:48 +08:00
Shaoxu Cheng	fbae41697e	[310P]: refactoring for 310p kvcache and some ops class (#6117 ) ### What this PR does / why we need it? * Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch. * Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator. * Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment. * Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? use the e2e tests. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-24 20:34:29 +08:00
Angazenn	5b746f3e83	[Inductor]change pass to adapt to new addrmsnormBias operator (#6094 ) ### What this PR does / why we need it? #5790 changes default addrmsnormBias operator if custom ops is enabled. This PR modifies AddRmsNormQuant pass to align with addrmsnormBias. --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 20:16:44 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
liziyu	f66bcdfb29	[P/D] Mooncake connector add zmq socket fail log (#6155 ) Mooncake connector add zmq socket fail log - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-24 12:06:42 +08:00
liziyu	14bef9af6f	[P/D] Remove restrictions on mooncake for IPv6 (#5946 ) ### What this PR does / why we need it? Remove restrictions on mooncake for IPv6. Dependencies: cann8.5, mooncake v0.3.8.post1 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-24 11:30:22 +08:00
Angazenn	019a2fe6e6	[Eagle3]enhance skipping dp allreduce and add it into eagle proposer (#6192 ) ### What this PR does / why we need it? This PR: 1. Enhances the logic of `_skip_all_reduce_across_dp_group` to skip all cpu dp allreduce for dense models. This also serves purpose 2. 2. Adds `_skip_all_reduce_across_dp_group` into eagle_proposer. Models like Qwen3-235b now support eagle3 spec decode. A typical setting for these moe models on pd disaggregation often introduces `dp_size > 1`. This requires `set_forward_context` to call a cpu dp allreduce to retrieve `num_tokens_across_dp` in all cases. Skipping this allreduce greatly improves performance. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 11:29:42 +08:00
UnifiedCacheManager	a2f022f9b6	[UCMConnector]Add has_connector_metadata (#6172 ) ### What this PR does / why we need it? Adds the `has_connector_metadata` interface to ucm_connector to adapt to the latest KV connector in vLLM. ### Does this PR introduce _any_ user-facing change? This PR doesn't introduce _any_ user-facing change. ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-23 21:16:48 +08:00
drslark	44a4ff6960	[main][BugFix] Avoided a bug of `torch_npu.npu_mm_reduce_scatter_base` when sp size >= 16 (#6168 ) ### What this PR does / why we need it? If `sp` is enabled and `tp_size` >= 16, `torch_npu.npu_mm_reduce_scatter_base` raises an exception. After consulting with the operator developer, we learned that the operator does not work when `tp` = 16. So, we disable the operator when `tp` = 16. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We started a server with `sp` enabled and `tp` = 16. It started successfully. ```text (APIServer pid=1855938) INFO: Started server process [1855938] (APIServer pid=1855938) INFO: Waiting for application startup. (APIServer pid=1855938) INFO: Application startup complete. ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-23 21:12:23 +08:00
yjmyl	e90b14140b	[feature] add_rms_norm support bias (#5790 ) ### What this PR does / why we need it? This PR replaces addRmsNorm and Add with addRmsNormBias, which is more efficient. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Full Test Pass - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com> Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>	2026-01-23 21:09:54 +08:00
baxingpiaochong	8786412f5c	[Bugfix]KV pool rank 0 consumes more HBM (#6113 ) ### What this PR does / why we need it? before add_set_deivce <img width="2354" height="674" alt="image" src="https://github.com/user-attachments/assets/8b81ab5f-b9ba-4fd2-8546-8f36ac15d32b" /> after <img width="1044" height="156" alt="image" src="https://github.com/user-attachments/assets/996d845a-8abd-4aae-b894-4a9832b1f742" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2026-01-23 19:47:33 +08:00
weiguihua2	4173255c0c	[main][Bugix] fix kv pcp+pooling+pd separation bug (#6153 ) ### What this PR does / why we need it? Fixes a bug in the scenario that combines pcp, pd separation, and kv pooling. In the pooling scenario, multi_nodes_meta_mapping is empty, so an error is reported when the remote_host information is obtained through the get_remote_port_send_num method. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-23 16:15:04 +08:00
zhaomingyu13	ff63626874	[Bugfix] Fix the issue of the acceptance rate decline for Qwen3-30B-A3B-EAGLE3 (#6138 ) ### What this PR does / why we need it? Because the code had not been synchronized with upstream for a long time, the fix for #5967 introduced a problem that lowered the acceptance rate of the Qwen3-30B-A3B-EAGLE3 draft model. This PR synchronizes with upstream and fixes that bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Create an LLM with an EAGLE3 draft model for speculative decoding.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        speculative_config={
            "method": "eagle3",
            "model": "AngelSlim/Qwen3-a3B_eagle3",
            "num_speculative_tokens": 3,
        },
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```
- vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarkblood@qq.com>	2026-01-23 16:12:56 +08:00
SILONG ZENG	78af0c30a3	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #12 ) (#6177 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/ops/triton/activation/swiglu_quant.py` \| \| `vllm_ascend/ops/triton/batch_invariant/matmul.py` \| \| `vllm_ascend/ops/triton/batch_invariant/mean.py` \| \| `vllm_ascend/ops/triton/batch_invariant/rmsnorm.py` \| \| `vllm_ascend/ops/triton/fla/chunk.py` \| \| `vllm_ascend/ops/triton/fla/chunk_delta_h.py` \| \| `vllm_ascend/ops/triton/fla/chunk_o.py` \| \| `vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py` \| \| `vllm_ascend/ops/triton/fla/cumsum.py` \| \| `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py` \| \| `vllm_ascend/ops/triton/fla/l2norm.py` \| \| `vllm_ascend/ops/triton/fla/layernorm_guard.py` \| \| `vllm_ascend/ops/triton/fla/sigmoid_gating.py` \| \| `vllm_ascend/ops/triton/fla/solve_tril.py` \| \| `vllm_ascend/ops/triton/fla/utils.py` \| \| `vllm_ascend/ops/triton/fla/wy_fast.py` \| \| `vllm_ascend/ops/triton/fused_gdn_gating.py` \| \| `vllm_ascend/ops/triton/layernorm_gated.py` \| \| `vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py` \| \| `vllm_ascend/ops/triton/mamba/causal_conv1d.py` \| \| `vllm_ascend/ops/triton/reject_sample.py` \| \| `vllm_ascend/ops/triton/rope.py` \| \| `vllm_ascend/ops/triton/spec_decode/utils.py` \| \| `vllm_ascend/ops/triton/triton_utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-23 14:59:19 +08:00
LI SHENGYONG	8210a62a44	[EPLB][Bugfix]Reduce unnecessary video memory usage (#6020 ) ### What this PR does / why we need it? 1. Incorporate the warm-up of the EPLB into the profile run. 2. Reuse the same gather buffer. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen3-235b aime baseline \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb The OOM issue does not occur. \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-23 14:21:13 +08:00
Qiu	749e24f81e	[bugfix] align max_num_batched_tokens with tppcp when using FLASHCOMM1 (#6000 ) ### What this PR does / why we need it? Align max_num_batched_tokens with tppcp when using FLASHCOMM1 to avoid assert error in `NPUModelRunner._dummy_run`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-23 14:19:49 +08:00
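The alignment described in the commit above can be illustrated with a small helper. This is a minimal sketch, not the code from the PR: it assumes that "tppcp" means the product `tp_size * pcp_size` and that aligning means rounding up to the next multiple. Both function names are hypothetical.

```python
def align_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Ceiling division on integers, then scale back up.
    return -(-value // multiple) * multiple


def aligned_max_num_batched_tokens(
    max_num_batched_tokens: int, tp_size: int, pcp_size: int
) -> int:
    """Hypothetical helper: align the token budget to tp_size * pcp_size.

    Assumption: the real change may round down or align to a different
    quantity; this only shows the general technique.
    """
    return align_up(max_num_batched_tokens, tp_size * pcp_size)


print(aligned_max_num_batched_tokens(8192, tp_size=4, pcp_size=2))
print(aligned_max_num_batched_tokens(1000, tp_size=8, pcp_size=3))
```

A budget that is already a multiple of `tp_size * pcp_size` is returned unchanged, so the helper is safe to apply unconditionally.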
simplzyu	f8d03d21f1	Add Medusa speculative decoding support for vllm_ascend (#5668 ) ### What this PR does / why we need it? `vllm_ascend` already supports several speculative decoding strategies such as MTP, EAGLE, N-gram, and suffix decoding. However, Medusa is not yet supported. Medusa is an efficient speculative decoding framework that leverages a lightweight draft model to propose multiple tokens in a single step, which can significantly improve decoding throughput and reduce latency. To enable Medusa-based speculative decoding on Ascend hardware and provide more decoding options for users, this PR adds Medusa support into the `vllm_ascend` speculative decoding pipeline. ### Does this PR introduce _any_ user-facing change? This PR introduces Medusa speculative decoding as an additional speculative decoding method: ✔ Adds `MedusaProposer` and integrates it into the speculative decoding registry ✔ Extends `SpecDcodeType` with a `MEDUSA` enum entry ✔ Updates `NPUModelRunner` to recognize and invoke Medusa during decoding ✔ Adds Medusa-specific handling in the draft token generation logic ✔ Ensures backward compatibility — Medusa is only used when explicitly enabled Key code changes include: * New file: `vllm_ascend/spec_decode/medusa_proposer.py` * Register Medusa in `get_spec_decode_method` * Extend proposer type hints to include `MedusaProposer` * Add a Medusa-specific branch in `generate_draft_token_ids` * Pass `sample_hidden_states` required by Medusa ### How was this patch tested? Medusa is implemented as a new proposer class (`MedusaProposer`) following the existing speculative decoding interface. The integration works as follows: 1. Users enable Medusa via the speculative decoding configuration. 2. `get_spec_decode_method()` returns a `MedusaProposer` instance when `method="medusa"`. 3. During decoding, `NPUModelRunner` detects that the active drafter is a `MedusaProposer`. 4. 
Instead of the generic speculative decoding path, the Medusa-specific `generate_token_ids()` method is invoked, which consumes: * `valid_sampled_token_ids` * `sampling_metadata` * `spec_decode_metadata` * `sample_hidden_states` 5. The proposed tokens are validated by the target model as usual. When Medusa is not enabled, the decoding pipeline behaves exactly as before, ensuring full backward compatibility. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: simplzyu <191163281@qq.com> Signed-off-by: simplzyu <zhenyuguo@cmbchina.com>	2026-01-23 14:14:23 +08:00
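The dispatch flow described in the Medusa commit above can be sketched with stand-in classes. This is an illustration only: the names `MedusaProposer`, `get_spec_decode_method`, `generate_draft_token_ids`, and the four arguments come from the commit message, but the class bodies, signatures, `GenericProposer`, and the toy drafting rule are assumptions and do not reflect the actual `vllm_ascend` implementation.

```python
class GenericProposer:
    """Stand-in for any non-Medusa drafter (hypothetical)."""

    def generate_token_ids(self, valid_sampled_token_ids):
        # Toy behaviour: propose nothing.
        return [[] for _ in valid_sampled_token_ids]


class MedusaProposer:
    """Stand-in for the Medusa drafter; the real one runs Medusa heads."""

    def __init__(self, num_heads=3):
        self.num_heads = num_heads

    def generate_token_ids(self, valid_sampled_token_ids, sampling_metadata,
                           spec_decode_metadata, sample_hidden_states):
        # Toy behaviour: repeat the last sampled token once per head.
        return [
            [ids[-1]] * self.num_heads if ids else []
            for ids in valid_sampled_token_ids
        ]


def get_spec_decode_method(method):
    # Return a Medusa drafter only when it is explicitly requested.
    if method == "medusa":
        return MedusaProposer()
    return GenericProposer()


def generate_draft_token_ids(drafter, valid_sampled_token_ids,
                             sampling_metadata=None,
                             spec_decode_metadata=None,
                             sample_hidden_states=None):
    # Medusa-specific branch: pass the extra inputs it consumes.
    if isinstance(drafter, MedusaProposer):
        return drafter.generate_token_ids(
            valid_sampled_token_ids,
            sampling_metadata,
            spec_decode_metadata,
            sample_hidden_states,
        )
    # Generic path is unchanged when Medusa is not enabled.
    return drafter.generate_token_ids(valid_sampled_token_ids)


medusa_drafts = generate_draft_token_ids(
    get_spec_decode_method("medusa"), [[1, 2], [7]]
)
generic_drafts = generate_draft_token_ids(
    get_spec_decode_method("ngram"), [[1, 2], [7]]
)
print(medusa_drafts, generic_drafts)
```

Keeping the Medusa branch behind an `isinstance` check is what lets every other drafter follow the existing path untouched, which matches the backward-compatibility claim in the commit message.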

1685 Commits