xc-llm-ascend

Author	SHA1	Message	Date
lilinsiman	7932255c06	[Refactor][EAGLE] 6/N route mtp to eagle except pcp/dcp+mtp (#6349 ) ### What this PR does / why we need it? Overview: This pull request refactors speculative decoding for Eagle and MTP proposers on Ascend hardware. It fixes a bug related to draft_attn_metadatas being lost, migrates the lmhead feature, and adds routing logic in MtpProposer. Details: 1. Migrated the lmhead feature from mtp to eagle and normalized it in eagle_proposer. 2. Fixed the bug where draft_attn_metadatas was lost after enabling eagle mode in the merge graph. 3. Added the routing for pcp and disable padded drafter batch; in mtp mode, if pcp and disable padded drafter batch are not enabled, the normalized file eagle_proposer will be used. RFC: https://github.com/vllm-project/vllm-ascend/issues/5467 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut and test - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-02-02 19:15:31 +08:00
LHXuuu	45a573cff1	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	2026-02-02 16:39:32 +08:00
lty	082aa2e5b7	[Bugfix]The service fails to be started when the memcache pool is enabled (#6229 ) ### What this PR does / why we need it? The service fails to be started when the memcache pool is enabled without configuring the mooncake path. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? ``` #memcache echo 200000 > /proc/sys/vm/nr_hugepages source /usr/local/memfabric_hybrid/set_env.sh source /usr/local/memcache_hybrid/set_env.sh source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \ --host $local_ip \ --port 8002 \ --served-model-name model \ --data-parallel-size 2 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --no-enable-chunked-prefill \ --max-num-seqs 4 \ --max-model-len 8192 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.9 \ --trust-remote-code \ --enforce-eager \ --quantization ascend \ --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \ --kv-transfer-config \ '{ "kv_connector": "AscendStoreConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "backend": "memcache", "lookup_rpc_port":"0" } }' ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-02-02 16:26:18 +08:00
Shaoxu Cheng	460ea88276	[Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425 ) ### What this PR does / why we need it? - Replace the RoPE operator implementation. - Refactor some leftover implementations of 300I DUO in the main branch. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-02 16:12:04 +08:00
wangxiyuan	eeedf7c503	[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 ) ### What this PR does / why we need it? This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves: - Updating the `VLLM_TAG` in all `Dockerfile`. - Updating the vLLM version in `docs/source/conf.py`. - Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance. - Fix `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to https://github.com/vllm-project/vllm/pull/31972. - Fix `_shared_experts: 'NoneType' object is not callable` due to https://github.com/vllm-project/vllm/pull/32082 by https://github.com/vllm-project/vllm-ascend/pull/6335. - Fix `ReshapeAndCacheOperation setup failed!` due to https://github.com/vllm-project/vllm/pull/25954 by overriding attention metadata slots. This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project. ### Does this PR introduce _any_ user-facing change? No, this is an internal dependency update and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` co-authored-by: shen-shanshan <467638484@qq.com> --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-02 15:57:55 +08:00
SILONG ZENG	347eb36a59	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #9 ) (#6135 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/worker/model_runner_v1.py`\| \|`vllm_ascend/worker/pcp_utils.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-01 23:20:20 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
fems14	775fbc4cd2	【main】【bugfix】fix: restrict default MLAPO activation to Decode nodes only (#6451 ) ### What this PR does / why we need it? There is an issue with the current default logic for MLAPO (MLA Policy Optimization). By design, MLAPO should only be enabled by default on Decode (D) nodes. However, in hybrid (collocated prefill and decode) scenarios, the strategy is erroneously activated during the Prefill stage. This PR corrects the default behavior to ensure that MLAPO is exclusively enabled for the Decoding phase. This prevents unexpected policy interference during Prefill and ensures optimal performance in hybrid deployment environments. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-01-31 22:44:56 +08:00
Li Wang	5b0a6bcfe9	[ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6459 ) This reverts commit `56f5d3bd49`. ### What this PR does / why we need it? The patch https://github.com/vllm-project/vllm-ascend/pull/6357 breaks functionality in the spec_decode scenario, so let's revert it and make CI happy first. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-31 16:33:34 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
wubin58	4230bc8646	[Bugfix]Modify NPU rotary encoding parameter fields，fix RopeOperation setup failed in condition of self.rotary_dim < self.head_size (#6310 ) ### What this PR does / why we need it? change self.head_size to self.rotary_dim. only the rotary part is processed here, the dimension should be rotary_dim. Fix bug #6060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Only a small section of code was modified to adjust the parameters, and all standard tests were passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net> Co-authored-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>	2026-01-30 21:25:04 +08:00
Yizhou	56f5d3bd49	[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357 ) ### What this PR does / why we need it? This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-30 16:41:44 +08:00
ChenCangtao	f2990f7741	[e2e Test][npugraph_ex]add static kernel e2e test case (#6320 ) ### What this PR does / why we need it? Added an E2E test case for the scenario of enabling a static kernel for npugraph_ex, monitoring its compilation and unloading process. Also fixed the previously existing spelling errors - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-30 16:24:48 +08:00
liziyu	d252e4f5ec	[P/D] Using the cache load operator to replace the index select operator. (#6295 ) ### What this PR does / why we need it? Using the cache load operator to replace the index select operator. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-30 14:27:53 +08:00
Wang Kunpeng	70cc5f7969	[bugfix]fix rope_forward_triton error (#6404 ) ### What this PR does / why we need it? The rope_forward_triton method reports an error. For example: ``` (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True) (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/triton/rope.py", line 155, in rope_forward_triton (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] cos = cos.view(num_tokens, -1) (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] RuntimeError: shape '[14, -1]' is invalid for input of size 768 ``` This is because an incorrect num_tokens_padded was passed in. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-01-30 14:09:00 +08:00
zxr2333	14bd55f30c	[P/D][BugFix] Fix layerwise P/D request_id error (#6360 ) ### What this PR does / why we need it? Fix layerwise Connector P/D request_id error, due to vllm pr: https://github.com/vllm-project/vllm/pull/27987, which will add a random suffix to request_id in EngineCore. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-01-29 20:19:05 +08:00
Qiu	feab047084	[bugfix](pcp,gqa) set kv_inverse_idx_for_chunk and cp_kv_recover_idx_for_chunk to None when dcp only (#6317 ) ### What this PR does / why we need it? We only do restore and recover for pcp, so we should set `kv_inverse_idx_for_chunk` and `cp_kv_recover_idx_for_chunk` to `None` when only using dcp. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 19:35:52 +08:00
Qiu	50e0e87646	[bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344 ) ### What this PR does / why we need it? PR #5672 attempted to remove the -1 padding for duplicate tokens in the decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler slicing approach. However, in the single-ops logic and mixed PD batches, the decode slot_mapping did not eliminate the -1 and also shared the slicing method, resulting in incorrect slot_mapping. This PR resolves this issue, and the logic will be further consolidated in subsequent refactoring PRs. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 16:48:37 +08:00
Sergey-Zlobin	6a7b3bc29c	Qwen3-VL-MoE EAGLE support for vLLM-Ascend (#6327 ) ### What this PR does / why we need it? Qwen3-VL-MoE EAGLE support for vLLM-Ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The patch tested with Qwen3-VL-30B-A3B-Instruct model - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Sergey_Zlobin <sirg_zlobin@mail.ru>	2026-01-29 16:44:30 +08:00
JiangWeixiang	41a52beb26	[bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325 ) ### What this PR does / why we need it? This PR fixes a critical bug in the PD-separated inference pipeline where KV cache on the Prefill (P) side was not being properly released. The issue arises when multiple clients use the same x-request-id: to avoid request ID collisions, both Prefill and Decode nodes append a random suffix to the incoming x-request-id. A previous PR ensured consistency by having the P-side pass its final request_id as remote_request_id to the D-side via kv_transfer_param. However, during KV cache cleanup, the D-side incorrectly used the local req_id (instead of remote_request_id) to select the target P-side rank. This mismatch caused the P-side KV cache to remain unreleased on certain ranks, leading to memory leaks. This PR corrects the logic to use remote_request_id consistently when determining the P-side rank. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The fix was validated by running multiple concurrent benchmark instances - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: ghphotoframe <854746559@qq.com>	2026-01-29 16:05:56 +08:00
wangxiyuan	7a5b345dc4	[Misc] Drop deepseek patch (#6288 ) We patched deepseek earlier because we noticed an assertion error raised by transformers. After the transformers upgrade, the patch no longer appears to be needed, so let's remove it. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-29 14:45:50 +08:00
whx	39f8af9d96	[Main2Main][BugFix] Add shared_experts check for AscendSharedFusedMoE (#6335 ) ### What this PR does / why we need it? PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes Qwen3-Moe models also go into `SharedFusedMoE`, while current implementation of our `AscendSharedFusedMoE` assumes shared_experts always exist. This PR adds checking to `multistream_overlap_shared_expert` and `multistream_overlap_gate` in order to only enable these features when shared experts exist. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All ci passed - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-01-29 08:47:20 +08:00
hucong	df588ed488	[BugFix] Disable enable_shared_expert_dp by default if tensor_parallel_size=1 (#6361 ) ### What this PR does / why we need it? Disable enable_shared_expert_dp by default if tensor_parallel_size=1 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: underfituu <hzhucong@163.com>	2026-01-28 22:01:01 +08:00
linfeng-yuan	245c1ca241	[0.14.1][bugfix][sched] fix incompatibility of RecomputeScheduler with vllm v0.14.1 (#6286 ) ### What this PR does / why we need it? This PR rebases RecomputeScheduler codebase to vllm tags/v0.14.1 in order to fix the incompatibility with vllm's original Scheduler and AsyncScheduler. Main changes focus on multimodal model and speculative decoding parts. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We tested this PR with 2P1D E2E serving test case. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 20:16:58 +08:00
Shaoxu Cheng	9fadc8df4f	[Fixbugs]: fix refactor cause to 310p chunkprefill error (#6340 ) Adapt modelrunner refactor change to make 310p work - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-28 16:41:32 +08:00
Yizhou	ac963f1519	[Fix] Adds CUDA graph stats to execution state (#6331 ) ### What this PR does / why we need it? Adds a CUDA graph profiling stats field to the execution state and updates the NPU model runner to set, unpack, and forward those stats during execution. This preserves CUDA graph metrics across state transitions, improving observability for later use and diagnostics. ### Does this PR introduce _any_ user-facing change? Enable this by set ```python llm = LLM( ... disable_log_stats=False, cudagraph_metrics=True, ... ) ``` or `--cudagraph-metrics` and make sure do not disable log stats. After this, you should be able to see something like this, which is really helpful for some light debugging: ``` [loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0% [cuda_graph.py:117] CUDAGraph Config Settings: [cuda_graph.py:117] [cuda_graph.py:117] - Mode: FULL_DECODE_ONLY [cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32] [cuda_graph.py:117] [cuda_graph.py:117] CUDAGraph Stats: [cuda_graph.py:117] [cuda_graph.py:117] \| Unpadded Tokens \| Padded Tokens \| Num Paddings \| Runtime Mode \| Count \| [cuda_graph.py:117] \|-----------------\|---------------\|--------------\|--------------\|-------\| [cuda_graph.py:117] \| 4 \| 4 \| 0 \| FULL \| 18 \| [cuda_graph.py:117] \| 5 \| 5 \| 0 \| NONE \| 1 \| [cuda_graph.py:117] \| 1 \| 1 \| 0 \| FULL \| 1 \| [cuda_graph.py:117] \| 18 \| 18 \| 0 \| NONE \| 1 \| ``` ### How was this patch tested? None. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-28 16:34:20 +08:00
LICO67373	379ce599d0	[Bugfix] Add missing draft_attn_metadatas parameter to fix MTP test (#6232 ) ### What this PR does / why we need it? Fix the MTP test failure caused by accessing non-existent attribute `forward_context.draft_attn_metadatas`. Root cause: In `AscendAttentionBackendImpl.update_graph_params`, the code incorrectly accessed `forward_context.draft_attn_metadatas`, but `ForwardContext` class doesn't have this attribute. The original code passed this value via function parameter. Fix: Add `draft_attn_metadatas` parameter to the entire call chain: - `update_full_graph_params` function in `acl_graph.py` - All `update_graph_params` methods in attention backends - Pass the parameter correctly in `eagle_proposer.py` Also applied Gemini's suggestion to make `vllm_config=None` in `AscendAttentionCPImpl.update_graph_params` for API consistency. Related to item 9 in #5463 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This fixes the CI test failure: `test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-28 14:41:18 +08:00
Wang Kunpeng	c498cea22d	[refactor] refactor excute_model and _dymmy_run method (#6043 ) ### What this PR does / why we need it? The structure of the `excute_model` and `_dymmy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. Achieve alignment with GPUModelRunner: Split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`. Modify `_generate_process_reqs_hidden_states` to `_model_forward`. Align the implementation of the `postprocess` phase Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-27 22:27:01 +08:00
TMC	41eb71d665	[Refactor] profiler config optimze (#6141 ) ### What this PR does / why we need it? This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include: Enable Data Simplification: Explicitly sets data_simplification=True in _ExperimentalConfig. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead. Use Lightweight Stack Tracing: Replaces with_stack with with_modules when torch_profiler_with_stack is enabled. In torch_npu, with_stack introduces heavy latency. with_modules provides equivalent semantic information with much lower overhead. Code Simplification: Removes redundant parameter configurations in _ExperimentalConfig by utilizing default values, making the codebase cleaner and easier to maintain. Test setup: max length = 50, profiler + stack enabled Before optimization: Profiler data size: 651 MB Generate time: 3 seconds After optimization: Profiler data size: 156 MB (≈76% reduction) Generate time: <1 second ### Does this PR introduce _any_ user-facing change? No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled. ### How was this patch tested? Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly containing necessary stack/module info, while showing the reported reduction in size and time. - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: mengchengTang <745274877@qq.com>	2026-01-27 22:09:50 +08:00
CodeCat	54e8389f8e	[Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (#6006 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present—ensuring the fusion pattern is correctly registered and matched into the computation graph. This time, we performed the operator fusion of MatmulAllReduceAddRMSNorm and added corresponding ST test cases for regression monitoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-27 16:41:48 +08:00
pu-zhe	57fd6e4bd9	[Refact.]: refactoring 310p-kv cache allocator, align with main branch (#6270 ) ### What this PR does / why we need it? refactoring 310p-kv cache allocator, align with main branch vLLM version: v0.14.0 vLLM main: https://github.com/vllm-project/vllm-ascend/pull/6270 Qwen2.5-7B E2E Test --------- Signed-off-by: pu-zhe <puzhe1@h-partners.com> Signed-off-by: pu-zhe <zpuaa@outlook.com> Co-authored-by: pu-zhe <puzhe1@h-partners.com>	2026-01-27 16:26:48 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we add this function to change default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which is included in default `cudagraph_capture_sizes` of VLLM). Now since we use FIA as default attention op (which does not contain such performance degradation), there is no need to add this default change. Otherwise, it could cause some conflicts if we set a small `cudagraph_capture_sizes` that < 20 now. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
Mercykid-bash	29fb27d3bb	BugFix: Fix moe_load accumulation error in ACL graph mode (#6182 ) This PR fixes the numerical error in moe_load accumulation under ACL graph mode on NPU: using += for NPU tensors in graph mode does not throw errors but leads to incorrect values, so we replace it with the in-place add_() method to ensure accurate calculation. Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>	2026-01-26 17:18:46 +08:00
Canlin Guo	2d3b8a51f9	[Patch] Remove the patch of ECExampleConnector (#5976 ) ### What this PR does / why we need it? Part of #5304. https://github.com/vllm-project/vllm/pull/30225 has been merged now. We don't need this patch anymore. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-26 17:10:03 +08:00
Jingchun Gao	b390e0ef78	[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (#5416 ) - Fixed the computing of final hidden_states when enabling pipeline parallel and prefill context parallel at the same time. Only in the last PP rank, hidden_states are required and have right tensor type. - Fixed the shape of intermediate_tensors in the dummy_run when enabling pipeline parallel and flashcomm1. The intermediate_tensors should be divided by tp_size. Otherwise, the moe will raise issues. - Fixed the shape of self.intermediate_tensors for sufficient slice space - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>	2026-01-26 16:53:07 +08:00
ChenCangtao	1645546661	[bugfix][npugraph_ex]fix static kernel uninstall issue (#6128 ) ### What this PR does / why we need it? The static kernel in torch_npu is uninstalled through Python's atexit mechanism. However, in vllm-ascend, when inference ends or the service stops, the worker process is terminated, and terminating the process does not trigger atexit, so the static kernel is not unloaded. When using the npugraph_ex backend with the static kernel enabled, we register a signal handler to explicitly unload the static kernel. When there are many static kernels, unloading usually takes some time, whereas vllm kills the process directly after sending a terminate event. Therefore, we handle the unloading by starting a new process. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-26 15:03:18 +08:00
Nengjun Ma	f910cebe04	[Doc] 310P Documents update (#6246 ) ### What this PR does / why we need it? Updates the 310P support guides, as 310P is now supported in the main branch. --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-26 14:33:21 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling with the mooncake connector, we need to support configuring different tp sizes for different nodes. To that end, we transfer the prefill node information, such as the tp size, through the request's kv_transfer_params. The decode nodes get the prefill tp size from the request's kv_transfer_params instead of from the configuration of the mooncake connector. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
LI SHENGYONG	611e223b7d	[EPLB][Bugfix] EPLB support fp/bf16 (#5531 ) ### What this PR does / why we need it? EPLB support dtype of fp/bf16. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? w8a8_dynamic Baseline: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| w8a8_dynamic eplb: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| The fp16 conversation is normal. The fp16 test is in progress. Baseline fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-26 14:28:16 +08:00
Li Wang	c26ad78f86	[CI][lint] Add rule `codespell` back (#6236 ) ### What this PR does / why we need it? After removing codespell for a while, we discovered that typo had trouble correctly recognizing certain misspelled words, so I suggest adding it back. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 14:12:33 +08:00
Canlin Guo	65289676b4	[Refactor] Separate `_prepare_inputs` to `_prepare_inputs` and `_preprocess` (#6191 ) ### What this PR does / why we need it? Align with upstream vLLM. This PR helps downstream vLLM-Omni reduce the cost of maintaining `_prepare_inputs`. It also makes the vLLM-Ascend code more readable, and lets us follow vLLM more closely in the future. The `preprocess` logic is the same as in GPUModelRunner, so we don't need to maintain it anymore. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-26 14:05:23 +08:00
Shanshan Shen	76ac688388	[MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204 ) ### What this PR does / why we need it? Currently, we pad the last dim of qkv to 128 before flash attention (in `AscendMMEncoderAttention`) to get better performance on Ascend NPU. However, the qkv padding is executed serially, which may lead to more overhead when launching `aclnnConstantPadNd` (launch 3 times). Since the three operations are mutually independent, we stack qkv first and then pad them in one kernel launch. With this optimization, TTFT has been reduced by 3.15%, peak throughput has been increased by 4.20%. --- ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch the server: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 ``` Run benchmark: ```bash vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 1000 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Benchmark duration (s): 122.33 Total input tokens: 66638 Total generated tokens: 122845 Request throughput (req/s): 8.17 Output token throughput (tok/s): 1004.18 Peak output token throughput (tok/s): 3073.00 Peak concurrent requests: 1000.00 Total token throughput (tok/s): 1548.90 ---------------Time to First Token---------------- Mean TTFT (ms): 51757.16 Median TTFT (ms): 44853.42 P99 TTFT (ms): 110700.14 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 226.06 Median TPOT (ms): 206.85 P99 TPOT (ms): 935.31 ---------------Inter-token Latency---------------- Mean ITL (ms): 208.82 Median ITL (ms): 96.37 P99 ITL (ms): 2183.13 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Benchmark duration (s): 121.47 Total input tokens: 66638 Total generated tokens: 122860 Request throughput (req/s): 8.23 Output token throughput (tok/s): 1011.47 Peak output token throughput (tok/s): 3202.00 Peak concurrent requests: 1000.00 Total token throughput (tok/s): 1560.08 ---------------Time to First Token---------------- Mean TTFT (ms): 50125.08 Median TTFT (ms): 46270.85 P99 TTFT (ms): 108107.12 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 227.11 Median TPOT (ms): 205.13 P99 TPOT (ms): 816.08 ---------------Inter-token Latency---------------- Mean ITL (ms): 204.60 Median ITL (ms): 92.66 P99 ITL (ms): 2219.02 ================================================== ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-26 10:20:24 +08:00
huangning1995	ce11fd49f3	[Feature] Batch invariant torch.compile (#6107 ) ### What this PR does / why we need it? Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to enable batch-invariant in vllm-ascend, we observed that the performance of BI in eager mode remains suboptimal. This PR further integrates batch-invariant with torch.compile, which improves inference performance by 350% when tested with Qwen3-0.6B. ### Does this PR introduce _any_ user-facing change? Previously, enabling both aclgraph and Batch-Invariant would cause an "ub overflow" error. This occurred because transposed input tensors could produce incorrect stride() values. To fix this, we now call .contiguous() on the input tensors before passing them to Triton kernels. This ensures a contiguous memory layout and prevents transposed tensors from causing incorrect stride calculations. ### Test Plan pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py ### Test Result ``` ============================================================================ slowest durations ============================================================================ 87.37s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle 77.39s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN 74.04s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail 73.59s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation (8 durations < 0.005s hidden. Use -vv to show these durations.) 
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================ ``` ### Performance export VLLM_BATCH_INVARIANT=1 vllm serve /home/Qwen3-0.6B \ --served-model-name qwen \ --port 8000 \ --max-num-seqs 256 \ --tensor-parallel-size 1 \ --max-model-len 5500 \ --max-num-batched-tokens 5500 \ --reasoning-parser qwen3 \ --gpu-memory-utilization 0.9 \ --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,32]}' \ --additional-config '{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}' vllm bench serve --served-model-name qwen --trust-remote-code --backend vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name random --random-input-len 512 --random-output-len 256 --num-prompts 800 --max-concurrency 8 torch.compile batch invariant performance: ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 477.21 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 1.68 Output token throughput (tok/s): 429.16 Peak output token throughput (tok/s): 472.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 1287.48 ---------------Time to First Token---------------- Mean TTFT (ms): 285.53 Median TTFT (ms): 312.70 P99 TTFT (ms): 324.22 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 17.59 Median TPOT (ms): 17.50 P99 TPOT (ms): 18.44 ---------------Inter-token Latency---------------- Mean ITL (ms): 17.59 Median ITL (ms): 17.45 P99 ITL (ms): 18.76 ================================================== ``` Eager ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 1694.70 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 0.47 Output token throughput (tok/s): 120.85 Peak output token throughput (tok/s): 136.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 362.54 ---------------Time to First Token---------------- Mean TTFT (ms): 164.29 Median TTFT (ms): 129.71 P99 TTFT (ms): 1961.66 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 65.81 Median TPOT (ms): 65.15 P99 TPOT (ms): 72.27 ---------------Inter-token Latency---------------- Mean ITL (ms): 65.81 Median ITL (ms): 64.64 P99 ITL (ms): 75.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: huangning1995 <huangning12@huawei.com>	2026-01-26 09:15:06 +08:00
linfeng-yuan	96309e2b79	[ops] support advanced apply_top_k_top_p without top_k constraint (#6098 ) ### What this PR does / why we need it? Implement `apply_top_k_top_p` via ascendC to eliminate the constraint that k must be in [1,1024]. It enables high-performance TopKTopP calculation and avoids the D2H synchronization introduced by k validation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E serving with `k=4096` and `p=0.95` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-26 09:08:42 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure isn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
Li Wang	63adbedb7a	[Worker] Implement update max_model_len interface for NPUWorker (#6193 ) ### What this PR does / why we need it? This patch adds the `update_max_model_len` interface. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:03:33 +08:00
drslark	384d84c7ef	[Bugfix] Avoided a bug of drafter when `dp` and `sp` are enabled (#6226 ) ### What this PR does / why we need it? Avoided a bug of drafter when `dp` and `sp` are enabled. Specifically, disable `sp` when drafter is dense. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? An aisbench test: ```shell python3 aisbench_test.py --input_len 3500 --output_len 1000 --data_num 100 --concurrency 320 --request_rate 8 ``` The result is okay. ```text [2026-01-24 22:38:20,256] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Calculate global interval offsets time: 0.5922 s 01/24 22:38:20 - AISBench - INFO - Process 0 using precomputed sleep offsets with 100 requests Process-0 pid:220279: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 100/100 [09:40<00:00, 5.81s/it] Pid: 220279 \| Post: 100 \| Received: 100 \| Failed: 0 \| Post Time:12.51s \| Receive Time:580.92s: Encoding output text...: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 100/100 [00:01<00:00, 93.75it/s] 01/24 22:48:02 - AISBench - INFO - Start converting origin data to detailed data ... 
01/24 22:48:02 - AISBench - INFO - Finish converting origin data to detailed data█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 100/100 [00:01<00:00, 95.08it/s] 01/24 22:48:02 - AISBench - INFO - Added 'Actual RPS: After Excluding Anomalies' to group 'Time - RPS: ' in legend explanation table 01/24 22:48:02 - AISBench - INFO - Successfully merged chart into position (1, 1) 01/24 22:48:02 - AISBench - INFO - RPS distribution charts saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html 01/24 22:48:02 - AISBench - INFO - Updated chart with actual RPS saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html [2026-01-24 22:48:02,557] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Start extracting pref datas ... [2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Finish extracting pref datas! [2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dumping detail perf data ... 
Dumping data to h5: 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:00<00:00, 75.31it/s] [2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dump detail perf data cost: 0.02995561994612217(s) [2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Performance task finished, results saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat 01/24 22:48:02 - AISBench - INFO - time elapsed: 586.32s Running tasks: 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [09:55<00:00, 595.91s/it] 01/24 22:48:05 - AISBench - INFO - Performance evaluation tasks completed. 01/24 22:48:05 - AISBench - INFO - Loading detail perf data of model='vllm-api-stream-chat' dataset='gsm8kdataset' ... 01/24 22:48:05 - AISBench - INFO - Starting request timeline processing... 01/24 22:48:05 - AISBench - INFO - Data preprocessing completed in 0.0004s 01/24 22:48:05 - AISBench - INFO - Generating timeline traces for 100 requests... 01/24 22:48:05 - AISBench - INFO - Generated timeline trace chunks in 0.0441s 01/24 22:48:05 - AISBench - INFO - Generating concurrency traces... 01/24 22:48:05 - AISBench - INFO - Generated concurrency trace chunks in 0.0011s 01/24 22:48:05 - AISBench - INFO - Creating figure layout... 01/24 22:48:05 - AISBench - INFO - Figure layout created in 0.0504s 01/24 22:48:05 - AISBench - INFO - Writing to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html... 
01/24 22:48:05 - AISBench - INFO - HTML written in 0.0181s 01/24 22:48:05 - AISBench - INFO - Completed! Total execution time: 0.1148s 01/24 22:48:05 - AISBench - INFO - The gsm8kdataset_plot has been saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html 01/24 22:48:05 - AISBench - INFO - Converting perf results of stage ... 01/24 22:48:05 - AISBench - INFO - Finish Converting! 01/24 22:48:05 - AISBench - INFO - Start calculating metrics ... 01/24 22:48:05 - AISBench - INFO - Start calculating common metrics ... 01/24 22:48:05 - AISBench - INFO - Start calculating add units ... 01/24 22:48:05 - AISBench - INFO - Finish calculating perf data! 01/24 22:48:05 - AISBench - INFO - Summarizing performance results... 01/24 22:48:05 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset: ╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡ │ E2EL │ total │ 300806.1781 ms │ 189326.0489 ms │ 568345.5121 ms │ 380629.6785 ms │ 384208.3527 ms │ 385363.7709 ms │ 566871.7684 ms │ 100 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TTFT │ total │ 107441.2231 ms │ 343.8054 ms │ 378132.3979 ms │ 188817.4877 ms │ 190985.8451 ms │ 192547.6847 ms │ 378008.356 ms │ 100 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ TPOT │ total │ 193.5585 ms │ 185.1008 ms │ 197.262 ms │ 193.8146 ms │ 195.0803 ms │ 196.0323 ms │ 196.9688 ms 
│ 100 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ ITL │ total │ 194.2067 ms │ 0.0108 ms │ 2782.7124 ms │ 184.9998 ms │ 194.2631 ms │ 221.2895 ms │ 304.363 ms │ 100 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ InputTokens │ total │ 3506.86 │ 3431.0 │ 3508.0 │ 3508.0 │ 3508.0 │ 3508.0 │ 3508.0 │ 100 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokens │ total │ 1000.0 │ 1000.0 │ 1000.0 │ 1000.0 │ 1000.0 │ 1000.0 │ 1000.0 │ 100 │ ├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤ │ OutputTokenThroughput │ total │ 3.7745 token/s │ 1.7595 token/s │ 5.2819 token/s │ 2.6272 token/s │ 5.1028 token/s │ 5.1502 token/s │ 5.2754 token/s │ 100 │ ╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛ ╒══════════════════════════╤═════════╤══════════════════╕ │ Common Metric │ Stage │ Value │ ╞══════════════════════════╪═════════╪══════════════════╡ │ Benchmark Duration │ total │ 580456.2704 ms │ ├──────────────────────────┼─────────┼──────────────────┤ │ Total Requests │ total │ 100 │ ├──────────────────────────┼─────────┼──────────────────┤ │ Failed Requests │ total │ 0 │ ├──────────────────────────┼─────────┼──────────────────┤ │ Success Requests │ total │ 100 │ ├──────────────────────────┼─────────┼──────────────────┤ │ Concurrency │ total │ 51.8224 │ ├──────────────────────────┼─────────┼──────────────────┤ │ Max Concurrency │ total │ 320 │ 
├──────────────────────────┼─────────┼──────────────────┤ │ Request Throughput │ total │ 0.1723 req/s │ ├──────────────────────────┼─────────┼──────────────────┤ │ Total Input Tokens │ total │ 350686 │ ├──────────────────────────┼─────────┼──────────────────┤ │ Prefill Token Throughput │ total │ 32.6398 token/s │ ├──────────────────────────┼─────────┼──────────────────┤ │ Total generated tokens │ total │ 100000 │ ├──────────────────────────┼─────────┼──────────────────┤ │ Input Token Throughput │ total │ 604.1558 token/s │ ├──────────────────────────┼─────────┼──────────────────┤ │ Output Token Throughput │ total │ 172.2783 token/s │ ├──────────────────────────┼─────────┼──────────────────┤ │ Total Token Throughput │ total │ 776.434 token/s │ ╘══════════════════════════╧═════════╧══════════════════╛ 01/24 22:48:05 - AISBench - INFO - Performance Result files locate in outputs/default/20260124_223809/performances/vllm-api-stream-chat. ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-25 17:45:29 +08:00
Canlin Guo	b45bd92c2b	[Bugfix] Add defensive check for multimodal_config (#6230 ) ### What this PR does / why we need it? In vLLM-Omni, the `ModelConfig` can be empty. We need to add a check before accessing the sub-fields of model_config. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Will be checked by CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-25 17:39:19 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
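
For readers unfamiliar with the operation named in commit `96309e2b79`, the following is a minimal pure-Python sketch of what top-k/top-p filtering computes. It is not the AscendC kernel from that PR and not vLLM's implementation; the function name is borrowed from the commit title only for readability, and the signature, the list-based input, and the tie-breaking behaviour are assumptions made for illustration. It keeps the `k` largest logits, then keeps the smallest prefix of those whose renormalized probability mass reaches `p`, and masks everything else with `-inf`.

```python
import math


def apply_top_k_top_p(logits, k=None, p=None):
    """Illustrative reference only: mask logits outside top-k / top-p with -inf.

    `logits` is a plain list of floats (assumption; real code works on tensors).
    """
    # Indices sorted by logit, largest first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)

    # Top-k: keep the k largest entries. Slicing places no upper bound on k,
    # so a k larger than the vocabulary simply keeps everything.
    keep = order if k is None else order[:k]

    if p is not None:
        # Softmax over the entries that survived top-k (shifted for stability).
        m = max(logits)
        exps = [math.exp(logits[i] - m) for i in keep]
        total = sum(exps)

        # Top-p: keep the smallest prefix whose cumulative probability >= p.
        cumulative = 0.0
        nucleus = []
        for i, e in zip(keep, exps):
            nucleus.append(i)
            cumulative += e / total
            if cumulative >= p:
                break
        keep = nucleus

    kept = set(keep)
    return [x if i in kept else -math.inf for i, x in enumerate(logits)]


example = apply_top_k_top_p([2.0, 1.0, 0.5, -1.0], k=4096, p=0.95)
```

With the values used in the commit's end-to-end test (`k=4096`, `p=0.95`), the top-k step is a no-op on this four-element example and only the top-p step removes the smallest logit.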


1706 Commits