Commit Graph

1706 Commits

Author SHA1 Message Date
lilinsiman
7932255c06 [Refactor][EAGLE] 6/N route mtp to eagle except pcp/dcp+mtp (#6349)
### What this PR does / why we need it?

Overview: This pull request refactors speculative decoding for Eagle and
MTP proposers on Ascend hardware. It fixes a bug related to
draft_attn_metadatas being lost, migrates the lmhead feature, and adds
routing logic in MtpProposer.

Details:
1. Migrated the lmhead feature from mtp to eagle and normalized it in
eagle_proposer.
2. Fixed the bug where draft_attn_metadatas was lost after enabling
eagle mode in the merge graph.
3. Added the routing for pcp and disable padded drafter batch; in mtp
mode, if pcp and disable padded drafter batch are not enabled, the
normalized file eagle_proposer will be used.

RFC: https://github.com/vllm-project/vllm-ascend/issues/5467

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
ut and test

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-02-02 19:15:31 +08:00
LHXuuu
45a573cff1 [Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Support Moe model W4A8 dynamic weight.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
2026-02-02 16:39:32 +08:00
lty
082aa2e5b7 [Bugfix]The service fails to be started when the memcache pool is enabled (#6229)
### What this PR does / why we need it?
The service fails to be started when the memcache pool is enabled
without configuring the mooncake path.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
```
#memcache
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf

vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host $local_ip \
  --port 8002 \
  --served-model-name model \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enforce-eager \
  --quantization ascend \
  --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
    '{
            "kv_connector": "AscendStoreConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
	            "backend": "memcache",
                "lookup_rpc_port":"0"
            }
    }'
```

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-02 16:26:18 +08:00
Shaoxu Cheng
460ea88276 [Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425)
### What this PR does / why we need it?
- Replace the RoPE operator implementation.
- Refactor some leftover implementations of 300I DUO in the main branch.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-02 16:12:04 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
SILONG ZENG
347eb36a59 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #9) (#6135)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|`vllm_ascend/worker/model_runner_v1.py`|
|`vllm_ascend/worker/pcp_utils.py`|

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-01 23:20:20 +08:00
wangxiyuan
b4aafd4293 [Core][Misc] Clean up ProfileExecuteDuration (#6461)
### What this PR does / why we need it?
This PR removes the custom `ProfileExecuteDuration` utility and its
usages across the codebase. This utility was used for profiling
execution duration of different stages in the inference process. It is
replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`,
which integrates with PyTorch's profiler.

This change simplifies the code by removing a custom implementation in
favor of an upstream utility, improving maintainability. Associated
documentation and tests for `ProfileExecuteDuration` are also removed.

### Does this PR introduce _any_ user-facing change?
`VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now.

### How was this patch tested?
CI passed. The changes are a cleanup and replacement with a standard
utility. Existing tests cover the functionality. The removed feature had
its own tests which are also removed.

Related RFC: #5304

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:01 +08:00
fems14
775fbc4cd2 【main】【bugfix】fix: restrict default MLAPO activation to Decode nodes only (#6451)
### What this PR does / why we need it?
There is an issue with the current default logic for MLAPO (MLA Policy
Optimization). By design, MLAPO should only be enabled by default on
Decode (D) nodes. However, in hybrid (collocated prefill and decode)
scenarios, the strategy is erroneously activated during the Prefill
stage.
This PR corrects the default behavior to ensure that MLAPO is
exclusively enabled for the Decoding phase. This prevents unexpected
policy interference during Prefill and ensures optimal performance in
hybrid deployment environments.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-31 22:44:56 +08:00
Li Wang
5b0a6bcfe9 [ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6459)
This reverts commit 56f5d3bd49.

### What this PR does / why we need it?
The patch https://github.com/vllm-project/vllm-ascend/pull/6357 which
break the functionality availability in the spec_decode scenario, let's
revert and make CI happy first
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-31 16:33:34 +08:00
Qiu
638cae824d [bugfix](CP) Fix and unify the PD request discrimination logic. (#5939)
### What this PR does / why we need it?
Since the PR (https://github.com/vllm-project/vllm/pull/32118) has
modified the criteria for judging Prefill and Decode requests in vLLM,
PCPManager needs to synchronize with this standard. As PCPManager
involves multiple calculations of PD request counts, this PR attempts to
consolidate the related logic and update the PD request count once per
batch.

### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
```

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-31 10:26:02 +08:00
wubin58
4230bc8646 [Bugfix]Modify NPU rotary encoding parameter fields,fix RopeOperation setup failed in condition of self.rotary_dim < self.head_size (#6310)
### What this PR does / why we need it?
change self.head_size to self.rotary_dim.
only the rotary part is processed here, the dimension should be rotary_dim.

Fix bug #6060

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only a small section of code was modified to adjust the parameters, and
all standard tests were passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
Co-authored-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
2026-01-30 21:25:04 +08:00
Yizhou
56f5d3bd49 [Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357)
### What this PR does / why we need it?
This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, and asserts the layout
constraint before building the attention metadata. Together, these
changes prevent kernel mismatches or failures and ensure correct shapes
for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-30 16:41:44 +08:00
ChenCangtao
f2990f7741 [e2e Test][npugraph_ex]add static kernel e2e test case (#6320)
### What this PR does / why we need it?
Added an E2E test case for the scenario of enabling a static kernel for
npugraph_ex, monitoring its compilation and unloading process.
Also fixed the previously existing spelling errors

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 16:24:48 +08:00
liziyu
d252e4f5ec [P/D] Using the cache load operator to replace the index select operator. (#6295)
### What this PR does / why we need it?
Using the cache load operator to replace the index select operator.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-30 14:27:53 +08:00
Wang Kunpeng
70cc5f7969 [bugfix]fix rope_forward_triton error (#6404)
### What this PR does / why we need it?
The rope_forward_triton method reports an error.
For example:
```
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True)
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/triton/rope.py", line 155, in rope_forward_triton
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]     cos = cos.view(num_tokens, -1)
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]           ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] RuntimeError: shape '[14, -1]' is invalid for input of size 768
```
This is because an incorrect num_tokens_padded was passed in.

Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449

Co-authored-by: @zhenwenqi2024
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-01-30 14:09:00 +08:00
zxr2333
14bd55f30c [P/D][BugFix] Fix layerwise P/D request_id error (#6360)
### What this PR does / why we need it?
Fix layerwise Connector P/D request_id error, due to vllm pr:
https://github.com/vllm-project/vllm/pull/27987, which will add a random
suffix to request_id in EngineCore.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-01-29 20:19:05 +08:00
Qiu
feab047084 [bugfix](pcp,gqa) set kv_inverse_idx_for_chunk and cp_kv_recover_idx_for_chunk to None when dcp only (#6317)
### What this PR does / why we need it?
We only do restore and recover for pcp, so we should set
`kv_inverse_idx_for_chunk` and `cp_kv_recover_idx_for_chunk` to `None`
when only using dcp.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 19:35:52 +08:00
Qiu
50e0e87646 [bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344)
### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler
slicing approach. However, in the single-ops logic and mixed PD batches,
the decode slot_mapping did not eliminate the -1 and also shared the
slicing method, resulting in incorrect slot_mapping. This PR resolves
this issue, and the logic will be further consolidated in subsequent
refactoring PRs.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 16:48:37 +08:00
Sergey-Zlobin
6a7b3bc29c Qwen3-VL-MoE EAGLE support for vLLM-Ascend (#6327)
### What this PR does / why we need it?
Qwen3-VL-MoE EAGLE support for vLLM-Ascend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The patch tested with Qwen3-VL-30B-A3B-Instruct model

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Sergey_Zlobin <sirg_zlobin@mail.ru>
2026-01-29 16:44:30 +08:00
JiangWeixiang
41a52beb26 [bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325)
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where KV cache on the Prefill (P) side was not being properly released.
The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P-side pass its final request_id as
remote_request_id to the D-side via kv_transfer_param. However, during
KV cache cleanup, the D-side incorrectly used the local req_id (instead
of remote_request_id) to select the target P-side rank. This mismatch
caused the P-side KV cache to remain unreleased on certain ranks,
leading to memory leaks. This PR corrects the logic to use
remote_request_id consistently when determining the P-side rank.
### Does this PR introduce _any_ user-facing change?
No. 
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: ghphotoframe <854746559@qq.com>
2026-01-29 16:05:56 +08:00
wangxiyuan
7a5b345dc4 [Misc] Drop deepseek patch (#6288)
We patched deepseek before since we notice asserterror raised by
transformers. Now due to transformers upgrade, the patch looks useless
now. Let's remove it.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-29 14:45:50 +08:00
whx
39f8af9d96 [Main2Main][BugFix] Add shared_experts check for AscendSharedFusedMoE (#6335)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes
Qwen3-Moe models also go into `SharedFusedMoE`, while current
implementation of our `AscendSharedFusedMoE` assumes shared_experts
always exist. This PR adds checking to
`multistream_overlap_shared_expert` and `multistream_overlap_gate` in
order to only enable these features when shared experts exist.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All ci passed

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-29 08:47:20 +08:00
hucong
df588ed488 [BugFix] Disable enable_shared_expert_dp by default if tensor_parallel_size=1 (#6361)
### What this PR does / why we need it?

Disable enable_shared_expert_dp by default if tensor_parallel_size=1


- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: underfituu <hzhucong@163.com>
2026-01-28 22:01:01 +08:00
linfeng-yuan
245c1ca241 [0.14.1][bugfix][sched] fix incompatibility of RecomputeScheduler with vllm v0.14.1 (#6286)
### What this PR does / why we need it?
This PR rebases RecomputeScheduler codebase to vllm tags/v0.14.1 in
order to fix the incompatibility with vllm's original Scheduler and
AsyncScheduler. Main changes focus on multimodal model and speculative
decoding parts.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested this PR with 2P1D E2E serving test case.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-01-28 20:16:58 +08:00
Shaoxu Cheng
9fadc8df4f [Fixbugs]: fix refactor cause to 310p chunkprefill error (#6340)
Adapt modelrunner refactor change to make 310p work

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-01-28 16:41:32 +08:00
Yizhou
ac963f1519 [Fix] Adds CUDA graph stats to execution state (#6331)
### What this PR does / why we need it?
Adds a CUDA graph profiling stats field to the execution state and
updates the NPU model runner to set, unpack, and forward those stats
during execution. This preserves CUDA graph metrics across state
transitions, improving observability for later use and diagnostics.

### Does this PR introduce _any_ user-facing change?
Enable this by set
```python
    llm = LLM(
        ...
        disable_log_stats=False,
        cudagraph_metrics=True,
        ...
    )
```
or `--cudagraph-metrics` and make sure do not disable log stats.

After this, you should be able to see something like this, which is
really helpful for some light debugging:
```
[loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
[cuda_graph.py:117] **CUDAGraph Config Settings:**
[cuda_graph.py:117] 
[cuda_graph.py:117] - Mode: FULL_DECODE_ONLY
[cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32]
[cuda_graph.py:117] 
[cuda_graph.py:117] **CUDAGraph Stats:**
[cuda_graph.py:117] 
[cuda_graph.py:117] | Unpadded Tokens | Padded Tokens | Num Paddings | Runtime Mode | Count |
[cuda_graph.py:117] |-----------------|---------------|--------------|--------------|-------|
[cuda_graph.py:117] | 4               | 4             | 0            | FULL         | 18    |
[cuda_graph.py:117] | 5               | 5             | 0            | NONE         | 1     |
[cuda_graph.py:117] | 1               | 1             | 0            | FULL         | 1     |
[cuda_graph.py:117] | 18              | 18            | 0            | NONE         | 1     |
```

### How was this patch tested?
None.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-28 16:34:20 +08:00
LICO67373
379ce599d0 [Bugfix] Add missing draft_attn_metadatas parameter to fix MTP test (#6232)
### What this PR does / why we need it?

Fix the MTP test failure caused by accessing non-existent attribute
`forward_context.draft_attn_metadatas`.

**Root cause:**
In `AscendAttentionBackendImpl.update_graph_params`, the code
incorrectly accessed `forward_context.draft_attn_metadatas`, but
`ForwardContext` class doesn't have this attribute. The original code
passed this value via function parameter.

**Fix:**
Add `draft_attn_metadatas` parameter to the entire call chain:
- `update_full_graph_params` function in `acl_graph.py`
- All `update_graph_params` methods in attention backends
- Pass the parameter correctly in `eagle_proposer.py`

Also applied Gemini's suggestion to make `vllm_config=None` in
`AscendAttentionCPImpl.update_graph_params` for API consistency.

Related to item 9 in #5463

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This fixes the CI test failure:

`test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`

Signed-off-by: lico67373 <918688502@qq.com>
2026-01-28 14:41:18 +08:00
Wang Kunpeng
c498cea22d [refactor] refactor excute_model and _dymmy_run method (#6043)
### What this PR does / why we need it?
The structure of the `excute_model` and `_dymmy_run` methods in
NPUModelRunner differs greatly from that in GPUModelRunner.
Achieve alignment with GPUModelRunner:
Split the `_prepare_inputs` method into `_prepare_inputs`,
`_determine_batch_execution_and_padding`, `_build_attention_metadata`,
and `_preprocess`.
Modify `_generate_process_reqs_hidden_states` to `_model_forward`.
Align the implementation of the `postprocess` phase

**Related-RFC**: https://github.com/vllm-project/vllm-ascend/issues/5449

**Co-authored-by**: @zhenwenqi2024 
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-27 22:27:01 +08:00
TMC
41eb71d665 [Refactor] profiler config optimze (#6141)
### What this PR does / why we need it?
This PR optimizes the torch_npu profiler configuration to significantly
reduce overhead and trace file size. The key changes include:
**Enable Data Simplification**: Explicitly sets data_simplification=True
in _ExperimentalConfig. This filters out unnecessary intermediate data
during profiling, drastically reducing the memory footprint and I/O
overhead.
**Use Lightweight Stack Tracing**: Replaces with_stack with with_modules
when torch_profiler_with_stack is enabled. In torch_npu, with_stack
introduces heavy latency. with_modules provides equivalent semantic
information with much lower overhead.
**Code Simplification:** Removes redundant parameter configurations in
_ExperimentalConfig by utilizing default values, making the codebase
cleaner and easier to maintain.

**Test setup:**
 max length = 50, profiler + stack enabled

**Before optimization:**
Profiler data size: 651 MB
Generate time: 3 seconds

**After optimization:**
Profiler data size: 156 MB (≈76% reduction)
Generate time: <1 second

### Does this PR introduce _any_ user-facing change?
No API changes. Users profiling on Ascend will experience faster
profiling execution and smaller trace files when stack tracing is
enabled.
### How was this patch tested?
Manually verified on Ascend NPU by running vLLM with the profiler
enabled. Confirmed that trace files are generated correctly containing
necessary stack/module info, while showing the reported reduction in
size and time.

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: mengchengTang <745274877@qq.com>
2026-01-27 22:09:50 +08:00
CodeCat
54e8389f8e [Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (#6006)
### What this PR does / why we need it?
This PR builds upon PR
https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to
further enhance the npu_graph_ex_passes module. Based on prior work, we
have added graph optimization support for the add_rms_quant fused
operator in scenarios where a bias term is present—ensuring the fusion
pattern is correctly registered and matched into the computation graph.

This time, we performed the operator fusion of MatmulAllReduceAddRMSNorm
and added corresponding ST test cases for regression monitoring.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: cjian <2318164299@qq.com>
2026-01-27 16:41:48 +08:00
pu-zhe
57fd6e4bd9 [Refact.]: refactoring 310p-kv cache allocator, align with main branch (#6270)
### What this PR does / why we need it?
refactoring 310p-kv cache allocator, align with main branch

vLLM version: v0.14.0
vLLM main: https://github.com/vllm-project/vllm-ascend/pull/6270
Qwen2.5-7B E2E Test

---------

Signed-off-by: pu-zhe <puzhe1@h-partners.com>
Signed-off-by: pu-zhe <zpuaa@outlook.com>
Co-authored-by: pu-zhe <puzhe1@h-partners.com>
2026-01-27 16:26:48 +08:00
Angazenn
5e34c70ffc [Misc] Removes unnecessary graph size re-initialization (#6280)
### What this PR does / why we need it?

This PR removes `update_default_aclgraph_sizes`. In earlier versions, we
add this function to change default `cudagraph_capture_sizes` because
`_npu_paged_attention` degrades significantly on certain shapes (which
is included in default `cudagraph_capture_sizes` of VLLM). Now since we
use FIA as default attention op (which does not contain such performance
degradation), there is no need to add this default change. Otherwise, it
could cause some conflicts if we set a small `cudagraph_capture_sizes`
that < 20 now.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-01-27 14:38:07 +08:00
meihanc
fea197ad50 [Main2Main] Upgrade vllm commit to 0123 (#6169)
### What this PR does / why we need it?
1.  Upgrade vllm commit to: 0115
(8471b27df97c3eb79f891802fc0e858f8f7ac6a0)
Modify import paths due to the refactors:
https://github.com/vllm-project/vllm/pull/32245
https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913
2. Upgrade vllm commit to: 0119
(9a1f16da1e423ede2c2f52a9850cbfbb39cefe96)
Fix `WorkerProc.__init__() missing 1 required positional argument:
'is_driver_worker'` due to
https://github.com/vllm-project/vllm/pull/28506
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569
3. Upgrade vllm commit to:
0120(148117ea2e689cd43df4be6892671a17cdae5833)
1. Add `skip_compiled` param in `set_forward_context` due to
https://github.com/vllm-project/vllm/pull/30385
2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to
https://github.com/vllm-project/vllm/pull/24322
change `self.max_num_tokens =
vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size`
3. Modify UT import paths due to the
refactors:https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946
4. Upgrade vllm commit to:
0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9)
1. vLLM switched `uses_mrope` from target to draft model config, making
`positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's
direct self.positions access and tests missing
`draft_model_config.uses_mrope`.
https://github.com/vllm-project/vllm/pull/32048
2. Moved bs_to_padded_graph_size from CompilationConfig to
CudagraphDispatcher due to the refactor
https://github.com/vllm-project/vllm/pull/30143
3. Remove unused `maybe_setup_kv_connector` due to
https://github.com/vllm-project/vllm/pull/32077
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834
6. Upgrade vllm commit to:
0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5)
Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig
due to https://github.com/vllm-project/vllm/pull/32414
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054
8. Upgrade vllm commit to:
0123(dc917cceb877dfd13f98c538c4c96158047d98bd)
Setting temperature=0.0 due to the removal of the default temperature
value in https://github.com/vllm-project/vllm/pull/32723
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wjunLu <wjunlu217@gmail.com>
2026-01-27 08:44:36 +08:00
Mercykid-bash
29fb27d3bb BugFix: Fix moe_load accumulation error in ACL graph mode (#6182)
This PR fixes the numerical error in moe_load accumulation under ACL
graph mode on NPU: using += for NPU tensors in graph mode does not throw
errors but leads to incorrect values, so we replace it with the in-place
add_() method to ensure accurate calculation.

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
2026-01-26 17:18:46 +08:00
Canlin Guo
2d3b8a51f9 [Patch] Remove the patch of ECExampleConnector (#5976)
### What this PR does / why we need it?
Part of #5304.

https://github.com/vllm-project/vllm/pull/30225 has been merged now. We
don't need this patch anymore.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-26 17:10:03 +08:00
Jingchun Gao
b390e0ef78 [Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (#5416)
- Fixed the computing of final hidden_states when enabling pipeline
parallel and prefill context parallel at the same time. Only in the last
PP rank, hidden_states are required and have right tensor type.
- Fixed the shape of intermediate_tensors in the dummy_run when enabling
pipeline parallel and flashcomm1. The intermediate_tensors should be
divided by tp_size. Otherwise, the moe will raise issues.
- Fixed the shape of self.intermediate_tensors for sufficient slice
space

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
2026-01-26 16:53:07 +08:00
ChenCangtao
1645546661 [bugfix][npugraph_ex]fix static kernel uninstall issue (#6128)
### What this PR does / why we need it?

The static kernel in torch_npu is uninstalled through Python's atexit
mechanism.
However, in vllm-ascend, when inference ends or the service stops, the
worker process is terminated. This way, ending the process does not
trigger the atexit mechanism, causing the static kernel not to be
unloaded.
When using the nougraph_ex backend and enabling the static kernel, we
registered a signal handler to explicitly unload the static kernel.
When there are many static kernels, unloading usually takes some time,
whereas vllm will directly kill the process after sending a terminate
event. Therefore, we choose to handle it by starting a new process.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-26 15:03:18 +08:00
Nengjun Ma
f910cebe04 [Doc] 310P Documents update (#6246)
### What this PR does / why we need it?
310P support guides updates, as currently has supported in main branch.

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-26 14:33:21 +08:00
yuxinshan
0bb1f91c2c [Feature] Mooncake connector get remote ptp size (#5822)
### What this PR does / why we need it?
To support elastic scaling when using mooncake connector, we should
support to **configure different tp sizes for different nodes**.
As a result, we transfer the prefill node information, such as tp size,
through **the request's kv_transfer_params**.
The decode nodes **get the prefill tp size** through the request's
kv_transfer_params, instead of getting it from the configuration of the
mooncake connector .

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
2026-01-26 14:28:33 +08:00
LI SHENGYONG
611e223b7d [EPLB][Bugfix] EPLB support fp/bf16 (#5531)
### What this PR does / why we need it?
EPLB support dtype of fp/bf16.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
w8a8_dynamic Baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

w8a8_dynamic eplb:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

The fp16 conversation is normal.
The fp16 test is in progress.

Baseline fp16
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

eplb fp16
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-26 14:28:16 +08:00
Li Wang
c26ad78f86 [CI][lint] Add rule codespell back (#6236)
### What this PR does / why we need it?
After removing codepsell a while, we discovered that typo had a problem
correctly recognizing certain misspelled words, so I suggested adding it
back.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 14:12:33 +08:00
Canlin Guo
65289676b4 [Refactor] Separate _prepare_inputs to _prepare_inputs and _preprocess (#6191)
### What this PR does / why we need it?

Align with upstream vLLM. This PR will help downstream vLLM-Omni reduce
the cost for maintaining the _prepare_inputs. Besides, it helps
vLLM-Ascend code more readable. In the future, we can follow closer to
vLLM. The `preprocess` logic is same as GPUModelRunner. We don't need to
maintain it anymore.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-26 14:05:23 +08:00
Shanshan Shen
76ac688388 [MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204)
### What this PR does / why we need it?

Currently, we pad the last dim of qkv to 128 before flash attention (in
`AscendMMEncoderAttention`) to get better performance on Ascend NPU.
However, the qkv padding is executed serially, which may lead to more
overhead when launching `aclnnConstantPadNd` (launch 3 times).

Since the three operations are mutually independent, we stack qkv first
and then pad them in one kernel launch. With this optimization, **TTFT**
has been reduced by **3.15%**, **peak throughput** has been increased by
**4.20%**.

---

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Launch the server:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Run benchmark:

```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 1000 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  122.33    
Total input tokens:                      66638     
Total generated tokens:                  122845    
Request throughput (req/s):              8.17      
Output token throughput (tok/s):         1004.18   
Peak output token throughput (tok/s):    3073.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          1548.90   
---------------Time to First Token----------------
Mean TTFT (ms):                          51757.16  
Median TTFT (ms):                        44853.42  
P99 TTFT (ms):                           110700.14 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          226.06    
Median TPOT (ms):                        206.85    
P99 TPOT (ms):                           935.31    
---------------Inter-token Latency----------------
Mean ITL (ms):                           208.82    
Median ITL (ms):                         96.37     
P99 ITL (ms):                            2183.13   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  121.47    
Total input tokens:                      66638     
Total generated tokens:                  122860    
Request throughput (req/s):              8.23      
Output token throughput (tok/s):         1011.47   
Peak output token throughput (tok/s):    3202.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          1560.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          50125.08  
Median TTFT (ms):                        46270.85  
P99 TTFT (ms):                           108107.12 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          227.11    
Median TPOT (ms):                        205.13    
P99 TPOT (ms):                           816.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           204.60    
Median ITL (ms):                         92.66     
P99 ITL (ms):                            2219.02   
==================================================
```
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 10:20:24 +08:00
huangning1995
ce11fd49f3 [Feature] Batch invariant torch.compile (#6107)
### What this PR does / why we need it?
Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to
enable batch-invariant in vllm-ascend, we observed that the performance
of BI in eager mode remains suboptimal.

This PR further integrates batch-invariant with torch.compile, which
improves inference performance by 350% when tested with Qwen3-0.6B.

### Does this PR introduce _any_ user-facing change?
Previously, enabling both aclgraph and Batch-Invariant would cause an
"ub overflow" error. This occurred because transposed input tensors
could produce incorrect stride() values.

To fix this, we now call .contiguous() on the input tensors before
passing them to Triton kernels. This ensures a contiguous memory layout
and prevents transposed tensors from causing incorrect stride
calculations.

### Test Plan
pytest -sv --durations=0
tests/e2e/singlecard/test_aclgraph_batch_invariant.py

### Test Result
```
============================================================================ slowest durations ============================================================================
87.37s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle
77.39s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN
74.04s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail
73.59s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation

(8 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================
```
### Performance
export VLLM_BATCH_INVARIANT=1
vllm serve /home/Qwen3-0.6B \
--served-model-name qwen \
--port 8000 \
--max-num-seqs 256 \
--tensor-parallel-size 1 \
--max-model-len 5500 \
--max-num-batched-tokens 5500 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.9 \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY",
"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
--additional-config
'{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}'

vllm bench serve --served-model-name qwen --trust-remote-code --backend
vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name
random --random-input-len 512 --random-output-len 256 --num-prompts 800
--max-concurrency 8

torch.compile batch invariant performance:
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  477.21    
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         429.16    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1287.48   
---------------Time to First Token----------------
Mean TTFT (ms):                          285.53    
Median TTFT (ms):                        312.70    
P99 TTFT (ms):                           324.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.59     
Median TPOT (ms):                        17.50     
P99 TPOT (ms):                           18.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.59     
Median ITL (ms):                         17.45     
P99 ITL (ms):                            18.76     
==================================================
```
Eager
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  1694.70   
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         120.85    
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          362.54    
---------------Time to First Token----------------
Mean TTFT (ms):                          164.29    
Median TTFT (ms):                        129.71    
P99 TTFT (ms):                           1961.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.81     
Median TPOT (ms):                        65.15     
P99 TPOT (ms):                           72.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81     
Median ITL (ms):                         64.64     
P99 ITL (ms):                            75.72     
==================================================
```

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: huangning1995 <huangning12@huawei.com>
2026-01-26 09:15:06 +08:00
linfeng-yuan
96309e2b79 [ops] support advanced apply_top_k_top_p without top_k constraint (#6098)
### What this PR does / why we need it?
Implement `apply_top_k_top_p` via ascendC to eliminate the constraint of
k [1,1024]. It enables high performance TopKTopP calculation and avoid
D2H synchronization introduced by k validation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E serving with `k=4096` and  `p=0.95`
- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2026-01-26 09:08:42 +08:00
wangxiyuan
4e3919e965 Reapply "[Refactor] Unify full-graph parameter update logic (#6041)" (#6227) (#6231)
This reverts commit 95649344aa.

The CI failure doesn't related to this change. Let's reapply it.

- vLLM version: v0.14.0
- vLLM main:
d68209402d
2026-01-26 09:04:54 +08:00
Li Wang
63adbedb7a [Worker] Implement update max_model_len interface for NPUWorker (#6193)
### What this PR does / why we need it?
This patch purpose to add the `update_max_model_len` interface.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:03:33 +08:00
drslark
384d84c7ef [Bugfix] Avoided a bug of drafter when dp and sp are enabled (#6226)
### What this PR does / why we need it?

Avoided a bug of drafter when `dp` and `sp` are enabled.

Specifically, disable `sp` when drafter is dense.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

An aisbench test:

```shell
python3 aisbench_test.py --input_len 3500 --output_len 1000 --data_num 100 --concurrency 320 --request_rate 8
```

The result is okay.

```text
[2026-01-24 22:38:20,256] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Calculate global interval offsets time: 0.5922 s
01/24 22:38:20 - AISBench - INFO - Process 0 using precomputed sleep offsets with 100 requests
Process-0 pid:220279: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [09:40<00:00,  5.81s/it]
Pid:      220279 | Post:        100 | Received:    100 | Failed:        0 | Post Time:12.51s | Receive Time:580.92s: 
Encoding output text...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 93.75it/s]
01/24 22:48:02 - AISBench - INFO - Start converting origin data to detailed data ...
01/24 22:48:02 - AISBench - INFO - Finish converting origin data to detailed data█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 95.08it/s]
01/24 22:48:02 - AISBench - INFO - Added 'Actual RPS: After Excluding Anomalies' to group 'Time - RPS: ' in legend explanation table
01/24 22:48:02 - AISBench - INFO - Successfully merged chart into position (1, 1)
01/24 22:48:02 - AISBench - INFO - RPS distribution charts saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html
01/24 22:48:02 - AISBench - INFO - Updated chart with actual RPS saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html
[2026-01-24 22:48:02,557] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Start extracting pref datas ...
[2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Finish extracting pref datas!
[2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dumping detail perf data ...
Dumping data to h5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 75.31it/s]
[2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dump detail perf data cost: 0.02995561994612217(s)
[2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Performance task finished, results saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat
01/24 22:48:02 - AISBench - INFO - time elapsed: 586.32s
Running tasks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [09:55<00:00, 595.91s/it]
01/24 22:48:05 - AISBench - INFO - Performance evaluation tasks completed.
01/24 22:48:05 - AISBench - INFO - Loading detail perf data of model='vllm-api-stream-chat' dataset='gsm8kdataset' ...
01/24 22:48:05 - AISBench - INFO - Starting request timeline processing...
01/24 22:48:05 - AISBench - INFO - Data preprocessing completed in 0.0004s
01/24 22:48:05 - AISBench - INFO - Generating timeline traces for 100 requests...
01/24 22:48:05 - AISBench - INFO - Generated timeline trace chunks in 0.0441s
01/24 22:48:05 - AISBench - INFO - Generating concurrency traces...
01/24 22:48:05 - AISBench - INFO - Generated concurrency trace chunks in 0.0011s
01/24 22:48:05 - AISBench - INFO - Creating figure layout...
01/24 22:48:05 - AISBench - INFO - Figure layout created in 0.0504s
01/24 22:48:05 - AISBench - INFO - Writing to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html...
01/24 22:48:05 - AISBench - INFO - HTML written in 0.0181s
01/24 22:48:05 - AISBench - INFO - Completed! Total execution time: 0.1148s
01/24 22:48:05 - AISBench - INFO - The gsm8kdataset_plot has been saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html
01/24 22:48:05 - AISBench - INFO - Converting perf results of stage ...
01/24 22:48:05 - AISBench - INFO - Finish Converting!
01/24 22:48:05 - AISBench - INFO - Start calculating metrics ...
01/24 22:48:05 - AISBench - INFO - Start calculating common metrics ...
01/24 22:48:05 - AISBench - INFO - Start calculating add units ...
01/24 22:48:05 - AISBench - INFO - Finish calculating perf data!
01/24 22:48:05 - AISBench - INFO - Summarizing performance results...
01/24 22:48:05 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset: 
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 300806.1781 ms │ 189326.0489 ms │ 568345.5121 ms │ 380629.6785 ms │ 384208.3527 ms │ 385363.7709 ms │ 566871.7684 ms │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 107441.2231 ms │ 343.8054 ms    │ 378132.3979 ms │ 188817.4877 ms │ 190985.8451 ms │ 192547.6847 ms │ 378008.356 ms  │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 193.5585 ms    │ 185.1008 ms    │ 197.262 ms     │ 193.8146 ms    │ 195.0803 ms    │ 196.0323 ms    │ 196.9688 ms    │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 194.2067 ms    │ 0.0108 ms      │ 2782.7124 ms   │ 184.9998 ms    │ 194.2631 ms    │ 221.2895 ms    │ 304.363 ms     │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3506.86        │ 3431.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 3.7745 token/s │ 1.7595 token/s │ 5.2819 token/s │ 2.6272 token/s │ 5.1028 token/s │ 5.1502 token/s │ 5.2754 token/s │ 100 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ total   │ 580456.2704 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ total   │ 100              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ total   │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ total   │ 100              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ total   │ 51.8224          │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ total   │ 320              │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ total   │ 0.1723 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ total   │ 350686           │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ total   │ 32.6398 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ total   │ 100000           │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ total   │ 604.1558 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ total   │ 172.2783 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ total   │ 776.434 token/s  │
╘══════════════════════════╧═════════╧══════════════════╛
01/24 22:48:05 - AISBench - INFO - Performance Result files locate in outputs/default/20260124_223809/performances/vllm-api-stream-chat.
```
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-25 17:45:29 +08:00
Canlin Guo
b45bd92c2b [Bugfix] Add defensive check for multimodal_config (#6230)
### What this PR does / why we need it?

In vLLM-Omni, there exists the empty `ModelConfig`. We need to add a
check before accessing the sub-field of model_config.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Will checked by CI.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-25 17:39:19 +08:00
wangxiyuan
95649344aa Revert "[Refactor] Unify full-graph parameter update logic (#6041)" (#6227)
This reverts commit 8966a99710.

It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`

- vLLM version: v0.14.0
- vLLM main:
d68209402d
2026-01-25 15:25:38 +08:00