Commit Graph

2339 Commits

Author SHA1 Message Date
Mercykid-bash
29fb27d3bb BugFix: Fix moe_load accumulation error in ACL graph mode (#6182)
This PR fixes the numerical error in moe_load accumulation under ACL
graph mode on NPU: using += for NPU tensors in graph mode does not throw
errors but leads to incorrect values, so we replace it with the in-place
add_() method to ensure accurate calculation.
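
A minimal sketch of the change (variable names are assumed, not the actual
vllm-ascend code):

```python
# In ACL graph mode, "+=" may rebind the Python name instead of updating the
# tensor captured by the graph, so the accumulated values silently go stale.
import torch

moe_load = torch.zeros(8, dtype=torch.int64)   # per-expert load counter
step_load = torch.ones(8, dtype=torch.int64)   # load observed this step

# Problematic under graph capture:
# moe_load += step_load

# Fix: update the captured tensor in place.
moe_load.add_(step_load)
```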

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
2026-01-26 17:18:46 +08:00
Canlin Guo
2d3b8a51f9 [Patch] Remove the patch of ECExampleConnector (#5976)
### What this PR does / why we need it?
Part of #5304.

https://github.com/vllm-project/vllm/pull/30225 has been merged now. We
don't need this patch anymore.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-26 17:10:03 +08:00
Jingchun Gao
b390e0ef78 [Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (#5416)
- Fixed the computation of the final hidden_states when pipeline parallel
and prefill context parallel are enabled at the same time. Only on the last
PP rank are hidden_states required and of the right tensor type.
- Fixed the shape of intermediate_tensors in the dummy_run when pipeline
parallel and flashcomm1 are enabled. The intermediate_tensors should be
divided by tp_size; otherwise the MoE layers raise errors (see the sketch
below).
- Fixed the shape of self.intermediate_tensors to provide sufficient slice
space
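
A minimal sketch of the dummy-run shape fix, assuming the token dimension is
what gets sliced by tp_size (names and shapes are illustrative only):

```python
# With flashcomm1, each TP rank holds only 1/tp_size of the tokens, so the
# dummy-run intermediate tensors are allocated with the sliced token dimension.
import torch

num_tokens, hidden_size, tp_size = 128, 4096, 4
per_rank_tokens = num_tokens // tp_size

intermediate_tensors = {
    "hidden_states": torch.empty(per_rank_tokens, hidden_size),
    "residual": torch.empty(per_rank_tokens, hidden_size),
}
```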

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
2026-01-26 16:53:07 +08:00
yuxinshan
7d119df2a9 [Feat] proxy delay to remove instances (#5934)
### What this PR does / why we need it?
Instances should only be removed from the proxy when it is not processing
requests.
But sometimes we need to **isolate** faulty nodes while a large number of
**requests** are coming in.
So we now **isolate** faulty nodes by **lowering their priority** and
**delete** them once the proxy is no longer processing requests.

### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
  -H "Content-Type: application/json" \
  -d '{
        "type": "decode",
        "instances": "127.0.0.1:8201"
      }'
```
There are 2 situations:
* [New] When the proxy is processing requests, the nodes are isolated and
removed once the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: yuxinshan <syx_ctyg@126.com>
2026-01-26 16:29:45 +08:00
Li Wang
de095c5fed [CI] Add workflow_dispatch for nightly image build (#6269)
### What this PR does / why we need it?
Currently, the nightly image is built at 20:00 and 23:00 (UTC+8). Due to
some timeliness requirements, we need an additional trigger method for
nightly image builds.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 15:56:38 +08:00
ChenCangtao
1645546661 [bugfix][npugraph_ex]fix static kernel uninstall issue (#6128)
### What this PR does / why we need it?

The static kernel in torch_npu is uninstalled through Python's atexit
mechanism.
However, in vllm-ascend the worker process is terminated when inference
ends or the service stops, so the atexit mechanism never runs and the
static kernel is not unloaded.
When the npugraph_ex backend is used with the static kernel enabled, we
register a signal handler that explicitly unloads the static kernel.
Because unloading usually takes some time when there are many static
kernels, whereas vllm kills the process directly after sending the
terminate event, we hand the unloading off to a new process.
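
A hedged sketch of the approach (illustrative names; the real unload call
lives in torch_npu and is omitted here):

```python
# Register a SIGTERM handler that delegates the potentially slow static-kernel
# unload to a detached child process, so cleanup still happens even though the
# worker exits immediately after the terminate event.
import os
import signal
import subprocess
import sys

def _on_terminate(signum, frame):
    # Placeholder command; the real implementation would invoke the torch_npu
    # static-kernel uninstall routine in the child process.
    subprocess.Popen(
        [sys.executable, "-c", "print('unloading static kernels')"],
        start_new_session=True,
    )
    os._exit(0)

signal.signal(signal.SIGTERM, _on_terminate)
```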

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-26 15:03:18 +08:00
Nengjun Ma
f910cebe04 [Doc] 310P Documents update (#6246)
### What this PR does / why we need it?
Update the 310P support guides, since 310P support has now landed on the
main branch.

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-26 14:33:21 +08:00
yuxinshan
0bb1f91c2c [Feature] Mooncake connector get remote ptp size (#5822)
### What this PR does / why we need it?
To support elastic scaling with the mooncake connector, we need to support
**configuring different tp sizes for different nodes**.
To do so, we transfer prefill-node information, such as its tp size,
through **the request's kv_transfer_params**.
The decode nodes then **get the prefill tp size** from the request's
kv_transfer_params instead of from the mooncake connector configuration.
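
A hedged sketch of the decode-side lookup (the key name and function are
assumptions for illustration, not the exact connector fields):

```python
# The decode node reads the prefill node's tp size from the request's
# kv_transfer_params instead of its own Mooncake connector config, so prefill
# and decode instances may be deployed with different tp sizes.
def get_prefill_tp_size(request, default_tp_size: int) -> int:
    params = getattr(request, "kv_transfer_params", None) or {}
    return int(params.get("remote_tp_size", default_tp_size))
```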

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
2026-01-26 14:28:33 +08:00
LI SHENGYONG
611e223b7d [EPLB][Bugfix] EPLB support fp/bf16 (#5531)
### What this PR does / why we need it?
Add EPLB support for the fp16/bf16 dtypes.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
w8a8_dynamic Baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

w8a8_dynamic eplb:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

The fp16 conversation output is normal. The fp16 test results are below.

Baseline fp16
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

eplb fp16
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-26 14:28:16 +08:00
wangxiyuan
52d4acfa51 [Doc] add release note for v0.14.0rc1 (#6225)
Add release note for v0.14.0rc1

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-26 14:22:40 +08:00
dependabot[bot]
1f26f83e34 [CI] Bump actions/checkout from 4 to 6 (#6255)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-26 14:21:00 +08:00
dependabot[bot]
ae71c4237e [CI] Bump actions/setup-python from 6.1.0 to 6.2.0 (#6256)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 6.1.0 to 6.2.0.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-26 14:20:14 +08:00
Li Wang
c26ad78f86 [CI][lint] Add rule codespell back (#6236)
### What this PR does / why we need it?
After removing codespell for a while, we found that the typos checker had
trouble correctly recognizing certain misspelled words, so I suggest adding
codespell back.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 14:12:33 +08:00
wangxiyuan
f4abd9b7b5 [CI] Fix 310p image build (#6259)
Fix 310p docker image build error

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-26 14:11:56 +08:00
Canlin Guo
65289676b4 [Refactor] Separate _prepare_inputs to _prepare_inputs and _preprocess (#6191)
### What this PR does / why we need it?

Align with upstream vLLM. This PR helps downstream vLLM-Omni reduce the
cost of maintaining _prepare_inputs, and it also makes the vLLM-Ascend code
more readable. In the future we can follow vLLM more closely. The
`preprocess` logic is the same as in GPUModelRunner, so we no longer need
to maintain it ourselves.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-26 14:05:23 +08:00
Shanshan Shen
e3eefdecbd [Doc] Update max_tokens to max_completion_tokens in all docs (#6248)
### What this PR does / why we need it?

Fix:

```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```
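
For reference, a minimal client-side example of the documented change (the
model name and endpoint are placeholders, assuming a recent openai client):

```python
# Use max_completion_tokens instead of the deprecated max_tokens when calling
# the OpenAI-compatible chat completions API served by vLLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",           # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    max_completion_tokens=64,                     # replaces max_tokens
)
print(resp.choices[0].message.content)
```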

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 11:57:40 +08:00
Shaoxu Cheng
418fccf0bc [310P]: fix 310p image cannot build (#6238)
Fix 310p image build error

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-01-26 11:37:19 +08:00
Shanshan Shen
76ac688388 [MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204)
### What this PR does / why we need it?

Currently, we pad the last dim of qkv to 128 before flash attention (in
`AscendMMEncoderAttention`) to get better performance on Ascend NPU.
However, the qkv padding is executed serially, which adds overhead from
launching `aclnnConstantPadNd` three times.

Since the three operations are mutually independent, we stack qkv first
and then pad them in one kernel launch. With this optimization, **TTFT**
is reduced by **3.15%** and **peak throughput** is increased by **4.20%**.
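
A minimal sketch of the idea (shapes and names are illustrative, not the
actual AscendMMEncoderAttention code):

```python
# Pad q, k and v to head_dim 128 in a single kernel launch by stacking them
# first, instead of issuing three separate pad launches.
import torch
import torch.nn.functional as F

q = torch.randn(1024, 16, 80)   # (tokens, heads, head_dim), head_dim < 128
k = torch.randn(1024, 16, 80)
v = torch.randn(1024, 16, 80)

# Serial version: three pad launches.
# q, k, v = (F.pad(t, (0, 128 - t.shape[-1])) for t in (q, k, v))

# Fused version: one stack, one pad, then unbind back into q, k, v.
qkv = torch.stack((q, k, v))                 # (3, tokens, heads, head_dim)
qkv = F.pad(qkv, (0, 128 - qkv.shape[-1]))   # single pad over the last dim
q, k, v = qkv.unbind(0)
```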

---

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Launch the server:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Run benchmark:

```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 1000 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  122.33    
Total input tokens:                      66638     
Total generated tokens:                  122845    
Request throughput (req/s):              8.17      
Output token throughput (tok/s):         1004.18   
Peak output token throughput (tok/s):    3073.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          1548.90   
---------------Time to First Token----------------
Mean TTFT (ms):                          51757.16  
Median TTFT (ms):                        44853.42  
P99 TTFT (ms):                           110700.14 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          226.06    
Median TPOT (ms):                        206.85    
P99 TPOT (ms):                           935.31    
---------------Inter-token Latency----------------
Mean ITL (ms):                           208.82    
Median ITL (ms):                         96.37     
P99 ITL (ms):                            2183.13   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  121.47    
Total input tokens:                      66638     
Total generated tokens:                  122860    
Request throughput (req/s):              8.23      
Output token throughput (tok/s):         1011.47   
Peak output token throughput (tok/s):    3202.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          1560.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          50125.08  
Median TTFT (ms):                        46270.85  
P99 TTFT (ms):                           108107.12 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          227.11    
Median TPOT (ms):                        205.13    
P99 TPOT (ms):                           816.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           204.60    
Median ITL (ms):                         92.66     
P99 ITL (ms):                            2219.02   
==================================================
```
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 10:20:24 +08:00
huangning1995
ce11fd49f3 [Feature] Batch invariant torch.compile (#6107)
### What this PR does / why we need it?
Building upon https://github.com/vllm-project/vllm-ascend/pull/5517, which
enables batch-invariant (BI) mode in vllm-ascend, we observed that BI
performance in eager mode remains suboptimal.

This PR further integrates batch-invariant mode with torch.compile, which
improves inference performance by 350% when tested with Qwen3-0.6B.

### Does this PR introduce _any_ user-facing change?
Previously, enabling both aclgraph and Batch-Invariant would cause an
"ub overflow" error. This occurred because transposed input tensors
could produce incorrect stride() values.

To fix this, we now call .contiguous() on the input tensors before
passing them to Triton kernels. This ensures a contiguous memory layout
and prevents transposed tensors from causing incorrect stride
calculations.
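
A minimal sketch of the fix (the kernel call itself is omitted; names are
illustrative):

```python
# A transposed view carries swapped strides; materializing it with
# .contiguous() gives the Triton kernel a plain row-major layout, so its
# stride arithmetic stays valid under aclgraph capture.
import torch

def prepare_for_kernel(x: torch.Tensor) -> torch.Tensor:
    if not x.is_contiguous():
        x = x.contiguous()
    return x  # hand x to the Triton kernel here
```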

### Test Plan
pytest -sv --durations=0
tests/e2e/singlecard/test_aclgraph_batch_invariant.py

### Test Result
```
============================================================================ slowest durations ============================================================================
87.37s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle
77.39s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN
74.04s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail
73.59s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation

(8 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================
```
### Performance
export VLLM_BATCH_INVARIANT=1
vllm serve /home/Qwen3-0.6B \
--served-model-name qwen \
--port 8000 \
--max-num-seqs 256 \
--tensor-parallel-size 1 \
--max-model-len 5500 \
--max-num-batched-tokens 5500 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.9 \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY",
"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
--additional-config
'{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}'

vllm bench serve --served-model-name qwen --trust-remote-code --backend
vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name
random --random-input-len 512 --random-output-len 256 --num-prompts 800
--max-concurrency 8

torch.compile batch invariant performance:
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  477.21    
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         429.16    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1287.48   
---------------Time to First Token----------------
Mean TTFT (ms):                          285.53    
Median TTFT (ms):                        312.70    
P99 TTFT (ms):                           324.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.59     
Median TPOT (ms):                        17.50     
P99 TPOT (ms):                           18.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.59     
Median ITL (ms):                         17.45     
P99 ITL (ms):                            18.76     
==================================================
```
Eager
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  1694.70   
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         120.85    
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          362.54    
---------------Time to First Token----------------
Mean TTFT (ms):                          164.29    
Median TTFT (ms):                        129.71    
P99 TTFT (ms):                           1961.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.81     
Median TPOT (ms):                        65.15     
P99 TPOT (ms):                           72.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81     
Median ITL (ms):                         64.64     
P99 ITL (ms):                            75.72     
==================================================
```

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: huangning1995 <huangning12@huawei.com>
2026-01-26 09:15:06 +08:00
linfeng-yuan
96309e2b79 [ops] support advanced apply_top_k_top_p without top_k constraint (#6098)
### What this PR does / why we need it?
Implement `apply_top_k_top_p` via ascendC to eliminate the constraint of
k ∈ [1, 1024]. It enables high-performance TopKTopP calculation and avoids
the D2H synchronization introduced by k validation.
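
For reference, a pure-torch sketch of the semantics the op computes (this is
not the ascendC kernel, just the reference behavior):

```python
# Mask logits outside the top-k, then outside the top-p probability mass.
import torch

def apply_top_k_top_p(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    # top-k: keep only values >= the k-th largest per row
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # top-p: drop the tail once the mass before a token already exceeds p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    exceeds = (probs.cumsum(dim=-1) - probs) > p
    sorted_logits = sorted_logits.masked_fill(exceeds, float("-inf"))
    return logits.scatter(-1, sorted_idx, sorted_logits)
```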

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E serving with `k=4096` and  `p=0.95`
- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2026-01-26 09:08:42 +08:00
wangxiyuan
4e3919e965 Reapply "[Refactor] Unify full-graph parameter update logic (#6041)" (#6227) (#6231)
This reverts commit 95649344aa.

The CI failure is not related to this change. Let's reapply it.

- vLLM version: v0.14.0
- vLLM main:
d68209402d
2026-01-26 09:04:54 +08:00
Li Wang
c38c838d03 [CI] Decrease Qwen3 dense model output throughput baseline to make ci happy (#6233)
### What this PR does / why we need it?
As
https://github.com/vllm-project/vllm-ascend/actions/runs/21327913593/job/61388195448
shows, I encountered two CI failures. The results consistently pointed to a
reduced output throughput, so the baseline is lowered from 1600 to 1514.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:04:13 +08:00
Li Wang
63adbedb7a [Worker] Implement update max_model_len interface for NPUWorker (#6193)
### What this PR does / why we need it?
This patch adds the `update_max_model_len` interface to NPUWorker.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:03:33 +08:00
Li Wang
ca297eb57f [CI] Migrate e2e test runner to hk (#5344)
### What this PR does / why we need it?
This patch adds new runner labels for the HK region, and e2e single-card
testing has been migrated to this runner.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:00:51 +08:00
wangxiyuan
99bdd7363c [CI] update vLLM to 0.14.1 (#6222)
Upgrade vLLM to 0.14.1
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-25 17:52:16 +08:00
drslark
384d84c7ef [Bugfix] Avoided a bug of drafter when dp and sp are enabled (#6226)
### What this PR does / why we need it?

Avoid a bug in the drafter when `dp` and `sp` are enabled.

Specifically, `sp` is disabled when the drafter is dense.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

An aisbench test:

```shell
python3 aisbench_test.py --input_len 3500 --output_len 1000 --data_num 100 --concurrency 320 --request_rate 8
```

The result is okay.

```text
[2026-01-24 22:38:20,256] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Calculate global interval offsets time: 0.5922 s
01/24 22:38:20 - AISBench - INFO - Process 0 using precomputed sleep offsets with 100 requests
Process-0 pid:220279: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [09:40<00:00,  5.81s/it]
Pid:      220279 | Post:        100 | Received:    100 | Failed:        0 | Post Time:12.51s | Receive Time:580.92s: 
Encoding output text...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 93.75it/s]
01/24 22:48:02 - AISBench - INFO - Start converting origin data to detailed data ...
01/24 22:48:02 - AISBench - INFO - Finish converting origin data to detailed data█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 95.08it/s]
01/24 22:48:02 - AISBench - INFO - Added 'Actual RPS: After Excluding Anomalies' to group 'Time - RPS: ' in legend explanation table
01/24 22:48:02 - AISBench - INFO - Successfully merged chart into position (1, 1)
01/24 22:48:02 - AISBench - INFO - RPS distribution charts saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html
01/24 22:48:02 - AISBench - INFO - Updated chart with actual RPS saved to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_rps_distribution_plot_with_actual_rps.html
[2026-01-24 22:48:02,557] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Start extracting pref datas ...
[2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Finish extracting pref datas!
[2026-01-24 22:48:02,558] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dumping detail perf data ...
Dumping data to h5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 75.31it/s]
[2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Dump detail perf data cost: 0.02995561994612217(s)
[2026-01-24 22:48:02,588] [ais_bench.benchmark.openicl.icl_inferencer.icl_gen_perf_inferencer] [INFO] Performance task finished, results saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat
01/24 22:48:02 - AISBench - INFO - time elapsed: 586.32s
Running tasks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [09:55<00:00, 595.91s/it]
01/24 22:48:05 - AISBench - INFO - Performance evaluation tasks completed.
01/24 22:48:05 - AISBench - INFO - Loading detail perf data of model='vllm-api-stream-chat' dataset='gsm8kdataset' ...
01/24 22:48:05 - AISBench - INFO - Starting request timeline processing...
01/24 22:48:05 - AISBench - INFO - Data preprocessing completed in 0.0004s
01/24 22:48:05 - AISBench - INFO - Generating timeline traces for 100 requests...
01/24 22:48:05 - AISBench - INFO - Generated timeline trace chunks in 0.0441s
01/24 22:48:05 - AISBench - INFO - Generating concurrency traces...
01/24 22:48:05 - AISBench - INFO - Generated concurrency trace chunks in 0.0011s
01/24 22:48:05 - AISBench - INFO - Creating figure layout...
01/24 22:48:05 - AISBench - INFO - Figure layout created in 0.0504s
01/24 22:48:05 - AISBench - INFO - Writing to outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html...
01/24 22:48:05 - AISBench - INFO - HTML written in 0.0181s
01/24 22:48:05 - AISBench - INFO - Completed! Total execution time: 0.1148s
01/24 22:48:05 - AISBench - INFO - The gsm8kdataset_plot has been saved in outputs/default/20260124_223809/performances/vllm-api-stream-chat/gsm8kdataset_plot.html
01/24 22:48:05 - AISBench - INFO - Converting perf results of stage ...
01/24 22:48:05 - AISBench - INFO - Finish Converting!
01/24 22:48:05 - AISBench - INFO - Start calculating metrics ...
01/24 22:48:05 - AISBench - INFO - Start calculating common metrics ...
01/24 22:48:05 - AISBench - INFO - Start calculating add units ...
01/24 22:48:05 - AISBench - INFO - Finish calculating perf data!
01/24 22:48:05 - AISBench - INFO - Summarizing performance results...
01/24 22:48:05 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset: 
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 300806.1781 ms │ 189326.0489 ms │ 568345.5121 ms │ 380629.6785 ms │ 384208.3527 ms │ 385363.7709 ms │ 566871.7684 ms │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 107441.2231 ms │ 343.8054 ms    │ 378132.3979 ms │ 188817.4877 ms │ 190985.8451 ms │ 192547.6847 ms │ 378008.356 ms  │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 193.5585 ms    │ 185.1008 ms    │ 197.262 ms     │ 193.8146 ms    │ 195.0803 ms    │ 196.0323 ms    │ 196.9688 ms    │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 194.2067 ms    │ 0.0108 ms      │ 2782.7124 ms   │ 184.9998 ms    │ 194.2631 ms    │ 221.2895 ms    │ 304.363 ms     │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3506.86        │ 3431.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 3508.0         │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 1000.0         │ 100 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 3.7745 token/s │ 1.7595 token/s │ 5.2819 token/s │ 2.6272 token/s │ 5.1028 token/s │ 5.1502 token/s │ 5.2754 token/s │ 100 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ total   │ 580456.2704 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ total   │ 100              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ total   │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ total   │ 100              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ total   │ 51.8224          │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ total   │ 320              │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ total   │ 0.1723 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ total   │ 350686           │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ total   │ 32.6398 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ total   │ 100000           │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ total   │ 604.1558 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ total   │ 172.2783 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ total   │ 776.434 token/s  │
╘══════════════════════════╧═════════╧══════════════════╛
01/24 22:48:05 - AISBench - INFO - Performance Result files locate in outputs/default/20260124_223809/performances/vllm-api-stream-chat.
```
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-25 17:45:29 +08:00
Canlin Guo
b45bd92c2b [Bugfix] Add defensive check for multimodal_config (#6230)
### What this PR does / why we need it?

In vLLM-Omni, `ModelConfig` can be empty, so we need to add a check before
accessing the sub-fields of model_config.
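
A minimal sketch of the defensive check (attribute names are assumptions):

```python
def get_multimodal_config(model_config):
    # Guard against an empty ModelConfig (as seen in vLLM-Omni) before
    # touching its multimodal sub-config.
    if model_config is None:
        return None
    return getattr(model_config, "multimodal_config", None)
```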

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Will be checked by CI.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-25 17:39:19 +08:00
wangxiyuan
2928ae2af5 [Image] fix 310p image build (#6228)
Fix 310p image build error

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-25 16:07:13 +08:00
wangxiyuan
95649344aa Revert "[Refactor] Unify full-graph parameter update logic (#6041)" (#6227)
This reverts commit 8966a99710.

It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`

- vLLM version: v0.14.0
- vLLM main:
d68209402d
2026-01-25 15:25:38 +08:00
Icey
7799c4ca3b [Fusion] change fusion env variable (#6201)
### What this PR does / why we need it?
Since CI has integrated Triton, `fuse_qknorm_rope` is enabled by
default.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-24 22:49:33 +08:00
SILONG ZENG
6ccccad102 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #5) (#5996)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py` |
| `.../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py` |
| `.../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py` |
| `.../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py` |
| `.../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py` |
| `vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` |
| `vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` |
| `vllm_ascend/distributed/kv_transfer/utils/utils.py` |
| `vllm_ascend/kv_offload/cpu_npu.py` |
| `vllm_ascend/kv_offload/npu.py` |
| `vllm_ascend/lora/lora_ops.py` |
| `vllm_ascend/lora/punica_npu.py` |
| `vllm_ascend/lora/utils.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-01-24 22:45:38 +08:00
SILONG ZENG
7faa6878a6 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #3) (#5978)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/attention/mla_v1.py` |
| `vllm_ascend/attention/sfa_v1.py` |
| `vllm_ascend/core/recompute_scheduler.py` |
| `vllm_ascend/core/scheduler_dynamic_batch.py` |
| `vllm_ascend/distributed/device_communicators/npu_communicator.py` |
| `vllm_ascend/distributed/device_communicators/pyhccl.py` |
| `vllm_ascend/distributed/device_communicators/pyhccl_wrapper.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: Soren <user@SorendeMac-mini.local>
2026-01-24 22:10:18 +08:00
SILONG ZENG
4e53c1d900 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6) (#6001)
### What this PR does / why we need it?
| File Path |
| :--- |
| `vllm_ascend/eplb/adaptor/abstract_adaptor.py` |
| `vllm_ascend/eplb/adaptor/vllm_adaptor.py` |
| `vllm_ascend/eplb/core/eplb_device_transfer_loader.py` |
| `vllm_ascend/eplb/core/eplb_utils.py` |
| `vllm_ascend/eplb/core/eplb_worker.py` |
| `vllm_ascend/eplb/core/policy/policy_abstract.py` |
| `vllm_ascend/eplb/core/policy/policy_default_eplb.py` |
| `vllm_ascend/eplb/core/policy/policy_factory.py` |
| `vllm_ascend/eplb/core/policy/policy_flashlb.py` |
| `vllm_ascend/eplb/core/policy/policy_random.py` |
| `vllm_ascend/eplb/core/policy/policy_swift_balancer.py` |
| `vllm_ascend/eplb/eplb_updator.py` |
| `vllm_ascend/eplb/utils.py` |
| `vllm_ascend/model_loader/netloader/executor/elastic_load.py` |
| `vllm_ascend/model_loader/netloader/executor/netloader_pg.py` |
| `vllm_ascend/model_loader/netloader/interaction/elastic.py` |
| `vllm_ascend/model_loader/netloader/load.py` |
| `vllm_ascend/model_loader/netloader/netloader.py` |
| `vllm_ascend/model_loader/netloader/utils.py` |
| `vllm_ascend/patch/platform/__init__.py` |
| `vllm_ascend/patch/platform/patch_balance_schedule.py` |
| `vllm_ascend/patch/platform/patch_ec_connector.py` |
| `vllm_ascend/patch/platform/patch_mamba_config.py` |
| `vllm_ascend/patch/platform/patch_multiproc_executor.py` |
| `vllm_ascend/patch/platform/patch_sched_yield.py` |


- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
SILONG ZENG
153da1a669 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #4) (#6200)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/distributed/kv_transfer/__init__.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 20:40:48 +08:00
Shaoxu Cheng
fbae41697e [310P]: refactoring for 310p kvcache and some ops class (#6117)
### What this PR does / why we need it?
* Refactor the LayerNorm and activation operator classes to decouple the
310P device implementation from the main branch.
* Refactor `mm_encoder_attention` on 310P to use the
`torch_npu._npu_flash_attention_unpad` operator.
* Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P
so they are no longer padded to 16× alignment.
* Refactor `model_runner` on 310P to align the KV-cache initialization
logic with the mainline implementation.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
use the e2e tests.

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-01-24 20:34:29 +08:00
Angazenn
5b746f3e83 [Inductor]change pass to adapt to new addrmsnormBias operator (#6094)
### What this PR does / why we need it?
#5790 changes the default addRmsNormBias operator when custom ops are
enabled. This PR modifies the AddRmsNormQuant pass to align with
addRmsNormBias.

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-01-24 20:16:44 +08:00
LICO67373
8966a99710 [Refactor] Unify full-graph parameter update logic (#6041)
### What this PR does / why we need it?

**Refactor: Unify full-graph parameter update logic**

This PR consolidates the scattered full-graph parameter update logic
into a unified approach, improving code architecture and eliminating
duplication.

**Key improvements:**

1. **Unified interface**
   - Create `update_full_graph_params` as the single entry point for all
     full-graph updates (see the sketch after this list)
   - Replace multiple scattered update calls with one unified function
   - Remove ~50 lines of duplicated if-else logic across
     `model_runner_v1.py` and `eagle_proposer.py`

2. **Better architecture**
   - Move update logic to respective Backend classes
     (`AscendAttentionBackend`, `AscendMLABackend`)
   - Each Backend manages its own parameter update logic internally
   - Simplify caller code to just dispatch to the appropriate Backend

3. **Cleaner parameter handling**
   - Remove unnecessary `pcp_size` and `dcp_size` parameter passing
   - Get parallel configuration directly from distributed groups
   - Consistent with how other parts of the codebase obtain these values
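
A hedged sketch of the resulting call shape (names are illustrative, not
the exact vllm-ascend signatures):

```python
# Single entry point: callers no longer branch on the backend type; each
# backend class owns the update logic for its own full-graph parameters.
def update_full_graph_params(attn_backend, graph_params, attn_metadata):
    # e.g. AscendAttentionBackend / AscendMLABackend each implement this
    attn_backend.update_graph_params(graph_params, attn_metadata)
```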

**Why we need it:**
- **Maintainability**: Future changes only need to be made in one place
per Backend
- **Code quality**: Follows DRY principle and Single Responsibility
Principle
- **Readability**: Cleaner, more intuitive code structure

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes - same
behavior, cleaner code.

### How was this patch tested?

- All existing unit tests pass with updated mocks
- No new tests needed (pure refactoring, no behavior changes)
- CI validates correctness

---

- vLLM version: v0.13.0

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2026-01-24 20:12:57 +08:00
Zeng haolong
8129c429ef [Doc] Improved English grammar and integrated the DeepWiki badge for Ask AI (#6216)
### What this PR does / why we need it?

README.md: Improved English grammar and integrated the DeepWiki badge
for Ask AI

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

None

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: fyfugoyfa <zenghaolong@huawei.com>
Signed-off-by: Mitchell-xiyunfeng <3617237115@qq.com>
Co-authored-by: fyfugoyfa <zenghaolong@huawei.com>
2026-01-24 20:11:18 +08:00
Icey
4fcacca8a6 [BugFix] Fix build wheel (#6218)
### What this PR does / why we need it?
- Fixes
https://github.com/vllm-project/vllm-ascend/actions/runs/21312847954/job/61351587180

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-24 20:08:20 +08:00
Icey
fc26260d84 [BugFix] buildwheel dependency install (#6212)
### What this PR does / why we need it?
Fix the build-wheel dependency install. Fixes
https://github.com/vllm-project/vllm-ascend/actions/runs/21309549095

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-24 17:11:55 +08:00
wangxiyuan
21833a4321 [Doc] Add release note for 0.13.0rc2 (#6207)
Add release note for 0.13.0rc2

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-24 12:51:47 +08:00
liziyu
f66bcdfb29 [P/D] Mooncake connector add zmq socket fail log (#6155)
Add a failure log for the ZMQ socket in the Mooncake connector.

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-24 12:06:42 +08:00
liziyu
14bef9af6f [P/D] Remove restrictions on mooncake for IPv6 (#5946)
### What this PR does / why we need it?
Remove restrictions on mooncake for IPv6.
Dependencies: cann8.5, mooncake v0.3.8.post1

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-24 11:30:22 +08:00
Angazenn
019a2fe6e6 [Eagle3]enhance skipping dp allreduce and add it into eagle proposer (#6192)
### What this PR does / why we need it?
This PR:
1. Enhances the logic of `_skip_all_reduce_across_dp_group` to skip all
CPU dp allreduces for dense models. This also serves purpose 2.
2. Adds `_skip_all_reduce_across_dp_group` into eagle_proposer, so models
like Qwen3-235b now support eagle3 spec decode. A typical setting for these
MoE models on PD disaggregation often introduces `dp_size > 1`, which
requires `set_forward_context` to call a CPU dp allreduce to retrieve
`num_tokens_across_dp` in all cases. Skipping this allreduce greatly
improves performance (see the sketch below).
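
A hedged sketch of the skip path (group handling and names are illustrative):

```python
# When every DP rank is known to run the same token count (the dense case
# handled here), the CPU all-reduce that gathers num_tokens_across_dp can be
# skipped entirely, avoiding a host synchronization every step.
import torch

def get_num_tokens_across_dp(num_tokens: int, dp_size: int, dp_rank: int,
                             skip_allreduce: bool, cpu_group=None) -> torch.Tensor:
    if skip_allreduce or dp_size == 1:
        return torch.full((dp_size,), num_tokens, dtype=torch.int32)
    counts = torch.zeros(dp_size, dtype=torch.int32)
    counts[dp_rank] = num_tokens
    torch.distributed.all_reduce(counts, group=cpu_group)
    return counts
```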

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-01-24 11:29:42 +08:00
zhangyiming
56d8f088dd [Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node deployment (#6196)
### What this PR does / why we need it?
[Doc] Update the DeepSeek-V3.2 tutorial, adding single-node and multi-node
deployment guides.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: menogrey <1299267905@qq.com>
2026-01-24 11:29:07 +08:00
zhaomingyu13
2dd68652bc [Doc] Add the setting description of cudagraph_capture_sizes in speculative decoding user guide (#5637)
### What this PR does / why we need it?
Add a description of the cudagraph_capture_sizes setting to guide users
away from common mistakes made when using EAGLE with fullgraph.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need for testing
- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 23:22:44 +08:00
UnifiedCacheManager
a2f022f9b6 [UCMConnector]Add has_connector_metadata (#6172)
### What this PR does / why we need it?
Add the `has_connector_metadata` interface to ucm_connector to adapt to
the latest KV connector API in vLLM.

### Does this PR introduce _any_ user-facing change?
this PR doesn't introduce _any_ user-facing change.


### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2026-01-23 21:16:48 +08:00
lhchg
717d299ae5 [BugFix]bug fix for dispatch_ffn_combine (#6156)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
The fusion operator's synchronization logic copies EP * expertPerRank
int32 values. This data contains both synchronization signals and payload.

The 512B DataBlock of Ascend A3 writes all data in the same block
atomically to HBM.

For the DeepSeek model, when expertPerRank per device is 16, 512B
alignment is met in both the 16-device single-node and the 32-device
two-node scenarios. Therefore, we check the first position of each 512B
chunk: if the value is not 0, that 512B of data has already been sent.

However, when expertPerRank per device is not 16, EP * expertPerRank does
not meet the 512B alignment, and the check above breaks.

Therefore, we pad the EP * expertPerRank data length up to the next
512B-aligned length (see the sketch below).
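
A minimal sketch of the padding rule described above:

```python
# Pad the EP * expertPerRank int32 signal/data region to the next 512-byte
# boundary, so every 512B DataBlock is written atomically and the "first
# int32 of each 512B chunk" readiness check stays valid.
INT32_BYTES = 4
BLOCK_BYTES = 512

def padded_num_int32(ep_size: int, experts_per_rank: int) -> int:
    raw_bytes = ep_size * experts_per_rank * INT32_BYTES
    padded_bytes = (raw_bytes + BLOCK_BYTES - 1) // BLOCK_BYTES * BLOCK_BYTES
    return padded_bytes // INT32_BYTES   # length back in int32 elements
```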

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
2026-01-23 21:14:18 +08:00
drslark
44a4ff6960 [main][BugFix] Avoided a bug of torch_npu.npu_mm_reduce_scatter_base when sp size >= 16 (#6168)
### What this PR does / why we need it?
If `sp` is enabled and `tp_size` >= 16,
`torch_npu.npu_mm_reduce_scatter_base` raises an exception.
After consulting the operator developer, we learned that the operator does
not work when `tp` = 16, so we disable it in that case.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested

We started a server with `sp` enabled and `tp` = 16.

It started successfully.

```text
(APIServer pid=1855938) INFO:     Started server process [1855938]
(APIServer pid=1855938) INFO:     Waiting for application startup.
(APIServer pid=1855938) INFO:     Application startup complete.
```

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-23 21:12:23 +08:00
yjmyl
e90b14140b [feature] add_rms_norm support bias (#5790)
### What this PR does / why we need it?
This PR replaces the separate addRmsNorm and Add operations with
addRmsNormBias, which is more efficient.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Full Test Pass

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com>
Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>
2026-01-23 21:09:54 +08:00