Commit Graph

483 Commits

Author SHA1 Message Date
Li Wang
37a9cf818a [Misc] Bump mooncake version to v0.3.8.post1 (#6110)
### What this PR does / why we need it?
Since Mooncake has a newer
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1),
we pin the tag to the latest release.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-22 11:03:16 +08:00
wangxiyuan
69740039b7 [CI] Upgrade CANN to 8.5.0 (#6070)
### What this PR does / why we need it?
1. Upgrade CANN to 8.5.0
2. Move triton-ascend 3.2.0 to requirements

Note: we skipped the two failing e2e tests; see
https://github.com/vllm-project/vllm-ascend/issues/6076 for more details.
We'll fix them soon.


### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/5494

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-22 09:29:50 +08:00
Nengjun Ma
ab676413e6 Default enable MLAPO (#5952)
### What this PR does / why we need it?
1) Enable MLAPO by default for DeepSeek MLA attention W8A8 models on the PD
disaggregation D instance, for example: DeepSeekV3-W8A8,
DeepSeek-R1-W8A8.
2) Enable MLAPO by default for DeepSeek SFA attention W8A8 models,
currently DeepSeek-V3.2-W8A8.

### Does this PR introduce _any_ user-facing change?
Users no longer need to manually set VLLM_ASCEND_ENABLE_MLAPO=1 to enable the
MLAPO feature for DeepSeek W8A8 models.
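
For reference, a minimal sketch of the previous opt-in style that is no longer required (the serve command and model path are illustrative):

```
# Before this change (illustrative): MLAPO had to be enabled explicitly
export VLLM_ASCEND_ENABLE_MLAPO=1
vllm serve /path/to/DeepSeek-R1-W8A8 --quantization ascend
# After this change, the env var is no longer required for the W8A8 models above
```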

The effect of enabling MLAPO for the SFA model deployed on a single A3 node.
Tested with tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py,
dataset gsm8k-lite, without MTP, FULL graph; about a 19% improvement:

Without MLAPO enabled by default:

| Metric | Value |
| --- | --- |
| TTFT | 14055.8836 ms |
| ITL | 66.8171 ms |
| Output Token Throughput | 104.9105 token/s |

With MLAPO enabled by default:

| Metric | Value |
| --- | --- |
| TTFT | 3753.1547 ms |
| ITL | 61.4236 ms |
| Output Token Throughput | 125.2075 token/s |

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-22 09:26:39 +08:00
MengLong Chen
a15a5f6aa5 [Doc] Supplement PD separation parameters of DeepSeek V3.1 (#6053)
### What this PR does / why we need it?
Supplement the PD-separation parameters of DeepSeek V3.1.
The recommended parameter configuration for DeepSeek V3.1 in the EP32
scenario after PD separation has been adjusted, and the core parameters
are described in detail.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2026-01-22 08:53:44 +08:00
meihanc
53bfb38192 [CI]Update triton ascend version in 3.2.0 (#6067)
### What this PR does / why we need it?
update triton ascend version in 3.2.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-21 16:02:23 +08:00
Magnus
5b129cf0a1 [1/N][Feat] Xlite Qwen3 MoE Support (#5951)
### What this PR does / why we need it?
This patch adds support for the Qwen3-MoE model in Xlite. For more
details about Xlite, please refer to the following link:
https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.

Qwen3-MoE TODO List:
- [ ] Qwen3-235B-A22B support
- [ ] Qwen3-MoE weights NZ support
- [ ] Qwen3-MoE data parallel support

## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance
Comparison
- aclgraph: main(69b170b8b5)
- xlite-full: main + xlite-full
- xlite-decode-only: main + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph

| maxconcurrency | item | TTFT Avg (ms) | TTFT P99 (ms) | TPOT Avg (ms) | TPOT P99 (ms) | QPS (req/s) | Output Speed (token/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | baseline-aclgraph | 205.07 | 287.29 | 12.34 | 12.65 | 0.14 | 78.81 |
| 1 | xlite-full | 66.40 | 113.69 | 11.71 | 12.40 | 0.15 | 84.73 |
| 1 | xlite-decode-only | 221.15 | 316.40 | 12.16 | 12.91 | 0.14 | 79.70 |
| 1 | diff1 | -67.62% | -60.43% | -5.11% | -1.98% | 7.14% | 7.51% |
| 1 | diff2 | 7.84% | 10.13% | -1.46% | 2.06% | 0.00% | 1.13% |
| 16 | baseline-aclgraph | 1892.16 | 13916.86 | 22.78 | 39.28 | 1.15 | 589.89 |
| 16 | xlite-full | 1355.40 | 8907.45 | 15.96 | 25.15 | 1.65 | 850.21 |
| 16 | xlite-decode-only | 1519.42 | 8711.64 | 19.23 | 29.73 | 1.38 | 711.60 |
| 16 | diff1 | -28.37% | -36.00% | -29.94% | -35.97% | 43.48% | 44.13% |
| 16 | diff2 | -19.70% | -37.40% | -15.58% | -24.31% | 20.00% | 20.63% |
| 32 | baseline-aclgraph | 673.80 | 3914.90 | 32.20 | 37.95 | 1.80 | 928.54 |
| 32 | xlite-full | 481.65 | 2710.50 | 19.95 | 25.35 | 2.91 | 1506.67 |
| 32 | xlite-decode-only | 372.22 | 1095.25 | 25.19 | 28.47 | 2.33 | 1202.82 |
| 32 | diff1 | -28.52% | -30.76% | -38.04% | -33.20% | 61.67% | 62.26% |
| 32 | diff2 | -44.76% | -72.02% | -21.77% | -24.98% | 29.44% | 29.54% |
| 48 | baseline-aclgraph | 583.18 | 3277.65 | 41.02 | 46.05 | 2.17 | 1115.08 |
| 48 | xlite-full | 973.42 | 8237.33 | 23.29 | 30.50 | 3.71 | 1908.09 |
| 48 | xlite-decode-only | 480.79 | 2026.98 | 31.48 | 35.41 | 2.83 | 1453.75 |
| 48 | diff1 | 66.92% | 151.32% | -43.22% | -33.77% | 70.97% | 71.12% |
| 48 | diff2 | -17.56% | -38.16% | -23.26% | -23.11% | 30.41% | 30.37% |
| 64 | baseline-aclgraph | 742.74 | 5953.39 | 47.79 | 53.15 | 2.48 | 1272.37 |
| 64 | xlite-full | 545.22 | 3941.34 | 25.09 | 30.41 | 4.64 | 2376.44 |
| 64 | xlite-decode-only | 752.40 | 4534.29 | 38.67 | 43.28 | 3.06 | 1567.94 |
| 64 | diff1 | -26.59% | -33.80% | -47.50% | -42.78% | 87.10% | 86.77% |
| 64 | diff2 | 1.30% | -23.84% | -19.08% | -18.57% | 23.39% | 23.23% |
| 100 | baseline-aclgraph | 565.52 | 1716.81 | 60.89 | 68.69 | 3.08 | 1580.64 |
| 100 | xlite-full | 398.14 | 2328.88 | 30.70 | 32.45 | 6.01 | 3086.42 |
| 100 | xlite-decode-only | 712.53 | 4875.94 | 52.71 | 60.78 | 3.53 | 1813.58 |
| 100 | diff1 | -29.60% | 35.65% | -49.58% | -52.76% | 95.13% | 95.26% |
| 100 | diff2 | 26.00% | 184.01% | -13.43% | -11.52% | 14.61% | 14.74% |
| 150 | baseline-aclgraph | 842.42 | 5175.01 | 73.60 | 88.18 | 3.80 | 1952.26 |
| 150 | xlite-full | 568.52 | 4204.33 | 37.90 | 40.01 | 7.27 | 3734.72 |
| 150 | xlite-decode-only | 654.43 | 2504.06 | 67.40 | 77.00 | 4.18 | 2145.11 |
| 150 | diff1 | -32.51% | -18.76% | -48.51% | -54.63% | 91.32% | 91.30% |
| 150 | diff2 | -22.32% | -51.61% | -8.42% | -12.68% | 10.00% | 9.88% |
| 200 | baseline-aclgraph | 750.63 | 3049.91 | 88.26 | 101.95 | 4.28 | 2189.72 |
| 200 | xlite-full | 558.48 | 3791.98 | 45.54 | 49.04 | 8.17 | 4175.52 |
| 200 | xlite-decode-only | 807.09 | 4254.95 | 85.18 | 101.79 | 4.44 | 2271.52 |
| 200 | diff1 | -25.60% | 24.33% | -48.40% | -51.90% | 90.89% | 90.69% |
| 200 | diff2 | 7.52% | 39.51% | -3.49% | -0.16% | 3.74% | 3.74% |

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: LVYANGGUO <275926687@qq.com>
Co-authored-by: lulina <lina.lulina@huawei.com>
2026-01-21 09:26:03 +08:00
ChenCangtao
6c30f8bf87 [Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775)
### What this PR does / why we need it?
This is a part of
https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762
1. Refactor the npugraph_ex config and modify the default configuration of
the static kernel; the new default value of the static kernel is false.
2. Support online inference with the static kernel.
3. Fix the issue where manually modifying FX graphs caused an abnormal
model return type, and remove the related redundant code.

### Does this PR introduce _any_ user-facing change?
Yes, the new config of npugraph_ex is as follows:
```
additional_config={
            "npugraph_ex_config": {
                "enable": True,
                "enable_static_kernel": False
            }
        }
```
### How was this patch tested?
```
vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 4 \
    --tensor-parallel-size 4 \
    --quantization ascend \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --max-num-seqs 48 \
    --max-model-len 40000 \
    --async-scheduling \
    --max-num-batched-tokens 9000 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \
    --gpu-memory-utilization 0.9 \
    --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config \
    '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}'
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-20 21:31:38 +08:00
Canlin Guo
afabb49f00 [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (#6034)
### What this PR does / why we need it?

Add docs for Qwen3-VL-Embedding & Qwen3-VL-Reranker.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-20 17:36:31 +08:00
meihanc
ea57e3e7a4 [Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5988)
### What this PR does / why we need it?
Upgrade vllm commit to releases/v0.14.0

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-20 15:10:40 +08:00
starmountain1997
0664c6e67a [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (#5921)
### What this PR does / why we need it?

#### Documentation Improvements

New Configuration: Added the layer_sharding parameter to the
DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include
`["q_b_proj", "o_proj"]` in their prefill node setup for better resource
utilization.
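
A minimal sketch of the recommended addition, assuming the `--additional-config` syntax used elsewhere in this repository (the model path and the omitted flags are placeholders):

```
vllm serve /path/to/DeepSeek-V3.2-W8A8 \
    --quantization ascend \
    --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}'
```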

#### CI and Testing Updates

Test Config Update: Updated the multi-node E2E test configuration file:
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml.

including disabling `FLASHCOMM`, enabling `FULL_DECODE_ONLY`, and updating the
performance baseline.

### Does this PR introduce any user-facing change?

Yes. The documentation now recommends a more optimized startup command
for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see
improved performance in multi-node PD disaggregation environments.

### How was this patch tested?
CI Validation: The updated E2E test configuration has been verified
through the nightly CI pipeline.

Environment: * vLLM version: v0.13.0

Base Commit:
[11b6af5](11b6af5280)

Hardware: Ascend A3/A2 multi-node cluster.

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-20 12:40:54 +08:00
Qiu
38cfcd572a [doc](cp) correct the prefill of GQA and adjust desc of block table. (#5697)
### What this PR does / why we need it?
Correct the KV sequence length for the prefill phase of GQA and clarify the
description of block table distribution in the developer guide.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-19 18:53:48 +08:00
LI SHENGYONG
83de5385b4 [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897)
### What this PR does / why we need it?
1. Rename dynamic_ep to default_eplb.
2. Rename dynamic_ep_v2 to swift_balancer.
3. Discard the function compose_expert_update_info_bipartite.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-19 05:47:40 +00:00
meihanc
9cad1a8349 [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928)
### What this PR does / why we need it?

Migrate the torch profiler configuration from deprecated environment
variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`,
`VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit
`ProfilerConfig` object, aligning with vLLM's configuration best
practices.
The profiler environment variable approach is deprecated in vLLM and
will be removed in v0.14.0 or v1.0.0.

### Does this PR introduce _any_ user-facing change?
Yes. Developers who want to capture profiles should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR`.
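
A hedged before/after sketch; the JSON field names passed to `--profiler-config` are assumptions, so check `vllm serve --help` for the exact schema:

```
# Deprecated env-var style
VLLM_TORCH_PROFILER_DIR=/tmp/vllm_profile vllm serve /path/to/model
# New explicit config style (field names are illustrative)
vllm serve /path/to/model \
    --profiler-config '{"torch_profiler_dir": "/tmp/vllm_profile", "with_stack": true}'
```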
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-19 09:27:55 +08:00
herizhen
0eafed9bd6 [doc]Table split (#5929)
### What this PR does / why we need it?
Added legend descriptions, and split redundant tables into core
supported model tables and extended compatible model tables.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut
- vLLM version: v0.13.0
- vLLM main:
11b6af5280

---------

Signed-off-by: herizhen <1270637059@qq.com>
2026-01-19 09:15:04 +08:00
Li Wang
c4fde5c064 [Doc] Upgrade outdated ut doc (#5937)
### What this PR does / why we need it?
For a CPU environment, `SOC_VERSION` should be set to mock different NPU chips so
that the different compilation paths are exercised.
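
A minimal sketch of that workflow (the SoC value and test path are illustrative; pick the value matching the chip to be mocked):

```
export SOC_VERSION=ASCEND910B1
python -m pytest tests/ut -x
```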
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-19 09:12:46 +08:00
zzhxxx
05e69b99e5 [Doc] Remove Chinese characters from the icons in the doc. (#5959)
### What this PR does / why we need it?
Remove Chinese characters from the icons in the doc.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2026-01-18 07:22:57 +08:00
wjunLu
73a3f822c7 [Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5911)
### What this PR does / why we need it?
Upgrade vllm commit to releases/v0.14.0

- Re-open cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since
the earlier errors have been fixed by
https://github.com/vllm-project/vllm/pull/32243
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-15 23:22:43 +08:00
wangxiyuan
44d3b4d61a [Misc] Cleanup useless file and code (#5877)
Remove useless file and code

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-15 20:32:47 +08:00
Shanshan Shen
efa0f64f22 [Doc] Add tutorials for Qwen3-VL-30B-A3B-Instruct (#5331)
### What this PR does / why we need it?

Add tutorials for `Qwen3-VL-30B-A3B-Instruct`.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-15 10:56:19 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depends on PR 5817.

### Does this PR introduce _any_ user-facing change?

before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

after this pr: 
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### test qwen3-235b eplb num_redundant_experts=16

without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
wjunLu
c11a05c4e1 [Main2Main] Upgrade vllm commit to 0113 (#5839)
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)

- Modify import paths due to the refactors
https://github.com/vllm-project/vllm/pull/31916
https://github.com/vllm-project/vllm/pull/32054

- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional
arguments but 3 were given` due to
https://github.com/vllm-project/vllm/pull/24498

- Skip the async-scheduling tests in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never
verified
https://github.com/vllm-project/vllm/pull/31998

- Skip some pooling tests, which break because of
https://github.com/vllm-project/vllm/pull/32148
where vLLM also fails:
https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4

We will reopen those tests when main2main reaches
https://github.com/vllm-project/vllm/pull/32243

- Skip some cases in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are
broken by
https://github.com/vllm-project/vllm/pull/32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 09:48:53 +08:00
SILONG ZENG
4811ba62e0 [Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2026-01-15 09:06:01 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer-related code into a dedicated folder, as has already
been done in vLLM.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00
herizhen
d31170496b [doc]index display by category (#5852)
### What this PR does / why we need it?
Upgrade the tutorial doc index to display by category.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-14 16:50:49 +08:00
LHXuuu
0415e694cd [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718)
### What this PR does / why we need it?
When the LLM Compressor quantization tool from the vLLM community is used
to generate quantized weights, the vLLM Ascend engine needs to be
adapted to support the compressed-tensors quantization format.

1. Support MoE model W8A8 Int8 dynamic weight (see the sketch below).
2. Specify the W4A16 quantization configuration.
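
A minimal serving sketch, assuming a checkpoint exported by LLM Compressor in compressed-tensors format (the model path and parallelism are placeholders):

```
vllm serve /path/to/Qwen3-MoE-W8A8-Int8-dynamic \
    --quantization compressed-tensors \
    --tensor-parallel-size 4
```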

Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-01-14 09:17:26 +08:00
zhangxinyuehfad
f7b904641e [Main2Main] Upgrade vllm commit to 0109 (#5752)
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. Remove `init_cached_hf_modules` due to
https://github.com/vllm-project/vllm/pull/31786
2. Fix the spec_decode e2e test broken by
https://github.com/vllm-project/vllm/pull/29821
3. Fix `vllm.v1.attention.backends.utils` due to
https://github.com/vllm-project/vllm/pull/31891
4. Fix `self.seq_lens - query_lens` on the same device due to
https://github.com/vllm-project/vllm/pull/31773
5. Skip the model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-13 19:14:43 +08:00
SILONG ZENG
523e83016b [Lint]Style: Convert root, benchmarks, tools and docs to ruff format (#5843)
### What this PR does / why we need it?
This PR fixes linting issues in the root directory, benchmarks/, tools/,
and docs/ to align with the project's Ruff configuration.

This is part of a gradual effort to enable full linting coverage across
the repository. The corresponding paths have been removed from the
exclude list in pyproject.toml.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2026-01-13 15:29:34 +08:00
Yikun Jiang
fe251a2efe [Doc] Update community contributors and versioning naming to follow vLLM (#5820)
### What this PR does / why we need it?

This pull request updates documentation to align with vLLM's community
standards.
- Change `Maintainers` to `Committers` to follow vLLM naming:
https://docs.vllm.ai/en/latest/governance/committers/
- Change release branch policy from `vX.Y.Z-dev` to `releases/vX.Y.Z`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doc ci passed
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2026-01-13 08:47:11 +08:00
liziyu
451bbdc292 [Doc] add tls check to pd disaggregation readme (#5638)
### What this PR does / why we need it?

Update the PD disaggregation multi-node README: update the environment check
command for A3 and add a TLS check.
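
An illustrative TLS check of the kind the README adds; the `hccn_tool` invocation and NIC index range are assumptions, so follow the updated README for the exact command:

```
for i in {0..15}; do hccn_tool -i $i -tls -g | grep switch; done
```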
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-12 15:49:18 +08:00
wangxiyuan
5ccd53e28a [CI] adpat v0.13.0 change (#5793)
Add `releases` match case for CI jobs and update related doc for v0.13.0
branch

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-12 14:06:56 +08:00
wangxiyuan
354ee3b330 [Doc] Update doc url link (#5781)
Drop the `dev` suffix from the doc URL.
Rename the URL to `https://docs.vllm.ai/projects/ascend`.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-12 11:21:31 +08:00
1092626063
3ba064f804 [Doc] Add GLM4.5 GLM4.6 doc (#5740)
### What this PR does / why we need it?
Add GLM4.5 GLM4.6 doc

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: 1092626063 <1092626063@qq.com>
2026-01-09 16:40:49 +08:00
zzhxxx
36d74aba58 [Doc][fix] Fix the title of the document for the layer_sharding feature (#5759)
### What this PR does / why we need it?
Fix the title of the document for the layer_sharding feature

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2026-01-09 14:15:22 +08:00
zyz111222
98c788a65a [Doc] add PaddleOCR-VL tutorials guide (#5556)
### What this PR does / why we need it?
1. add PaddleOCR-VL.md in the `docs/source/tutorials/`
2. add PaddleOCR-VL index in  `docs/source/tutorials/index.md`

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by CI

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: zouyizhou <zouyizhou@huawei.com>
2026-01-09 11:01:25 +08:00
zhenwenqi2024
97f6be8108 [feature]dcp&pcp support mlapo (#5672)
### What this PR does / why we need it?
MLAPO in DeepSeek is a huge performance improvement in decode; this PR
supports PCP & DCP with MLAPO.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-08 23:49:23 +08:00
meihanc
503822c56c [Doc] Add Qwen3-Omni-30B-A3B-Thinking Tutorials (#3991)
### What this PR does / why we need it?
Add Qwen3-Omni-30B-A3B-Thinking Tutorials 

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-08 16:57:20 +08:00
zzhxxx
f7db812ed7 [refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181)
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration: the feature is opt-in via the additional_config
argument:
```
--additional-config '{
  "layer_sharding": ["o_proj", "q_b_proj"]
}'
```

This is orthogonal to standard tensor parallelism and weight replication
strategies. It is treated as a separate, explicit feature. It can be used
in any scenario, combined with the flashcomm2 feature
(https://github.com/vllm-project/vllm-ascend/pull/3232)
or the ShardedCP feature (#4702), to achieve significant performance gains.



- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2026-01-08 09:05:02 +08:00
wangxiyuan
6f7a81cd9f [CI] cleanup single/multi-card test (#5623)
1. Speed up the e2e light test.
2. Create `2-cards` and `4-cards` folders in multicard.
3. Move ops to nightly.
4. Run tests in alphabetical order.

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-07 14:13:34 +08:00
wjunLu
b8f245792e [Main2Main] Upgrade vllm commit to 0106 (#5617)
### What this PR does / why we need it?
Upgrade vllm commit to 0106

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-06 15:50:40 +08:00
meihanc
c1dcddce3f [CI]update bisheng version (#5621)
### What this PR does / why we need it?
update bisheng version in 20260105

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-06 15:22:22 +08:00
wjunLu
3cf059a72b [Main2Main] Upgrade vllm commit to 0105 (#5595)
### What this PR does / why we need it?

Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e)

1. Remove `maybe_padded_num_tokens` arg in `model_runner_v1.py` since
https://github.com/vllm-project/vllm/pull/31517 deleted unused arg

2. Remove dense `Qwen/Qwen3-0.6B` in
`tests/e2e/multicard/test_aclgraph_capture_replay.py` and
`tests/e2e/multicard/test_data_parallel.py` due to
https://github.com/vllm-project/vllm/pull/30739
where offline data parallel mode will not be supported/useful for dense
models

3. Adapt `vllm_ascend/worker/worker.py` due to
https://github.com/vllm-project/vllm/pull/31584

4. Adapt `self.block_size` calling due to
https://github.com/vllm-project/vllm/pull/31540

5. Modify `test_mla_v1.py` due to
https://github.com/vllm-project/vllm/pull/28454, which refactored
`get_head_size()`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-06 08:44:29 +08:00
Qiu
b10ef9b9f3 [docs] Correct image about prefill phase of PCP (#5598)
### What this PR does / why we need it?
Remove the incorrectly depicted DCP all_gather operation in the prefill
stage PCP for GQA diagram.

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-05 20:21:59 +08:00
huqi
2d22700d69 Docs: Add A3 Docker image guidance for Atlas A3 machines (#5256)
Fixes #3386

- Update Qwen3-30B-A3B.md to use A3-specific image tag
- Update Qwen3-Dense.md to provide both A2 and A3 image options  
- Update Qwen3-Next.md to use A3-specific image for Atlas A3
environments

Previously, documentation only mentioned A2 images (vllm-ascend:version)
but Atlas A3 machines require A3-specific images
(vllm-ascend:version-a3). This change ensures users select the correct
image for their hardware.
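
An illustrative pull of both variants, assuming the registry used in the vllm-ascend docs (the tag is a placeholder):

```
# Atlas 800 A2 series image
docker pull quay.io/ascend/vllm-ascend:<version>
# Atlas A3 series image
docker pull quay.io/ascend/vllm-ascend:<version>-a3
```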

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: hu-qi <huqi1024@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-01-05 19:42:42 +08:00
huqi
9d8b4c8d9d [Doc] Add NNAL installation guide and requirements (#5235)
Fixes #2727

- Add NNAL to the software requirements table with version information
- Add note explaining that prebuilt Docker images include NNAL
- Add warning message for manual installation when encountering
libatb.so errors
- Improve visibility of NNAL installation instructions to prevent
runtime errors

This addresses the issue where users encounter 'libatb.so not found'
errors due to missing NNAL installation in their environment.
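
A minimal sketch of the manual fix for the `libatb.so` error, assuming the default NNAL install location (adjust the path to match your environment):

```
# After installing the NNAL (ATB) package, source its environment script
source /usr/local/Ascend/nnal/atb/set_env.sh
```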

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: hu-qi <huqi1024@gmail.com>
Co-authored-by: zhangyiming <34808445+menogrey@users.noreply.github.com>
2026-01-05 19:40:26 +08:00
zhangmuzhi_yuwan
6c1a685b30 [Doc] add new doc for mooncake: PD-Colocated cross-node multi-instance validation of Mooncake's KV Cache reuse and performance. (#5415)
### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.

The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
0bfd7484fd

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2026-01-05 14:19:57 +08:00
lilinsiman
52863c4165 [Refactor][EAGLE] 2/N: load model and generate token (#5437)
### What this PR does / why we need it?
1. Refactor the eagle and mtp functions load_model and generate_token_ids.
2. Remove redundant code in the mtp and eagle files.
3. Refactor the UTs of these files.

2/N of Refactor and merge mtp and eagle
Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut and tests

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-01-05 14:07:54 +08:00
L4
c23cf30709 [Doc] eval-type not support service but server (#2920)
### What this PR does / why we need it?
Fix the wrong eval-type in the accuracy doc.

- vLLM version: v0.10.2
- vLLM main:
fec347dee1

Signed-off-by: root <root@liaolile-laptop.localdomain>
Co-authored-by: root <root@liaolile-laptop.localdomain>
2026-01-05 11:17:39 +08:00
zhangxinyuehfad
a099b994b3 [Doc] update supported models (#5379)
### What this PR does / why we need it?
1. update supported models: Llama2 & Kimi-K2-Thinking & ERNIE-4.5 &
Qwen3-Omni
2. update Supported Hardware

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-01-05 09:21:52 +08:00
meihanc
fbb93ad8f2 [bugfix]update bishengir source envs (#5582)
### What this PR does / why we need it?
Due to the update of the Bisheng version's installation path, the
corresponding source path in the environment variables needs to be
updated.

- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-05 09:13:40 +08:00
InSec
7cf65d0581 [Doc]modify the quantization user guide and add a quantization adaptation developer guide (#5554)
### What this PR does / why we need it?
This PR makes the following modifications:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md`
and merge it into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add guidance in `developer_guide/feature_guide/quantization.md` on the
adaptation of quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
2026-01-05 09:12:11 +08:00