Commit Graph

627 Commits

SILONG ZENG
e2237819a9 [CI] Fixed the spell check regex in typos.toml (#6753)
### What this PR does / why we need it?
The incorrect regular expression `.*[UE4M3|ue4m3].*` uses a character
class rather than an alternation, so it actually ignores every
identifier containing any of the characters `U`, `E`, `4`, `M`, `3`,
`|`, `u`, `e`, or `m`.

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
The fix replaces the character class with a grouped alternation:
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
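
For reference, here is a quick Python check of the two semantics. `typos` itself uses Rust's regex crate, but character-class vs. alternation behaves the same way; the identifiers below are made up for illustration:

```python
import re

# Old pattern: [UE4M3|ue4m3] is a character CLASS, matching any single
# character from {U, E, 4, M, 3, |, u, e, m}.
old = re.compile(r".*[UE4M3|ue4m3].*")
# Fixed pattern: (UE4M3|ue4m3) is a GROUP with alternation, matching only
# the literal substrings "UE4M3" or "ue4m3".
new = re.compile(r".*(UE4M3|ue4m3).*")

for ident in ["query", "module", "scale_ue4m3"]:
    print(f"{ident}: old={bool(old.fullmatch(ident))} new={bool(new.fullmatch(ident))}")
# query: old=True new=False       <- wrongly ignored before (contains 'u', 'e')
# module: old=True new=False      <- wrongly ignored before (contains 'm', 'u', 'e')
# scale_ue4m3: old=True new=True  <- the only one that should be ignored
```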

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-14 11:57:26 +08:00
Cao Yi
6de207de88 [main][Docs] Fix typos across documentation (#6728)
## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-13 15:50:05 +08:00
taoyao1221
41d056f947 [doc] add A2 series doc for GLM5.md (#6717)
### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea
2026-02-12 16:08:17 +08:00
Canlin Guo
052cc4e61b [Docs] Fix GLM-5 deploy command (#6711)
This pull request refines the GLM-5 deployment documentation by updating
the Docker run command to include a more comprehensive set of device
mappings and by removing an extraneous quantization flag from the `vllm
serve` commands. These changes aim to correct and clarify the deployment
instructions, ensuring users can successfully set up and run the GLM-5
model as intended.


- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: Canlin Guo <961750412@qq.com>
2026-02-12 08:55:48 +08:00
iiiklw
a0315f6697 [npugraph_ex] Enable npugraph_ex by default (#6664)
### What this PR does / why we need it?

This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).


### Does this PR introduce _any_ user-facing change?

Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.
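
For reference, a minimal sketch of the opt-out described above. It assumes `additional_config` is forwarded through the engine arguments (as in the `weight_prefetch_config` example further down this log); the model name is a placeholder:

```python
from vllm import LLM

# Disable the new default; the key path follows the PR description:
# additional_config -> npugraph_ex_config -> enable: false
llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    additional_config={"npugraph_ex_config": {"enable": False}},
)
```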

### How was this patch tested?

CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.

Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
2026-02-12 08:44:06 +08:00
rika
b86ea66b0a [doc] Add GLM5.md (#6709)
### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: nakairika <982275964@qq.com>
2026-02-12 04:00:40 +08:00
Icey
88773bb101 [main to main] upgrade main 0210 (#6673)
### What this PR does / why we need it?
upgrade vllm commit to `9562912cead1f11e8540fb91306c5cbda66f0007`

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
all tests passed

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-11 18:10:14 +08:00
wangxiyuan
7d4833bce9 [Doc][Misc] Restructure tutorial documentation (#6501)
### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into
three categories: Models, Features, and Hardware. This improves the
organization and navigation of the tutorials, making it easier for users
to find relevant information.

- The single `tutorials/index.md` is split into three separate index
files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective
new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial
sections.

This change makes the documentation structure more logical and scalable
for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial
documentation pages. Users following old links to tutorials will
encounter broken links. It is recommended to set up redirects if the
documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built
and reviewed locally to ensure all links are correct and the pages
render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-10 15:03:35 +08:00
wangxiyuan
2a826b5fad [Misc] upgrade to vllm main (#6646)
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.

1. ac32e66cf9: the pass file is moved.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
2026-02-10 14:08:59 +08:00
Cao Yi
1c7d1163f5 [main][Docs] Fix spelling errors across documentation (#6649)
Fix various spelling mistakes in the project documentation to improve
clarity and correctness.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-10 11:14:57 +08:00
DreamerLeader
905f0764e0 [DOC] Add Memcache Usage Guide (#6476)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
2026-02-09 21:55:00 +08:00
Li Wang
d018aeb5fa [Image] Bump mooncake version to v0.3.8.post1 (#6428)
### What this PR does / why we need it?
This patch bumps the mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested locally:
`>>> from mooncake.engine import TransferEngine`
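
A minimal version of that local smoke test, assuming the bumped mooncake wheel is installed:

```python
# The PR's local test: the import succeeding means the v0.3.8.post1
# wheel is installed and its native extension loads.
from mooncake.engine import TransferEngine

print(TransferEngine)
```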
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-02-06 10:54:03 +08:00
wangxiyuan
c38166eefa [Doc] backport 0.13.0 release note (#6584)
### What this PR does / why we need it?
Backport 0.13.0 release note to main branch and update related doc link

### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
by doc CI

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-06 10:29:15 +08:00
meihanc
922e5c163b [main2main] upgrade vllm main 0202 (#6560)
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required
positional argument: 'is_sequence_parallel'` due to
https://github.com/vllm-project/vllm/pull/32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock'
and 'int'` due to https://github.com/vllm-project/vllm/pull/33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with
abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool'
object has no attribute 'process_weights_after_loading'` due to
https://github.com/vllm-project/vllm/pull/33284
4. Fix `'AscendSharedFusedMoE' object has no attribute
'_routed_input_transform'` due to
https://github.com/vllm-project/vllm/pull/32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument
'num_active_loras'` due to
https://github.com/vllm-project/vllm/pull/32005
6. Fix the problem caused by `'tuple' object has no attribute 'job_id'`
due to https://github.com/vllm-project/vllm/pull/27492
7. Fix the problem that `all_moe_layers` is not equal to
`vllm.moe_forward`, `vllm.moe_forward_shared` due to
https://github.com/vllm-project/vllm/pull/33184
8. Add a patch to fix the problem "got multiple values for keyword
argument 'add_special_tokens'" due to
https://github.com/vllm-project/vllm/pull/32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 19:31:17 +08:00
DreamerLeader
2dac18afea [Bugfix] Fix of Pooling Code and Update of Pooling Usage Guide (#6126)
### What this PR does / why we need it?
Fix of Pooling Code and Update of Pooling Usage Guide
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Related PR: [[Bugfix] Fixed precision issues caused by pooled request
pooling](https://github.com/vllm-project/vllm-ascend/pull/6049), ready
for review.
- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: fangjianwei <f30058701@china.huawei.com>
2026-02-04 16:35:41 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch for consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch for consistency with the MoE model's
prefetching in terms of code and usage.
The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE, and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is as follows:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
zhangguinan
be5b66de6d [Doc] Contributing a Benchmark Tutorial for Suffix Speculative Decoding (#6323)
### What this PR does / why we need it?
Suffix Decoding is a CPU-based speculative decoding optimization that
accelerates inference by pattern matching and frequency-based prediction
from both prompts and generated content.

This document provides a step-by-step guide for deploying and evaluating
**Suffix Speculative Decoding** on the **Ascend** platform. By analyzing
performance gains across diverse datasets, it demonstrates the
significant advantages of this technology in inference acceleration. Our
goal is to empower developers to achieve high-efficiency model
optimization using Ascend hardware.
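
For orientation, a hedged sketch of what enabling suffix decoding looks like through vLLM's `speculative_config`; the `"suffix"` method name and any extra knobs should be checked against the tutorial itself:

```python
from vllm import LLM, SamplingParams

# Assumption: recent vLLM exposes suffix decoding as a speculative method
# named "suffix"; the exact config keys may differ from this sketch.
llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    speculative_config={"method": "suffix"},
)
outputs = llm.generate(
    ["Suffix decoding reuses repeated patterns from the prompt."],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```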
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
2026-02-03 14:52:38 +08:00
meihanc
c08364f761 [Bugfix] Fix intermittent kv_port conflict with AscendDirectTransport (#6455)
### What this PR does / why we need it?

When using Mooncake on Ascend NPU, AscendDirectTransport randomly
allocates ports within range `[20000, 20000 + npu_per_node × 1000)`.
Reference:
[ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475)

If `kv_port` overlaps with this range, users may encounter intermittent
startup failures:
```bash
zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012')
RuntimeError: KV Cache sending/receiving thread failed to start.
```
This PR fixes the intermittent kv_port conflict with
AscendDirectTransport in `Qwen3-235B-W8A8-EPLB.yaml`, and adds a
`kv_port Configuration Guide` section in
`pd_disaggregation_mooncake_multi_node.md`.
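
A small sketch of the guard this implies; the range formula comes from the PR text, while the helper names and the 16-NPU default are illustrative:

```python
def ascend_direct_transport_port_range(npu_per_node: int) -> range:
    # AscendDirectTransport draws random ports from [20000, 20000 + npu_per_node * 1000)
    return range(20000, 20000 + npu_per_node * 1000)

def check_kv_port(kv_port: int, npu_per_node: int = 16) -> None:
    reserved = ascend_direct_transport_port_range(npu_per_node)
    if kv_port in reserved:
        raise ValueError(
            f"kv_port={kv_port} falls inside AscendDirectTransport's random "
            f"range [{reserved.start}, {reserved.stop}); pick a port outside it."
        )

check_kv_port(30012)  # raises: 30012 sits inside [20000, 36000) on a 16-NPU node
```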

Test results
(tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml):
https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-02-02 17:31:21 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
wangxiyuan
b4aafd4293 [Core][Misc] Clean up ProfileExecuteDuration (#6461)
### What this PR does / why we need it?
This PR removes the custom `ProfileExecuteDuration` utility and its
usages across the codebase. This utility was used for profiling
execution duration of different stages in the inference process. It is
replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`,
which integrates with PyTorch's profiler.

This change simplifies the code by removing a custom implementation in
favor of an upstream utility, improving maintainability. Associated
documentation and tests for `ProfileExecuteDuration` are also removed.
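
A sketch of the replacement pattern, assuming the utility is a context manager keyed by a stage name (its pairing with PyTorch's profiler suggests `torch.profiler.record_function` semantics); `prepare` and `model` are hypothetical stand-ins:

```python
from vllm.v1.utils import record_function_or_nullcontext

def execute_model_stage(batch):
    # Appears as a named range in torch.profiler traces when profiling is
    # active, and degrades to a null context otherwise.
    with record_function_or_nullcontext("prepare_input"):
        inputs = prepare(batch)   # hypothetical helper
    with record_function_or_nullcontext("forward"):
        return model(inputs)      # hypothetical model handle
```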

### Does this PR introduce _any_ user-facing change?
The `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` environment variable is now removed.

### How was this patch tested?
CI passed. The changes are a cleanup and replacement with a standard
utility. Existing tests cover the functionality. The removed feature had
its own tests which are also removed.

Related RFC: #5304

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:01 +08:00
ChenCangtao
46cee945b3 [doc][npugraph_ex] Add npugraph_ex introduction doc (#6306)
### What this PR does / why we need it?
As part of the preparation work for the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214), we have
added documentation about npugraph_ex that explains its usage and FX
graph optimization.
The introduction to FX graph optimization also covers the default
passes, how to implement custom fusion passes, and how to capture the FX
graph during the optimization process through environment variable
configuration.

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 11:21:37 +08:00
Nengjun Ma
597091be9f [Doc] Reranker guide remove deprecated task option (#6385)
### What this PR does / why we need it?
Remove the deprecated `task` option from the reranker guide.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-29 16:00:26 +08:00
CodeCat
54e8389f8e [Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (#6006)
### What this PR does / why we need it?
This PR builds upon PR
https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to
further enhance the npu_graph_ex_passes module. Based on prior work, we
have added graph optimization support for the add_rms_quant fused
operator in scenarios where a bias term is present—ensuring the fusion
pattern is correctly registered and matched into the computation graph.

This time, we performed the operator fusion of MatmulAllReduceAddRMSNorm
and added corresponding ST test cases for regression monitoring.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: cjian <2318164299@qq.com>
2026-01-27 16:41:48 +08:00
meihanc
fea197ad50 [Main2Main] Upgrade vllm commit to 0123 (#6169)
### What this PR does / why we need it?
1.  Upgrade vllm commit to: 0115
(8471b27df97c3eb79f891802fc0e858f8f7ac6a0)
Modify import paths due to the refactors:
https://github.com/vllm-project/vllm/pull/32245
https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913
2. Upgrade vllm commit to: 0119
(9a1f16da1e423ede2c2f52a9850cbfbb39cefe96)
Fix `WorkerProc.__init__() missing 1 required positional argument:
'is_driver_worker'` due to
https://github.com/vllm-project/vllm/pull/28506
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569
3. Upgrade vllm commit to:
0120 (148117ea2e689cd43df4be6892671a17cdae5833)
1. Add `skip_compiled` param in `set_forward_context` due to
https://github.com/vllm-project/vllm/pull/30385
2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to
https://github.com/vllm-project/vllm/pull/24322
change `self.max_num_tokens =
vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size`
3. Modify UT import paths due to the
refactors:https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946
4. Upgrade vllm commit to:
0121 (f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9)
1. vLLM switched `uses_mrope` from target to draft model config, making
`positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's
direct self.positions access and tests missing
`draft_model_config.uses_mrope`.
https://github.com/vllm-project/vllm/pull/32048
2. Moved bs_to_padded_graph_size from CompilationConfig to
CudagraphDispatcher due to the refactor
https://github.com/vllm-project/vllm/pull/30143
3. Remove unused `maybe_setup_kv_connector` due to
https://github.com/vllm-project/vllm/pull/32077
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834
5. Upgrade vllm commit to:
0122 (8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5)
Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig
due to https://github.com/vllm-project/vllm/pull/32414
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054
6. Upgrade vllm commit to:
0123 (dc917cceb877dfd13f98c538c4c96158047d98bd)
Setting temperature=0.0 due to the removal of the default temperature
value in https://github.com/vllm-project/vllm/pull/32723
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wjunLu <wjunlu217@gmail.com>
2026-01-27 08:44:36 +08:00
wangxiyuan
d9979f4d13 [Doc] quick fix for vllm-ascend version (#6278)
Correct vllm-ascend version name in doc

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-26 19:33:18 +08:00
wangxiyuan
cb553f8eee [Community] Nominate whx-sjtu as maintainer (#6268)
Since the first releases of 2026, v0.13.0rc2 and v0.14.0rc1, are out, we
are refreshing the maintainer team. I nominate whx-sjtu as the new
maintainer.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-26 19:22:26 +08:00
Nengjun Ma
f910cebe04 [Doc] 310P Documents update (#6246)
### What this PR does / why we need it?
Update the 310P support guides to reflect what is currently supported on the main branch.

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-26 14:33:21 +08:00
wangxiyuan
52d4acfa51 [Doc] add release note for v0.14.0rc1 (#6225)
Add release note for v0.14.0rc1

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-26 14:22:40 +08:00
Li Wang
c26ad78f86 [CI][lint] Add rule codespell back (#6236)
### What this PR does / why we need it?
After removing codespell for a while, we discovered that `typos` had
trouble correctly recognizing certain misspelled words, so we suggest
adding it back.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 14:12:33 +08:00
Shanshan Shen
e3eefdecbd [Doc] Update max_tokens to max_completion_tokens in all docs (#6248)
### What this PR does / why we need it?

Fix:

```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```
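
The corresponding client-side change for doc snippets that used the deprecated field, using the standard openai-python client against a local vLLM server (model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # placeholder model
    messages=[{"role": "user", "content": "Hello"}],
    max_completion_tokens=64,  # was: max_tokens=64 (deprecated)
)
print(resp.choices[0].message.content)
```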

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 11:57:40 +08:00
wangxiyuan
99bdd7363c [CI] update vLLM to 0.14.1 (#6222)
Upgrade vLLM to 0.14.1
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-25 17:52:16 +08:00
Icey
7799c4ca3b [Fusion] change fusion env variable (#6201)
### What this PR does / why we need it?
Since CI has integrated Triton, `fuse_qknorm_rope` is enabled by
default.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-24 22:49:33 +08:00
wangxiyuan
21833a4321 [Doc] Add release note for 0.13.0rc2 (#6207)
Add release note for 0.13.0rc2

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-24 12:51:47 +08:00
liziyu
14bef9af6f [P/D] Remove restrictions on mooncake for IPv6 (#5946)
### What this PR does / why we need it?
Remove restrictions on mooncake for IPv6.
Dependencies: CANN 8.5, mooncake v0.3.8.post1

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-24 11:30:22 +08:00
zhangyiming
56d8f088dd [Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node deployment (#6196)
### What this PR does / why we need it?
[Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node
deployment

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: menogrey <1299267905@qq.com>
2026-01-24 11:29:07 +08:00
zhaomingyu13
2dd68652bc [Doc] Add the setting description of cudagraph_capture_sizes in speculative decoding user guide (#5637)
### What this PR does / why we need it?
Add the setting description of cudagraph_capture_sizes and guide users
away from the common mistakes made when using EAGLE with full graph
mode.
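
A hedged example of the setting in question via vLLM's compilation config; the sizes are illustrative, not the guide's recommendation:

```python
from vllm import LLM

# Capture sizes must cover the effective decode batch sizes; with EAGLE
# speculative decoding the effective batch grows with the number of draft
# tokens, which is the kind of mismatch the guide likely warns about.
llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16]},
)
```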
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need for testing
- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 23:22:44 +08:00
Angazenn
1e116829ac [doc] Update --max-num-seqs in Qwen3-235b tutorial (#6197)
### What this PR does / why we need it?
This PR updates --max-num-seqs in the Qwen3-235B single-node deployment
tutorial to ensure the model correctly runs in graph mode.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: Angazenn <supperccell@163.com>
2026-01-23 17:11:10 +08:00
Cao Yi
a69ef10c3a [Refactor] Quantization Module Refactor (#5738)
### Summary

This PR refactors the `vllm_ascend/quantization` module to improve code
organization, maintainability, and extensibility. The refactoring
introduces a clear separation of concerns with a registry-based scheme
discovery pattern, abstract base classes for quantization schemes, and
dedicated wrapper classes.

### Key Changes

#### 1. **Modular Directory Structure**

| Before | After |
|--------|-------|
| Flat file structure with mixed responsibilities | Organized into
`methods/` subpackage for schemes |
| Single `quant_config.py` (600+ lines) | Separate config files:
`modelslim_config.py`, `compressed_tensors_config.py` |
| `utils.py` with scheme lookup logic | `methods/registry.py` with
decorator-based registration |

#### 2. **Registry-Based Scheme Discovery**

Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a
decorator-based registry pattern:

```python
# Before: Manual dictionary mapping
ASCEND_QUANTIZATION_METHOD_MAP = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...},
    ...
}

# After: Decorator-based registration
@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    ...
```

#### 3. **Abstract Base Classes**

Introduced three abstract base classes in `methods/base.py`:
- `AscendLinearScheme` - Base for linear layer quantization
- `AscendMoEScheme` - Base for MoE layer quantization  
- `AscendAttentionScheme` - Base for attention layer quantization

#### 4. **Separated Config and Wrapper Classes**

- **Config classes** (`AscendModelSlimConfig`,
`AscendCompressedTensorsConfig`): Handle config parsing and scheme
selection
- **Wrapper classes** (`AscendLinearMethod`, `AscendFusedMoEMethod`,
etc.): Implement vLLM interfaces and delegate to schemes

#### 5. **Cleaner Public API**

```python
# New clean module interface
from vllm_ascend.quantization import (
    AscendModelSlimConfig,
    AscendCompressedTensorsConfig,
)
from vllm_ascend.quantization.methods import get_scheme_class
```
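
A usage sketch of the registry lookup exposed above, with names taken from this PR's description:

```python
from vllm_ascend.quantization.methods import get_scheme_class

# Resolve the scheme class registered for (quant_type, layer_type) ...
scheme_cls = get_scheme_class("W8A8_DYNAMIC", "linear")
# ... and instantiate it, as the config classes do at runtime.
scheme = scheme_cls()
```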

### Architecture Diagram

```mermaid
classDiagram
    direction TB
    
    class QuantizationConfig {
        <<vLLM Interface>>
        +get_quant_method()
    }
    
    class AscendModelSlimConfig {
        +quant_description
        +get_quant_method()
        -create_scheme_for_layer()
    }
    
    class AscendCompressedTensorsConfig {
        +target_scheme_map
        +get_quant_method()
        -_get_scheme_from_parts()
    }
    
    class AscendLinearMethod {
        <<Wrapper>>
        +quant_method: AscendLinearScheme
        +create_weights()
        +apply()
    }
    
    class AscendFusedMoEMethod {
        <<Wrapper>>
        +quant_method: AscendMoEScheme
        +create_weights()
        +apply()
    }
    
    class AscendLinearScheme {
        <<Abstract>>
        +get_weight()*
        +apply()*
        +get_pertensor_param()
        +get_perchannel_param()
    }
    
    class AscendMoEScheme {
        <<Abstract>>
        +get_weight()*
        +get_dynamic_quant_param()*
        +apply()*
    }
    
    class W8A8DynamicLinear {
        +get_weight()
        +apply()
    }
    
    class W8A8DynamicMoE {
        +get_weight()
        +apply()
    }
    
    QuantizationConfig <|-- AscendModelSlimConfig
    QuantizationConfig <|-- AscendCompressedTensorsConfig
    
    AscendModelSlimConfig ..> AscendLinearMethod : creates
    AscendModelSlimConfig ..> AscendFusedMoEMethod : creates
    AscendCompressedTensorsConfig ..> AscendLinearMethod : creates
    AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates
    
    AscendLinearMethod o-- AscendLinearScheme : delegates to
    AscendFusedMoEMethod o-- AscendMoEScheme : delegates to
    
    AscendLinearScheme <|-- W8A8DynamicLinear
    AscendMoEScheme <|-- W8A8DynamicMoE
```

### Scheme Registration Flow

```mermaid
sequenceDiagram
    participant Module as Scheme Module
    participant Registry as _SCHEME_REGISTRY
    participant Config as QuantConfig
    participant Wrapper as Wrapper Class
    
    Note over Module: At import time
    Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear")
    Registry->>Registry: Store (quant_type, layer_type) -> Class
    
    Note over Config: At runtime
    Config->>Config: Determine quant_type from description
    Config->>Registry: get_scheme_class(quant_type, layer_type)
    Registry-->>Config: Return scheme class
    Config->>Config: scheme = scheme_cls()
    Config->>Wrapper: Create wrapper with scheme
    Wrapper-->>Config: Return wrapper instance
```

### File Changes Summary

| Original Files | Refactored Files |
|----------------|------------------|
| `__init__.py` (empty) | `__init__.py` (exports public API) |
| `quant_config.py` | `modelslim_config.py` + `wrappers.py` |
| `compressed_tensors/` | `compressed_tensors_config.py` |
| `utils.py` | `methods/registry.py` |
| `w8a8_dynamic.py` | `methods/w8a8_dynamic.py` |
| `w8a8.py` | `methods/w8a8_static.py` |
| `w4a4_flatquant_dynamic.py` | `methods/w4a4_flatquant.py` |
| ... | `methods/base.py` (new) |

### Benefits

1. **Extensibility**: Adding new quantization schemes only requires
implementing the base class and adding the `@register_scheme` decorator
(see the sketch after this list)
2. **Maintainability**: Clear separation between config parsing, wrapper
logic, and scheme implementation
3. **Testability**: Abstract base classes enable easier unit testing and
mocking
4. **Discoverability**: Registry pattern makes it easy to list all
supported schemes
5. **Reduced Coupling**: Config classes no longer need to know about all
scheme implementations
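
To make benefit 1 concrete, a minimal sketch of adding a new scheme under this design; the import paths follow the file layout above, while the quant type and method signatures are illustrative:

```python
from vllm_ascend.quantization.methods.base import AscendLinearScheme
from vllm_ascend.quantization.methods.registry import register_scheme

@register_scheme("W4A8_DYNAMIC", "linear")  # hypothetical new quant type
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
    def get_weight(self, *args, **kwargs):
        ...  # declare the quantized weight layout

    def apply(self, *args, **kwargs):
        ...  # run the dequantize-and-matmul kernel
```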

___

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-01-23 14:13:47 +08:00
Li Wang
4d780a8b01 [Misc] Revert "[Misc] Bump mooncake version to v0.3.8.post1 (#6110)" (#6164)
### What this PR does / why we need it?
The new version of mooncake leads to an image build failure (see
https://github.com/vllm-project/vllm-ascend/actions/runs/21236469259/job/61105443733);
we should revert it first.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-23 09:53:32 +08:00
zhangxinyuehfad
08a45e6053 [Doc] update supported features (#6165)
### What this PR does / why we need it?

update supported features


- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 09:50:11 +08:00
zhangxinyuehfad
819a4459ce Drop vLLM 0.13.0 support (#6069)
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 09:45:08 +08:00
wjunLu
88632cf976 [CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (#6145)
### What this PR does / why we need it?
Upgrade wheel building's CANN to 8.5.0 and update the Docs


- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-22 19:50:54 +08:00
meihanc
e54d294df3 [CI] Install clang in Dockerfile for triton ascend (#4409)
### What this PR does / why we need it?
Install clang in Dockerfile for triton ascend

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-22 19:01:28 +08:00
wjunLu
a7d781f135 [Main] Upgrade PTA to 2.9.0 (#6112)
### What this PR does / why we need it?
Upgrade PTA to 2.9.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-22 17:59:06 +08:00
Li Wang
37a9cf818a [Misc] Bump mooncake version to v0.3.8.post1 (#6110)
### What this PR does / why we need it?
Since mooncake has a newer
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1),
we pin the tag to the latest release.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-22 11:03:16 +08:00
wangxiyuan
69740039b7 [CI] Upgrade CANN to 8.5.0 (#6070)
### What this PR does / why we need it?
1. Upgrade CANN to 8.5.0
2. move triton-ascend 3.2.0 to requirements

Note: we skipped the two failing e2e tests; see
https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail.
We'll fix them soon.


### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/5494

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-22 09:29:50 +08:00
Nengjun Ma
ab676413e6 Default enable MLAPO (#5952)
### What this PR does / why we need it?
1) Enable MLAPO by default for DeepSeek MLA attention W8A8 models on the
PD disaggregation D instance, for example: DeepSeekV3-W8A8,
DeepSeek-R1-W8A8.
2) Enable MLAPO by default for DeepSeek SFA attention W8A8 models,
currently DeepSeek-V3.2-W8A8.

### Does this PR introduce _any_ user-facing change?
Users no longer need to manually set VLLM_ASCEND_ENABLE_MLAPO=1 to
enable the MLAPO feature for DeepSeek W8A8 models.

The effect of enabling MLAPO for the SFA model deployed on a single A3
node, tested with
tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py on
the gsm8k-lite dataset, without MTP, FULL GRAPH; output throughput
improves by about 19%:

| Metric | MLAPO off (previous default) | MLAPO on (new default) |
| --- | --- | --- |
| TTFT | 14055.8836 ms | 3753.1547 ms |
| ITL | 66.8171 ms | 61.4236 ms |
| Output Token Throughput | 104.9105 token/s | 125.2075 token/s |

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-22 09:26:39 +08:00
MengLong Chen
a15a5f6aa5 [Doc] Supplement PD separation parameters of DeepSeek V3.1 (#6053)
### What this PR does / why we need it?
Supplement PD separation parameters of DeepSeek V3.1.
The recommended parameter configuration for DeepSeek V3.1 in the EP32
scenario after PD separation has been adjusted, and the core parameters
have been described in detail.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2026-01-22 08:53:44 +08:00
meihanc
53bfb38192 [CI] Update triton ascend version in 3.2.0 (#6067)
### What this PR does / why we need it?
update triton ascend version in 3.2.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-01-21 16:02:23 +08:00
Magnus
5b129cf0a1 [1/N][Feat] Xlite Qwen3 MoE Support (#5951)
### What this PR does / why we need it?
This patch adds support for the Qwen3-MoE model in Xlite. For more
details about Xlite, please refer to the following link:
https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md

Qwen3-MoE TODO List:
- [ ] Qwen3-235B-A22B support
- [ ] Qwen3-MoE weights NZ support
- [ ] Qwen3-MoE data parallel support

## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance
Comparison
- aclgraph: main(69b170b8b5)
- xlite-full: main + xlite-full
- xlite-decode-only: main + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph

| maxconcurrency | item | TTFT Avg (ms) | TTFT P99 (ms) | TPOT Avg (ms) | TPOT P99 (ms) | QPS (req/s) | OutputSpeed (token/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | baseline-aclgraph | 205.07 | 287.29 | 12.34 | 12.65 | 0.14 | 78.81 |
| 1 | xlite-full | 66.40 | 113.69 | 11.71 | 12.40 | 0.15 | 84.73 |
| 1 | xlite-decode-only | 221.15 | 316.40 | 12.16 | 12.91 | 0.14 | 79.70 |
| 1 | diff1 | -67.62% | -60.43% | -5.11% | -1.98% | 7.14% | 7.51% |
| 1 | diff2 | 7.84% | 10.13% | -1.46% | 2.06% | 0.00% | 1.13% |
| 16 | baseline-aclgraph | 1892.16 | 13916.86 | 22.78 | 39.28 | 1.15 | 589.89 |
| 16 | xlite-full | 1355.40 | 8907.45 | 15.96 | 25.15 | 1.65 | 850.21 |
| 16 | xlite-decode-only | 1519.42 | 8711.64 | 19.23 | 29.73 | 1.38 | 711.60 |
| 16 | diff1 | -28.37% | -36.00% | -29.94% | -35.97% | 43.48% | 44.13% |
| 16 | diff2 | -19.70% | -37.40% | -15.58% | -24.31% | 20.00% | 20.63% |
| 32 | baseline-aclgraph | 673.80 | 3914.90 | 32.20 | 37.95 | 1.80 | 928.54 |
| 32 | xlite-full | 481.65 | 2710.50 | 19.95 | 25.35 | 2.91 | 1506.67 |
| 32 | xlite-decode-only | 372.22 | 1095.25 | 25.19 | 28.47 | 2.33 | 1202.82 |
| 32 | diff1 | -28.52% | -30.76% | -38.04% | -33.20% | 61.67% | 62.26% |
| 32 | diff2 | -44.76% | -72.02% | -21.77% | -24.98% | 29.44% | 29.54% |
| 48 | baseline-aclgraph | 583.18 | 3277.65 | 41.02 | 46.05 | 2.17 | 1115.08 |
| 48 | xlite-full | 973.42 | 8237.33 | 23.29 | 30.50 | 3.71 | 1908.09 |
| 48 | xlite-decode-only | 480.79 | 2026.98 | 31.48 | 35.41 | 2.83 | 1453.75 |
| 48 | diff1 | 66.92% | 151.32% | -43.22% | -33.77% | 70.97% | 71.12% |
| 48 | diff2 | -17.56% | -38.16% | -23.26% | -23.11% | 30.41% | 30.37% |
| 64 | baseline-aclgraph | 742.74 | 5953.39 | 47.79 | 53.15 | 2.48 | 1272.37 |
| 64 | xlite-full | 545.22 | 3941.34 | 25.09 | 30.41 | 4.64 | 2376.44 |
| 64 | xlite-decode-only | 752.40 | 4534.29 | 38.67 | 43.28 | 3.06 | 1567.94 |
| 64 | diff1 | -26.59% | -33.80% | -47.50% | -42.78% | 87.10% | 86.77% |
| 64 | diff2 | 1.30% | -23.84% | -19.08% | -18.57% | 23.39% | 23.23% |
| 100 | baseline-aclgraph | 565.52 | 1716.81 | 60.89 | 68.69 | 3.08 | 1580.64 |
| 100 | xlite-full | 398.14 | 2328.88 | 30.70 | 32.45 | 6.01 | 3086.42 |
| 100 | xlite-decode-only | 712.53 | 4875.94 | 52.71 | 60.78 | 3.53 | 1813.58 |
| 100 | diff1 | -29.60% | 35.65% | -49.58% | -52.76% | 95.13% | 95.26% |
| 100 | diff2 | 26.00% | 184.01% | -13.43% | -11.52% | 14.61% | 14.74% |
| 150 | baseline-aclgraph | 842.42 | 5175.01 | 73.60 | 88.18 | 3.80 | 1952.26 |
| 150 | xlite-full | 568.52 | 4204.33 | 37.90 | 40.01 | 7.27 | 3734.72 |
| 150 | xlite-decode-only | 654.43 | 2504.06 | 67.40 | 77.00 | 4.18 | 2145.11 |
| 150 | diff1 | -32.51% | -18.76% | -48.51% | -54.63% | 91.32% | 91.30% |
| 150 | diff2 | -22.32% | -51.61% | -8.42% | -12.68% | 10.00% | 9.88% |
| 200 | baseline-aclgraph | 750.63 | 3049.91 | 88.26 | 101.95 | 4.28 | 2189.72 |
| 200 | xlite-full | 558.48 | 3791.98 | 45.54 | 49.04 | 8.17 | 4175.52 |
| 200 | xlite-decode-only | 807.09 | 4254.95 | 85.18 | 101.79 | 4.44 | 2271.52 |
| 200 | diff1 | -25.60% | 24.33% | -48.40% | -51.90% | 90.89% | 90.69% |
| 200 | diff2 | 7.52% | 39.51% | -3.49% | -0.16% | 3.74% | 3.74% |

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: LVYANGGUO <275926687@qq.com>
Co-authored-by: lulina <lina.lulina@huawei.com>
2026-01-21 09:26:03 +08:00