97 Commits

Author SHA1 Message Date
pz1116
14411e911e [Doc][0.18.0][KV Pool]add mooncake rdma timeout (#7784)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
- Add `default_kv_lease_ttl` to the `mooncake.json` example in the KV
Pool guide.
- Document `default_kv_lease_ttl` semantics and clarify that it should
be larger than `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.
- Add `HCCL_RDMA_TIMEOUT` explanation for Mooncake RDMA retransmission
timeout, including the recommended constraint note.
- Add `HCCL_RDMA_TIMEOUT=16` to relevant KV Pool environment setup
examples for consistency.

<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-03-31 20:17:03 +08:00
Marck
17da96658f [ModelLoader][Feature] Add rfork support for fast model loading (#7392)
### What this PR does / why we need it?
Support an new load format: RFORK

For implementation details of this feature, please refer to #7441


### Does this PR introduce _any_ user-facing change?

add an new options for load-format: rfork

e.g.
```bash
vllm serve /workspace/models/Qwen3-8B --load-format rfork
```

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Marck <1412354149@qq.com>
2026-03-25 16:40:30 +08:00
realliujiaxu
5d12446573 [Feat][SP] Suport SP for VL MoE models (#7044)
### What this PR does / why we need it?

2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.


### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.


### How was this patch tested?
- Model: Qwen3-VL-30B-A3B, 
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k         | 429.40               | 323.3                  |
| 16k        | 1297.01              | 911.74                |

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-24 17:16:00 +08:00
Zetong Li
a253235a59 [Doc] Add note for unsupported PCP + FULL (#7559)
### What this PR does / why we need it?
This PR aims to add note in doc that FULL mode is not supported in PCP
scenario.

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-23 17:34:51 +08:00
pz1116
6fc190b44a [Doc][KV Pool]Revision KV Pool User Guide [2/2] (#7456)
### What this PR does / why we need it?
Revise the KV Pool user guide:
4. Revise parameters for Memcache for better clarity, at notification
that currently heterogeneous protocol setting is not supported (e.g.
enable `device_rdma` and `device_sdma` at the same time, a example
scenario would be data transfer by memcache across different super pods)
5. Modify the condition for Mooncakestore warmup, warmup is now needed
only when `ASCEND_BUFFER_POOL` is enabled.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
2026-03-19 16:17:34 +08:00
pz1116
3effc4bc70 [Doc][KV Pool]Revision KV Pool User Guide (#7434)
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kvconnector extra configs.
2. Delete `use_ascend_direct` in kv connector extra config as it is
deprecated
3. Delete `kv_buffer_device` and `kv_rank` in P2P mooncake config
4. Unifies default `max-model-len` and `max-num-batch-tokens` in
examples given.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
2026-03-19 10:13:13 +08:00
Junyuan
6852a2e267 [feat] add LMCacheAscendConnector (#6882)
### What this PR does / why we need it?

LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.

We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.

### Does this PR introduce _any_ user-facing change?

Users can specify the kvconnector using `--kv-transfer-config`, allowing
them to freely choose which kvconnector to use, without any user-facing
change.

### How was this patch tested?

Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: chloroethylene <jjysama@gmail.com>
2026-03-13 17:41:35 +08:00
shaopeng-666
592661e787 [Doc] EPD doc and load-balance proxy example (#6221)
Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2026-03-12 16:17:17 +08:00
Frank Chen
14c71b19e1 [Doc][CPU binding] Add user/developer guide for CPU binding (#7045)
### What this PR does / why we need it?
This PR adds comprehensive documentation for the CPU binding feature on
Ascend NPUs. It includes:

- A detailed developer guide
(`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering
the design, internal logic, allocation examples, and troubleshooting for
the CPU binding mechanism.
- A concise user guide
(`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the
core concepts, usage, and common issues for end-users.
- An update to `additional_config.md` to use consistent terminology for
binding strategies (`global-slicing` and `topo-affinity`).

This documentation is needed to help both developers and users
understand, use, and debug the CPU binding feature, which is critical
for performance on ARM+Ascend platforms.

### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update.

### How was this patch tested?
The documentation has been reviewed for clarity and technical accuracy.
The examples and descriptions align with the implementation in
`vllm_ascend/cpu_binding.py`.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: c00818886 <chenchuwei@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-03-10 15:59:31 +08:00
pz1116
a7820d20f4 [Doc][KV Pool]Update Memcache local service config example: increase default world size to 256 and update description (#7025)
### What this PR does / why we need it?
Update Memcache local service config example: increase default world
size to 256 and update the description for better clarity.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-03-06 10:23:55 +08:00
LI SHENGYONG
ccd00798f3 [EPLB] Display the expert hotness comparison before and after eplb. (#6877)
### What this PR does / why we need it?
To intuitively show the effect of the eplb algorithm, we print the
expert heat before and after eplb.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![Snipaste_2026-02-28_17-23-42](https://github.com/user-attachments/assets/db1dadd1-cf96-44da-af34-57d41ccf412f)


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-06 09:53:29 +08:00
fems14
ae394767d4 【main】ADXL/HIXL supports FabricMem Mode (#6806)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-03-05 21:04:11 +08:00
wangxiyuan
13777bf3f0 [Spec Decode]clean up spec decode interface (#6947)
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.

This is the first step. In the future we should remove the class
register and just add few Ascend specified method once the arch in vLLM
is ready.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:30:10 +08:00
Ronald
77e009d9fc [Feature] Add docs of batch invariance and make some extra operators patch (#6910)
### What this PR does / why we need it?

This PR add docs of batch invariance and make some extra operators
according to validation result.
please see https://github.com/vllm-project/vllm-ascend/issues/5487 to
track progress.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-05 09:12:40 +08:00
whx
16c879cdf7 [Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518)
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-02 17:54:25 +08:00
wangxiyuan
a95c0b8b82 [Doc] fix the nit in docs (#6826)
Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 11:50:27 +08:00
realliujiaxu
5def28dcd3 [Feat]support sequence parallelism by pass for VL models (#5632) 2026-02-27 08:27:41 +08:00
Cao Yi
6de207de88 [main][Docs] Fix typos across documentation (#6728)
## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-13 15:50:05 +08:00
wangxiyuan
7d4833bce9 [Doc][Misc] Restructure tutorial documentation (#6501)
### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into
three categories: Models, Features, and Hardware. This improves the
organization and navigation of the tutorials, making it easier for users
to find relevant information.

- The single `tutorials/index.md` is split into three separate index
files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective
new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial
sections.

This change makes the documentation structure more logical and scalable
for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial
documentation pages. Users following old links to tutorials will
encounter broken links. It is recommended to set up redirects if the
documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built
and reviewed locally to ensure all links are correct and the pages
render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-10 15:03:35 +08:00
Cao Yi
1c7d1163f5 [main][Docs] Fix spelling errors across documentation (#6649)
Fix various spelling mistakes in the project documentation to improve
clarity and correctness.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-10 11:14:57 +08:00
DreamerLeader
905f0764e0 [DOC]Add Memcache Usage Guide (#6476)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
2026-02-09 21:55:00 +08:00
Li Wang
d018aeb5fa [Image] Bump mooncake version to v0.3.8.post1 (#6428)
### What this PR does / why we need it?
This patch bump the mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test is locally
>>> from mooncake.engine import TransferEngine
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-02-06 10:54:03 +08:00
DreamerLeader
2dac18afea [Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (#6126)
### What this PR does / why we need it?
Fix of Pooling Code and Update of Pooling Usage Guide
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
pr:[[Bugfix]Fixed precision issues caused by pooled request
pooling](https://github.com/vllm-project/vllm-ascend/pull/6049)
readyhttps://github.com/vllm-project/vllm-ascend/pull/6049
read for review
- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: fangjianwei <f30058701@china.huawei.com>
2026-02-04 16:35:41 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch to consistency with MoE Model's prefetching
in terms of code and usage.
Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
ChenCangtao
46cee945b3 [doc][npugraph_ex]add npugraph_ex introduction doc (#6306)
### What this PR does / why we need it?
As part of the preparation work for the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214)
We have added a documentation about npugraph_ex, which mainly explains
and introduces its usage and FX graph optimization.
The introduction to FX graph optimization also includes specific
explanations of the default passes, the implementation methods for
custom fusion passes, and how to capture the FX graph during the
optimization process through environment variable configuration.

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 11:21:37 +08:00
Shanshan Shen
e3eefdecbd [Doc] Update max_tokens to max_completion_tokens in all docs (#6248)
### What this PR does / why we need it?

Fix:

```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 11:57:40 +08:00
zhaomingyu13
2dd68652bc [Doc] Add the setting description of cudagraph_capture_sizes in speculative decoding user guide (#5637)
### What this PR does / why we need it?
Add the setting description of cudagraph_capture_sizes, guide users to
avoid the common mistakes frequently made when using the EAGLE overlay
fullgraph.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need for testing
- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 23:22:44 +08:00
Li Wang
4d780a8b01 [Misc] Revert "[Misc] Bump mooncake version to v0.3.8.post1 (#6110)" (#6164)
### What this PR does / why we need it?
The new version of moonkcake lead to the image build failure. see
https://github.com/vllm-project/vllm-ascend/actions/runs/21236469259/job/61105443733,
we should revert it first
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-23 09:53:32 +08:00
Li Wang
37a9cf818a [Misc] Bump mooncake version to v0.3.8.post1 (#6110)
### What this PR does / why we need it?
Since the mooncake has the newer
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1),
we pin the tag to latest release

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-22 11:03:16 +08:00
Magnus
5b129cf0a1 [1/N][Feat] Xlite Qwen3 MoE Support (#5951)
### What this PR does / why we need it?
This patch adds support for the Qwen3-MoE model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.

Qwen3-MoE TODO List:
- [ ] Qwen3-235B-A22B support
- [ ] Qwen3-MoE weights NZ support
- [ ] Qwen3-MoE data parallel support

## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance
Comparison
- aclgraph: main(69b170b8b5)
- xlite-full: main + xlite-full
- xlite-decode-only: main + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph

| maxconcurrency | item | TTFT(ms) | | TPOT(ms) | | QPS (req/s) |
OutputSpeed (token/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | Avg | P99 | Avg | P99 |  |  |
| 1 | baseline-aclgraph | 205.07 | 287.29 | 12.34 | 12.65 | 0.14 | 78.81
|
| 1 | xlite-full | 66.40 | 113.69 | 11.71 | 12.40 | 0.15 | 84.73 |
| 1 | xlite-decode-only | 221.15 | 316.40 | 12.16 | 12.91 | 0.14 | 79.70
|
| 1 | diff1 | -67.62% | -60.43% | -5.11% | -1.98% | 7.14% | 7.51% |
| 1 | diff2 | 7.84% | 10.13% | -1.46% | 2.06% | 0.00% | 1.13% |
|  |  |  |  |  |  |  |  |
| 16 | baseline-aclgraph | 1892.16 | 13916.86 | 22.78 | 39.28 | 1.15 |
589.89 |
| 16 | xlite-full | 1355.40 | 8907.45 | 15.96 | 25.15 | 1.65 | 850.21 |
| 16 | xlite-decode-only | 1519.42 | 8711.64 | 19.23 | 29.73 | 1.38 |
711.60 |
| 16 | diff1 | -28.37% | -36.00% | -29.94% | -35.97% | 43.48% | 44.13% |
| 16 | diff2 | -19.70% | -37.40% | -15.58% | -24.31% | 20.00% | 20.63% |
|  |  |  |  |  |  |  |  |
| 32 | baseline-aclgraph | 673.80 | 3914.90 | 32.20 | 37.95 | 1.80 |
928.54 |
| 32 | xlite-full | 481.65 | 2710.50 | 19.95 | 25.35 | 2.91 | 1506.67 |
| 32 | xlite-decode-only | 372.22 | 1095.25 | 25.19 | 28.47 | 2.33 |
1202.82 |
| 32 | diff1 | -28.52% | -30.76% | -38.04% | -33.20% | 61.67% | 62.26% |
| 32 | diff2 | -44.76% | -72.02% | -21.77% | -24.98% | 29.44% | 29.54% |
|  |  |  |  |  |  |  |  |
| 48 | baseline-aclgraph | 583.18 | 3277.65 | 41.02 | 46.05 | 2.17 |
1115.08 |
| 48 | xlite-full | 973.42 | 8237.33 | 23.29 | 30.50 | 3.71 | 1908.09 |
| 48 | xlite-decode-only | 480.79 | 2026.98 | 31.48 | 35.41 | 2.83 |
1453.75 |
| 48 | diff1 | 66.92% | 151.32% | -43.22% | -33.77% | 70.97% | 71.12% |
| 48 | diff2 | -17.56% | -38.16% | -23.26% | -23.11% | 30.41% | 30.37% |
|  |  |  |  |  |  |  |  |
| 64 | baseline-aclgraph | 742.74 | 5953.39 | 47.79 | 53.15 | 2.48 |
1272.37 |
| 64 | xlite-full | 545.22 | 3941.34 | 25.09 | 30.41 | 4.64 | 2376.44 |
| 64 | xlite-decode-only | 752.40 | 4534.29 | 38.67 | 43.28 | 3.06 |
1567.94 |
| 64 | diff1 | -26.59% | -33.80% | -47.50% | -42.78% | 87.10% | 86.77% |
| 64 | diff2 | 1.30% | -23.84% | -19.08% | -18.57% | 23.39% | 23.23% |
|  |  |  |  |  |  |  |  |
| 100 | baseline-aclgraph | 565.52 | 1716.81 | 60.89 | 68.69 | 3.08 |
1580.64 |
| 100 | xlite-full | 398.14 | 2328.88 | 30.70 | 32.45 | 6.01 | 3086.42 |
| 100 | xlite-decode-only | 712.53 | 4875.94 | 52.71 | 60.78 | 3.53 |
1813.58 |
| 100 | diff1 | -29.60% | 35.65% | -49.58% | -52.76% | 95.13% | 95.26% |
| 100 | diff2 | 26.00% | 184.01% | -13.43% | -11.52% | 14.61% | 14.74% |
|  |  |  |  |  |  |  |  |
| 150 | baseline-aclgraph | 842.42 | 5175.01 | 73.60 | 88.18 | 3.80 |
1952.26 |
| 150 | xlite-full | 568.52 | 4204.33 | 37.90 | 40.01 | 7.27 | 3734.72 |
| 150 | xlite-decode-only | 654.43 | 2504.06 | 67.40 | 77.00 | 4.18 |
2145.11 |
| 150 | diff1 | -32.51% | -18.76% | -48.51% | -54.63% | 91.32% | 91.30%
|
| 150 | diff2 | -22.32% | -51.61% | -8.42% | -12.68% | 10.00% | 9.88% |
|  |  |  |  |  |  |  |  |
| 200 | baseline-aclgraph | 750.63 | 3049.91 | 88.26 | 101.95 | 4.28 |
2189.72 |
| 200 | xlite-full | 558.48 | 3791.98 | 45.54 | 49.04 | 8.17 | 4175.52 |
| 200 | xlite-decode-only | 807.09 | 4254.95 | 85.18 | 101.79 | 4.44 |
2271.52 |
| 200 | diff1 | -25.60% | 24.33% | -48.40% | -51.90% | 90.89% | 90.69% |
| 200 | diff2 | 7.52% | 39.51% | -3.49% | -0.16% | 3.74% | 3.74% |
|  |  |  |  |  |  |  |  |

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: LVYANGGUO <275926687@qq.com>
Co-authored-by: lulina <lina.lulina@huawei.com>
2026-01-21 09:26:03 +08:00
zzhxxx
05e69b99e5 [Doc] Remove Chinese characters from the icons in the doc. (#5959)
### What this PR does / why we need it?
Remove Chinese characters from the icons in the doc.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2026-01-18 07:22:57 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depend on pr5817

### Does this PR introduce _any_ user-facing change?

before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

after this pr: 
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### test qwen3-235b eplb num_redundant_experts=16

without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
SILONG ZENG
4811ba62e0 [Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2026-01-15 09:06:01 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer realtaed codes into a dedicated folder, which has already
been done in vLLM

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00
LHXuuu
0415e694cd [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718)
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Support Moe model W8A8 Int8 dynamic weight.
2. Specify W4A16 quantization configuration.

Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-01-14 09:17:26 +08:00
liziyu
451bbdc292 [Doc] add tls check to pd disaggregation readme (#5638)
### What this PR does / why we need it?

update pd disaggregation multi_node readme, update the environment check
command for A3, add tls check
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-12 15:49:18 +08:00
wangxiyuan
354ee3b330 [Doc] Update doc url link (#5781)
Drop `dev` suffix for doc url.
Rename url to `https://docs.vllm.ai/projects/ascend`

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-12 11:21:31 +08:00
zzhxxx
36d74aba58 [Doc][fix] Fix the title of the document for the layer_sharding feature (#5759)
### What this PR does / why we need it?
Fix the title of the document for the layer_sharding feature

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2026-01-09 14:15:22 +08:00
zhenwenqi2024
97f6be8108 [feature]dcp&pcp support mlapo (#5672)
### What this PR does / why we need it?
mlapo in deepseek is a huge performance improvement in decode, this pr
support pcp & dcp with mlapo

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-08 23:49:23 +08:00
zzhxxx
f7db812ed7 [refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181)
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration : The feature is opt-in via the additional_config
argument:
```
--additional-config '{
  "layer_sharding": ["o_proj", "q_b_proj"]
}'
```

This is orthogonal to standard tensor parallelism and weight replication
strategies. It is treated as a separate, explicit feature.It can be used
in any scenario, combined with the
flashcomm2https://github.com/vllm-project/vllm-ascend/pull/3232 feature
or the ShardedCP #4702 feature, to achieve significant performance.



- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2026-01-08 09:05:02 +08:00
lilinsiman
52863c4165 [Refactor][EAGLE] 2/N: load model and generate token (#5437)
### What this PR does / why we need it?
1. Refactor eagle and mtp function: load_model and generate_token_ids
2. Remove redundant code in mtp and eagle file
3. Refactor the UT of file

2/N of Refactor and merge mtp and eagle
Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut and tests

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-01-05 14:07:54 +08:00
InSec
7cf65d0581 [Doc]modify the quantization user guide and add a quantization adaptation developer guide (#5554)
### What this PR does / why we need it?
This PR makes the following modifications:
1.delete the `user_guide/feature_guide/quantization-llm-compressor.md`
and merge it into `user_guide/feature_guide/quantization.md`.
2.update the content of `user_guide/feature_guide/quantization.md`.
3.add guidance `developer_guide/feature_guide/quantization.md' on the
adaptation of quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
2026-01-05 09:12:11 +08:00
baxingpiaochong
46c2fc6a3c [KVPOOL]decode save kvcache (#5168)
### What this PR does / why we need it?

kvpool decode save kvcache
now only support mla

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
2026-01-04 22:22:01 +08:00
fems14
2ef4d1979e [bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398)
### What this PR does / why we need it?
1.KV Pool for KV Transfer in PD Disaggregation Scenarios Error
Resolution
2.Update KV Pool Documentation

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-12-27 09:53:57 +08:00
zhangsicheng5
8ed87dfa84 [doc] Add context parallel user guide (#5358)
1. Add context parallel user guide
2. Add context parallel related message in supported features/models
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-26 17:03:47 +08:00
lvjunqi
55beac9c91 [Feat]Xlite Qwen3-vl Support (#5228)
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.

### How was this patch tested?
vLLM version: v0.12.0 

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
2025-12-22 16:30:52 +08:00
zzhxxx
17f2eead99 [Doc]Add the user_guide doc file regarding fine-grained TP. (#5084)
### What this PR does / why we need it?
Add user guide for **Fine-Grained Tensor Parallelism** feature.  
Documents usage, supported components (`embedding`, `lm_head`, `o_proj`,
`mlp`/`dense_ffn`), model compatibility, and deployment guidelines.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-19 16:37:25 +08:00
zxr2333
073a3a6e6c [Doc][P/D] Fix MooncakeConnector's name (#5172)
### What this PR does / why we need it?
vLLM community has integrated their MooncakeConnector. The original
scripts will now find this MooncakeConnector instead of the one from
vLLM-Ascend. All scripts that involve using the MooncakeConnector need
to be modified to another name.

### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-12-18 22:29:19 +08:00
liziyu
190ae55e9f Add a Mooncake installation tutorial for kv pool and update Mooncake installation tutorial (#5069)
### What this PR does / why we need it?
Add a Mooncake installation tutorial for kv pool and update Mooncake
installation tutorial

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-16 19:53:23 +08:00
zhaomingyu13
039cc65e58 [Doc] Add user guide of speculative decoding (#5074)
### What this PR does / why we need it?
Add user guide of speculative decoding that includes n-grams, EAGLE,
MTP, and suffix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2025-12-16 17:01:44 +08:00