Commit Graph

606 Commits

Author SHA1 Message Date
bazingazhou233-hub
9e6c547d98 [Doc] Replace deprecated full_cuda_graph with cudagraph_mode in Qwen2.5-Omni (#7286)
## Summary
- Replace `full_cuda_graph: 1` with `cudagraph_mode: FULL_DECODE_ONLY`
in both single-NPU and multi-NPU examples
- `full_cuda_graph` is deprecated and falls back to `NONE` on NPU

Fixes #4696
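The replacement can be sketched as a compilation-config fragment (field names are taken from this PR's description; the exact CLI shape shown in the comment is an assumption):

```python
import json

# Deprecated form: on NPU, `full_cuda_graph` now falls back to cudagraph
# mode NONE, so full-graph capture is silently lost.
old_compilation_config = {"full_cuda_graph": 1}

# Replacement used in the updated Qwen2.5-Omni examples.
new_compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}

# Typically passed on the CLI as a JSON string:
#   vllm serve ... --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
cli_arg = json.dumps(new_compilation_config)
print(cli_arg)
```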
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
2026-03-14 22:38:36 +08:00
NJX
bb506a1c99 [Doc][Installation] Clarify SOC_VERSION for CPU-only source builds (#7278)
### What this PR does / why we need it?
- Clarify that `SOC_VERSION` must be set when building from source in a
CPU-only environment where `npu-smi` is unavailable.
- Add concrete `SOC_VERSION` examples (A2/A3/300I/A5) and point users to
`Dockerfile*` defaults.
- Improve the `setup.py` error message so users get actionable guidance
when `SOC_VERSION` is missing.
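The kind of actionable check described above can be sketched as follows (a minimal sketch: the function name and error wording are illustrative, not the actual `setup.py` code; `Ascend910B1` is just an example value):

```python
import os
import shutil

def resolve_soc_version() -> str:
    """Hypothetical sketch of the SOC_VERSION resolution described above.

    On a machine with NPUs, `npu-smi` can report the SoC type; in a
    CPU-only build environment it is absent, so SOC_VERSION must be set
    explicitly (see the Dockerfile* defaults in the repo for concrete
    values for A2/A3/300I/A5).
    """
    soc = os.environ.get("SOC_VERSION")
    if soc:
        return soc
    if shutil.which("npu-smi") is None:
        raise RuntimeError(
            "SOC_VERSION is not set and npu-smi is unavailable. "
            "When building from source on a CPU-only machine, export "
            "SOC_VERSION explicitly, e.g. `export SOC_VERSION=Ascend910B1`."
        )
    # With npu-smi available, the real setup.py can query the device here.
    return "<queried from npu-smi>"
```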

Fixes #6816.

### Does this PR introduce _any_ user-facing change?
- Yes. Documentation is updated and the build-time error message is more
informative.

### How was this patch tested?
- (Local) Syntax check: `python -m compileall setup.py`.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-14 22:38:25 +08:00
Junyuan
6852a2e267 [feat] add LMCacheAscendConnector (#6882)
### What this PR does / why we need it?

LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.

We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.

### Does this PR introduce _any_ user-facing change?

Users can now select this connector via `--kv-transfer-config`, letting
them freely choose which KV connector to use; behavior is unchanged
unless they opt in.

### How was this patch tested?

Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`
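For reference, the flag value above is plain JSON and can be built programmatically (a minimal sketch; only the connector and role names come from this PR):

```python
import json

# KV-transfer configuration from the test command above; the connector
# and role names are taken verbatim from this PR's description.
kv_transfer_config = {
    "kv_connector": "LMCacheAscendConnector",
    "kv_role": "kv_both",
}

# Passed to vLLM as: vllm serve ... --kv-transfer-config '<json>'
flag_value = json.dumps(kv_transfer_config)
print(flag_value)
```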

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: chloroethylene <jjysama@gmail.com>
2026-03-13 17:41:35 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
shaopeng-666
592661e787 [Doc] EPD doc and load-balance proxy example (#6221)
Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2026-03-12 16:17:17 +08:00
herizhen
e5024d0264 [doc] Add Ascend PyTorch Profiler section (#7117)
### What this PR does / why we need it?
add Ascend PyTorch Profiler section

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Documentation Format Checks
Technical Content Validation
Build Verification
Version Compatibility
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: herizhen <1270637059@qq.com>
2026-03-12 15:51:00 +08:00
MengLong Chen
bbffe58b63 [Doc] fix DSV3.1 PD configs (#7187)
### What this PR does / why we need it?
Modify the `kv_port` and `engine_id` config of DeepSeek-V3.1/R1 in the
2P1D scenario

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2026-03-12 14:24:49 +08:00
Canlin Guo
a78a00e0b1 [Doc][ReleaseNote] Add release notes for v0.16.0rc1 (#7067)
Add release notes for v0.16.0rc1

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-10 22:45:05 +08:00
Frank Chen
14c71b19e1 [Doc][CPU binding] Add user/developer guide for CPU binding (#7045)
### What this PR does / why we need it?
This PR adds comprehensive documentation for the CPU binding feature on
Ascend NPUs. It includes:

- A detailed developer guide
(`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering
the design, internal logic, allocation examples, and troubleshooting for
the CPU binding mechanism.
- A concise user guide
(`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the
core concepts, usage, and common issues for end-users.
- An update to `additional_config.md` to use consistent terminology for
binding strategies (`global-slicing` and `topo-affinity`).

This documentation is needed to help both developers and users
understand, use, and debug the CPU binding feature, which is critical
for performance on ARM+Ascend platforms.

### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update.

### How was this patch tested?
The documentation has been reviewed for clarity and technical accuracy.
The examples and descriptions align with the implementation in
`vllm_ascend/cpu_binding.py`.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: c00818886 <chenchuwei@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-03-10 15:59:31 +08:00
NJX
bb7ed759d4 [Doc] Fix broken chunked-prefill URL in supported features (#6963)
## What this PR does / why we need it?

Fixes the broken URL for chunked-prefill in the supported features
documentation page.

The chunked prefill documentation URL was moved from
`performance/optimization.html` to `configuration/optimization.html` in
upstream vLLM docs. This PR updates the link to point to the correct
location.

**Before**:
https://docs.vllm.ai/en/stable/performance/optimization.html#chunked-prefill
(404)
**After**:
https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill
(working)

## Does this PR introduce _any_ user-facing change?

Yes - fixes a broken documentation link that users encounter when
clicking 'Chunked Prefill' in the supported features page.

## How was this patch tested?

- Verified the new URL resolves correctly
- Documentation change only

Closes #4217
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-10 10:10:07 +08:00
Yikun Jiang
326fd359aa [Docs] add and publish llms.txt for LLM discovery (#6886)
### What this PR does / why we need it?
- move llms.txt under docs/source and publish it at /llms.txt via
html_extra_path
- rewrite llms.txt to an LLM-friendly link index
- use _sources markdown links and include missing entry points such as
FAQs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2026-03-10 10:06:27 +08:00
ZKSU
bdad11e9a8 [doc] Update GLM4.x.md, add GLM4.x multi-node deploy tutorial (#6872)
### What this PR does / why we need it?

This PR updates the GLM4.x documentation by adding a multi-node
deployment tutorial, e.g. for 2 × Atlas 800 A2 (64G × 8).

- **What changed**: Added instructions for deploying GLM-4.X models
across multiple nodes, including environment variables and example
commands.
- **Why needed**: Although the previous tutorial stated that multi-node
deployment on Atlas 800 A2 (64GB × 8) is **not recommended**, there are
still situations where GLM-4.7 must be deployed on 2 × Atlas 800 A2
(64G × 8). We successfully ran GLM-4.7 on 2 nodes and it works fine, so
we think it is time to update this part.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Verified that the new documentation renders correctly in Markdown
format.
- Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8)
to ensure the commands work as described.
- Confirmed that existing GLM4.x documentation links and structure
remain intact.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: ZKSU <zksu@outlook.com>
2026-03-10 10:01:53 +08:00
Shaoxu Cheng
ba1c82e758 [DOC] Add explanation of the 310P-specific param: max-model-len (#7065)
### What this PR does / why we need it?

This PR updates the documentation for running vLLM on Atlas 300I series
(310p) hardware. It adds a warning to explicitly set `--max-model-len`
to prevent potential Out-of-Memory (OOM) errors that can occur with the
default configuration.

The example commands and Python scripts for online and offline inference
have been updated to:
- Include `--max-model-len 4096` (or `max_model_len=4096`).
- Remove the `compilation-config` parameter, which is no longer
necessary for 310p devices.

These changes ensure users have a clearer and more stable experience
when using vLLM on Atlas 300I hardware.

### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.

### How was this patch tested?
The changes are to documentation and do not require testing.


- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-09 16:54:43 +08:00
wangxiyuan
482d39c1b0 [community] update contributors list and refresh tool (#7072)
### What this PR does / why we need it?
This PR refactors the `tools/collect_user_first_contribution.sh` script
to improve how we track and update our contributors list.

Key changes include:
- **Incremental Updates**: The script can now perform incremental
updates by storing and reading the last processed commit hash from
`docs/source/community/contributors.md`. This is much more efficient
than re-processing all commits every time.
- **Full Refresh Option**: A `--full` flag is added to allow forcing a
full recalculation of all contributors, useful for correcting errors or
initial setup.
- **Improved Usage**: Replaced positional arguments with command-line
flags (`--repo`, `--file`, `--full`) for better usability and clarity.
- **Robust Contributor-ID detection**: Improved logic to find a
contributor's GitHub login, including a fallback to parse it from
`noreply` email addresses.
- **In-place File Updates**: The script now directly updates the
`contributors.md` file with new contributors and correct numbering,
automating the entire process.

These changes make the process of maintaining the contributors list more
automated, reliable, and efficient.

### Does this PR introduce _any_ user-facing change?
No, this only changes a developer tool and does not affect the vLLM
library's public API or behavior.

### How was this patch tested?
The script can be tested locally by running it against the repository.
For an incremental update:
`GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh`

For a full refresh:
`GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh
--full`

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-09 15:19:35 +08:00
pz1116
a7820d20f4 [Doc][KV Pool]Update Memcache local service config example: increase default world size to 256 and update description (#7025)
### What this PR does / why we need it?
Update Memcache local service config example: increase default world
size to 256 and update the description for better clarity.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-03-06 10:23:55 +08:00
LI SHENGYONG
ccd00798f3 [EPLB] Display the expert hotness comparison before and after eplb. (#6877)
### What this PR does / why we need it?
To intuitively show the effect of the EPLB algorithm, we print the
expert hotness before and after EPLB.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![Snipaste_2026-02-28_17-23-42](https://github.com/user-attachments/assets/db1dadd1-cf96-44da-af34-57d41ccf412f)


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-06 09:53:29 +08:00
SILONG ZENG
bd571cf6d6 [Main2Main] Upgrade vLLM to 0303 (#6944)
### What this PR does / why we need it?
Breaking changes:
- https://github.com/vllm-project/vllm/pull/34102 
Disable_full param replaced with valid_modes/invalid_modes API
- https://github.com/vllm-project/vllm/pull/35503
Now must return float compilation_time
- https://github.com/vllm-project/vllm/pull/35564
New sequence_lengths param added
- https://github.com/vllm-project/vllm/pull/33807
A check was performed (if runner_backend != "auto")
- https://github.com/vllm-project/vllm/pull/34861
`BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to
check process group state
- https://github.com/vllm-project/vllm/pull/35274

**Important change:**
- https://github.com/vllm-project/vllm/pull/28672

`matcher_utils` directly accesses `torch.ops._C.*` during the import
phase. In the Ascend environment, some unregistered ops trigger
`AttributeError`, causing e2e initialization failure.

https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323

https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29

This PR adds temporary compatibility placeholders (rms_norm,
fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant,
silu_and_mul) to
`vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to
ensure no crashes during the import phase. Upstream repairs will be
considered later.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
2026-03-06 09:08:52 +08:00
fems14
ae394767d4 [main] ADXL/HIXL supports FabricMem Mode (#6806)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-03-05 21:04:11 +08:00
wangxiyuan
13777bf3f0 [Spec Decode]clean up spec decode interface (#6947)
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.

This is the first step. In the future we should remove the class
registration and just add a few Ascend-specific methods once the
architecture in vLLM is ready.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:30:10 +08:00
Ronald
77e009d9fc [Feature] Add docs of batch invariance and make some extra operators patch (#6910)
### What this PR does / why we need it?

This PR adds docs on batch invariance and patches some extra operators
based on validation results.
Please see https://github.com/vllm-project/vllm-ascend/issues/5487 to
track progress.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-05 09:12:40 +08:00
NJX
c7fd7a25f7 [Doc][Misc] Fix msprobe_guide.md documentation issues (#6965)
## What this PR does / why we need it?

Fixes several documentation issues in the msprobe debugging guide as
reported in #6065:

1. **Remove unnecessary `cat` heredoc wrapper**: The example
configuration section used a `cat <<'JSON'` bash wrapper around the JSON
config. Simplified to a plain JSON code block.
2. **Fix duplicate chapter numbering**: Two sections were both numbered
'2'. Renumbered sections sequentially (0-6).
3. **Fix msprobe command**: Changed `msprobe graph_visualize` to
`msprobe -f pytorch graph` in section 5.2 Visualization.
4. **Remove backward-related content**: Since vllm is inference-only (no
training), removed all backward pass references including backward
tensor examples, parameter gradient examples, and backward descriptions
from dump.json explanations.

## Does this PR introduce _any_ user-facing change?

Documentation improvement only. No code changes.

## How was this patch tested?

Manual review of the markdown file to verify all 4 issues from #6065 are
addressed.

Closes #6065
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-04 10:28:31 +08:00
zzzzwwjj
f19f7b1fe2 [doc] fix supported_models (#6930)
### What this PR does / why we need it?

Add Experimental supported model/feature for supported_models.md

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-03-03 09:47:50 +08:00
Xiaoshuang Wang
f7a8befc20 [CI] Upgrade CANN to 8.5.1 (#6897)
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-03-03 09:02:42 +08:00
whx
16c879cdf7 [Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518)
### What this PR does / why we need it?
Adds a muls_add Triton kernel with a related fusion pass. In addition,
this PR refactors `AscendCompilationConfig` and deletes
`NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-02 17:54:25 +08:00
zyz111222
81fb7d5779 [Doc] add 310P3 guidance of PaddleOCR-VL (#6837)
### What this PR does / why we need it?
Add 310P3 guidance for the PaddleOCR-VL model and refresh
PaddleOCR-VL.md in docs/source/tutorials/.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by CI

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zouyizhou <zouyizhou@huawei.com>
2026-02-28 16:03:07 +08:00
wangxiyuan
3d563292f3 clean 0.15.0 support (#6852)
Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-28 09:20:57 +08:00
wangxiyuan
9cd0d6c33d [Doc][Misc] Update release notes for v0.15.0rc1 (#6859)
### What this PR does / why we need it?

This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only update.

### How was this patch tested?

N/A (documentation change).

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 22:35:09 +08:00
Canlin Guo
e4458b2d2b [Main2Main] Upgrade vLLM to 0226 (#6813)
### What this PR does / why we need it?

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-02-27 16:05:21 +08:00
starmountain1997
80316c5824 [DOC] enable both flashcomm1 and cudagraph (#6807)
## What this PR does / why we need it?

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

### Changes

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links
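The configuration pieces listed above can be sketched as follows (values copied from the bullets; the exact CLI shape is an assumption):

```python
import json

# Environment toggle from the bullets above: enable FlashComm1.
env = {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"}

# Layer sharding, passed as --additional-config '<json>'.
additional_config = {"layer_sharding": ["q_b_proj", "o_proj"]}

# New, coarser cudagraph capture sizes: multiples of 8 up to 48,
# replacing the old multiples-of-3 list.
capture_sizes = list(range(8, 49, 8))

print(json.dumps(additional_config), capture_sizes)
```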

## Does this PR introduce _any_ user-facing change?

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

## How was this patch tested?

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-27 14:52:55 +08:00
wangxiyuan
3d43ed997e add release note for 0.15.0rc1 (#6839)
Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 11:55:55 +08:00
wangxiyuan
a95c0b8b82 [Doc] fix the nit in docs (#6826)
Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 11:50:27 +08:00
realliujiaxu
5def28dcd3 [Feat] support sequence parallelism bypass for VL models (#5632) 2026-02-27 08:27:41 +08:00
starmountain1997
2260af405f [DOC] add request forwarding (#6780)
### What this PR does / why we need it?

- New section: "Request Forwarding" documentation in
docs/source/tutorials/models/DeepSeek-V3.2.md
- Environment fix: Changed VLLM_ASCEND_ENABLE_FLASHCOMM1 from 0 to 1 in
the DeepSeek-V3 configuration examples

### Does this PR introduce _any_ user-facing change?

Documentation update only - provides new configuration guidance for
request forwarding setups

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-25 14:43:51 +08:00
Frank Chen
3da2ba22eb [Platform] Enable ARM-only CPU binding with NUMA-balanced A3 policy and update docs/tests (#6686)
### What this PR does / why we need it?

- Keeps enable_cpu_binding default on, but skips binding on non‑ARM CPUs
inside bind_cpus, with a clear log.
- Uses a table-driven binding policy: A3 uses NUMA‑balanced binding;
other device types use NUMA‑affinity binding.
- Updates docs to reflect the exact behavior and adds/updates unit tests
for the new logic.

### Does this PR introduce _any_ user-facing change?

- Yes. CPU binding is now enabled by default via additional_config, and
documented in the user guide.
- CPU binding behavior differs by device type (A3 vs. others).

### How was this patch tested?

Added/updated unit tests:

test_cpu_binding.py
1.   test_binding_mode_table covers A2 vs A3 binding mode mapping.
2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when
affinity info is missing.
3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch
detection.
4.   test_bind_cpus_skip_non_arm covers non‑ARM skip path in bind_cpus.

test_worker_v1.py
1. Updated mocks for enable_cpu_binding default True to align with new
config default.

- vLLM version: v0.14.1
- vLLM main: d7de043

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-02-25 11:15:14 +08:00
Icey
ee59429015 upgrade main to 0212 (#6712)
### What this PR does / why we need it?
Fixes `transformers_utils/processors/__init__` import error, due to
https://github.com/vllm-project/vllm/pull/33247
Fixes Fused MoE break introduced by `MoERunner abstraction,` due to
https://github.com/vllm-project/vllm/pull/32344

> delete `AscendMoERunner` when
https://github.com/vllm-project/vllm/pull/35178 is merged

Fixes `Make Qwen3VL compatible with Transformers v5`, due to
https://github.com/vllm-project/vllm/pull/34262

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-25 09:17:29 +08:00
zzzzwwjj
5c8ab7af39 [main]update release note & support matrix (#6759)
### What this PR does / why we need it?

Update release note & support matrix to add experimental tag for
features and models.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

0.13.0 branch: https://github.com/vllm-project/vllm-ascend/pull/6751

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-02-24 17:39:35 +08:00
yydyzr
70e26551cf [Doc] modify glm doc (#6770)
### What this PR does / why we need it?
1. Add a description of another version of the GLM5-W4A8 weights.
2. Update the installation introduction.
3. Introduce a script to enable BF16 MTP.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
N/A
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>
2026-02-14 16:47:23 +08:00
SILONG ZENG
e2237819a9 [CI]Fixed the spell check function in typos.toml (#6753)
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` is a
character class, not an alternation, so it actually ignores all words
containing any one of the characters `U, E, 4, M, 3, |, u, e, m`

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
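The pitfall can be demonstrated with Python's `re` module: square brackets define a character class, so the original pattern ignored any identifier containing one of those characters, while a parenthesized group matches only the intended alternatives:

```python
import re

# Square brackets define a character CLASS: this matches any identifier
# containing one of the characters U, E, 4, M, 3, |, u, e, m.
broken = re.compile(r".*[UE4M3|ue4m3].*")
assert broken.fullmatch("mule")        # contains 'm' -> wrongly ignored
assert broken.fullmatch("fp4_kernel")  # contains '4' -> wrongly ignored

# Parentheses define a GROUP with alternation: only the intended
# identifiers match.
fixed = re.compile(r".*(UE4M3|ue4m3).*")
assert fixed.fullmatch("quant_UE4M3_scale")
assert fixed.fullmatch("mule") is None
```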

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-14 11:57:26 +08:00
Cao Yi
6de207de88 [main][Docs] Fix typos across documentation (#6728)
## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-13 15:50:05 +08:00
taoyao1221
41d056f947 [doc] add A2 series doc for GLM5.md (#6717)
### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea
2026-02-12 16:08:17 +08:00
Canlin Guo
052cc4e61b [Docs] Fix GLM-5 deploy command (#6711)
This pull request refines the GLM-5 deployment documentation by updating
the Docker run command to include a more comprehensive set of device
mappings and by removing an extraneous quantization flag from the `vllm
serve` commands. These changes aim to correct and clarify the deployment
instructions, ensuring users can successfully set up and run the GLM-5
model as intended.


- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: Canlin Guo <961750412@qq.com>
2026-02-12 08:55:48 +08:00
iiiklw
a0315f6697 [npugraph_ex]enable npugraph_ex by default (#6664)
### What this PR does / why we need it?

This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).


### Does this PR introduce _any_ user-facing change?

Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.
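The opt-out described above can be sketched as an `additional_config` fragment (the JSON shape is assumed from the description):

```python
import json

# Opt out of the new default, per the description above; passed to vLLM
# via additional_config (or --additional-config '<json>' on the CLI).
additional_config = {
    "npugraph_ex_config": {
        "enable": False,
    }
}
print(json.dumps(additional_config))
```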

### How was this patch tested?

CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.

Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
2026-02-12 08:44:06 +08:00
rika
b86ea66b0a [doc]add GLM5.md (#6709)
### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: nakairika <982275964@qq.com>
2026-02-12 04:00:40 +08:00
Icey
88773bb101 [main to main] upgrade main 0210 (#6673)
### What this PR does / why we need it?
upgrade vllm commit to `9562912cead1f11e8540fb91306c5cbda66f0007`

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
all tests passed

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-11 18:10:14 +08:00
wangxiyuan
7d4833bce9 [Doc][Misc] Restructure tutorial documentation (#6501)
### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into
three categories: Models, Features, and Hardware. This improves the
organization and navigation of the tutorials, making it easier for users
to find relevant information.

- The single `tutorials/index.md` is split into three separate index
files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective
new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial
sections.

This change makes the documentation structure more logical and scalable
for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial
documentation pages. Users following old links to tutorials will
encounter broken links. It is recommended to set up redirects if the
documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built
and reviewed locally to ensure all links are correct and the pages
render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-10 15:03:35 +08:00
wangxiyuan
2a826b5fad [Misc] upgrade to vllm main (#6646)
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.

1. ac32e66cf9: the pass file was moved.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
2026-02-10 14:08:59 +08:00
Cao Yi
1c7d1163f5 [main][Docs] Fix spelling errors across documentation (#6649)
Fix various spelling mistakes in the project documentation to improve
clarity and correctness.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-10 11:14:57 +08:00
DreamerLeader
905f0764e0 [DOC]Add Memcache Usage Guide (#6476)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
2026-02-09 21:55:00 +08:00
Li Wang
d018aeb5fa [Image] Bump mooncake version to v0.3.8.post1 (#6428)
### What this PR does / why we need it?
This patch bumps the Mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested locally:
```python
>>> from mooncake.engine import TransferEngine
```
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-02-06 10:54:03 +08:00
wangxiyuan
c38166eefa [Doc] backport 0.13.0 release note (#6584)
### What this PR does / why we need it?
Backport 0.13.0 release note to main branch and update related doc link

### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
by doc CI

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-06 10:29:15 +08:00