### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace `SamplingParams` argument usage from `max_completion_tokens`
to `max_tokens`(**Offline** inference currently **does not support** the
"max_completion_tokens")
``` bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
- refresh **Qwen3-Omni-30B-A3B-Thinking** recommended environment
variable
``` bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
``` bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
- fix **Qwen3-reranker** example usage to match the current **pooling
runner** interface and score output access
``` python
model = LLM(
model=model_name,
task="score", # need fix
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
--->
``` python
model = LLM(
model=model_name,
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
- modify **PaddleOCR-VL** parameter `TASK_QUEUE_ENABLE` from `2` to `1`
``` bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users when following the tutorials
directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
Upgrade vllm commit to 0318.
Main content: Added a pre-operation for cleaning up and waiting(default
max 50s) for the completion of the clean up of the NPU memory to some
test cases that failed due to the failure to release the NPU memory in a
timely manner when the previous test cases were executed.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Fix issues in the GLM4.7 documentation and add some missing
explanations.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
document test
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Revise the KV Pool user guide:
4. Revise parameters for Memcache for better clarity, at notification
that currently heterogeneous protocol setting is not supported (e.g.
enable `device_rdma` and `device_sdma` at the same time, a example
scenario would be data transfer by memcache across different super pods)
5. Modify the condition for Mooncakestore warmup, warmup is now needed
only when `ASCEND_BUFFER_POOL` is enabled.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
### What this PR does / why we need it?
remove deprecated environment variables related to MLP prefetching
### Does this PR introduce _any_ user-facing change?
yes, the deprecated env vars can not be used then.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise Mooncake environment variables and kvconnector extra configs.
2. Delete `use_ascend_direct` in kv connector extra config as it is
deprecated
3. Delete `kv_buffer_device` and `kv_rank` in P2P mooncake config
4. Unifies default `max-model-len` and `max-num-batch-tokens` in
examples given.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
### What this PR does / why we need it?
This PR adds a new CI log summarizer, `ci_log_summary.py`, and wires it
into unit-test and e2e workflows so failed jobs publish a structured
failure summary to the GitHub step summary.
Examples:
- `python3 .github/workflows/scripts/ci_log_summary.py --log-file
/tmp/unit-test.log --mode ut --step-name "Unit test"`
- `python3 .github/workflows/scripts/ci_log_summary.py --run-id
23127187822 --format json`
A maintenance note is added to `ci_utils.py` to clarify that the `START`
/ `PASSED` / `FAILED (exit code X)` log lines are parsed by
`ci_log_summary.py`, so any future format changes must be coordinated
with the corresponding summarizer regexes.
🤖 Generated with [Codex]<noreply@openai.com>
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: meihanc <jcccx.cmh@gmail.com>
Co-authored-by: Codex <noreply@openai.com>
### What this PR does / why we need it?
1.fix "TypeError: get_attn_backend() remove variable": [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)
2.fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)
3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace
memory related torch.cuda
APIs"](https://github.com/vllm-project/vllm/pull/37031)
4.fix [Support multiple KV groups in OffloadingSpec
](https://github.com/vllm-project/vllm/pull/36610) removed
self.offloaded_block_size and changed self.gpu_block_size from a scalar
to a tuple of per-group block sizes, adding block_size_factor.
5.fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed
get_eagle3_aux_hidden_state_layers() to
get_eagle3_default_aux_hidden_state_layers() and added a
supports_eagle3() guard before calling it.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
### What this PR does / why we need it?
1. Add nightly test on MiniMax-M2.5 with deployment method on A3
2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
Documented an issue in the 2-node PD mixed deployment scenario where
inference may hang when concurrency exceeds 8.(GLM5)
Noted that the issue has been fixed in PR:
- #7235
- #7290.
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Updated the DSV32 document.
1. Changed the PD separation boot mode to layerwise.
2. Changed max-num-batched-tokens to a multiple of the TP to avoid
triggering a verification error.
3. Added a link to help users adjust the configuration.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
Upload doc for qwen3.5-27B and qwen3.5-397B-A17B on Ascend
Base on vllm-ascend:v0.17.0rc1
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
### What this PR does / why we need it?
This pull request adds the release notes for `v0.17.0rc1`. It also
updates version numbers across various documentation files, including
`README.md`, `README.zh.md`,
`docs/source/community/versioning_policy.md`, and `docs/source/conf.py`
to reflect the new release.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
## Summary
- Add `USE_MODELSCOPE_HUB=0` to both Online and Offline lm-eval sections
- Add explanatory notes about Docker containers launching with
`VLLM_USE_MODELSCOPE=True`
The Docker containers set `VLLM_USE_MODELSCOPE=True`, which causes
lm-eval to download datasets from ModelScope instead of HuggingFace,
resulting in "Repo not exists" errors. Setting `USE_MODELSCOPE_HUB=0`
disables this behavior.
Fixes#607
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
## Summary
- Replace `full_cuda_graph: 1` with `cudagraph_mode: FULL_DECODE_ONLY`
in both single-NPU and multi-NPU examples
- `full_cuda_graph` is deprecated and falls back to `NONE` on NPU
Fixes#4696
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
### What this PR does / why we need it?
- Clarify that `SOC_VERSION` must be set when building from source in a
CPU-only environment where `npu-smi` is unavailable.
- Add concrete `SOC_VERSION` examples (A2/A3/300I/A5) and point users to
`Dockerfile*` defaults.
- Improve the `setup.py` error message so users get actionable guidance
when `SOC_VERSION` is missing.
Fixes#6816.
### Does this PR introduce _any_ user-facing change?
- Yes. Documentation is updated and the build-time error message is more
informative.
### How was this patch tested?
- (Local) Syntax check: `python -m compileall setup.py`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: NJX-njx <3771829673@qq.com>
### What this PR does / why we need it?
LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.
We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.
### Does this PR introduce _any_ user-facing change?
Users can specify the kvconnector using `--kv-transfer-config`, allowing
them to freely choose which kvconnector to use, without any user-facing
change.
### How was this patch tested?
Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: chloroethylene <jjysama@gmail.com>
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
add Ascend PyTorch Profiler section
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Documentation Format Checks
Technical Content Validation
Build Verification
Version Compatibility
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: herizhen <1270637059@qq.com>
### What this PR does / why we need it?
Modify the `kv_port` and `engine_id` config of DeepSeek-V3.1/R1 in the
2P1D scenario
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
This PR adds comprehensive documentation for the CPU binding feature on
Ascend NPUs. It includes:
- A detailed developer guide
(`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering
the design, internal logic, allocation examples, and troubleshooting for
the CPU binding mechanism.
- A concise user guide
(`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the
core concepts, usage, and common issues for end-users.
- An update to `additional_config.md` to use consistent terminology for
binding strategies (`global-slicing` and `topo-affinity`).
This documentation is needed to help both developers and users
understand, use, and debug the CPU binding feature, which is critical
for performance on ARM+Ascend platforms.
### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update.
### How was this patch tested?
The documentation has been reviewed for clarity and technical accuracy.
The examples and descriptions align with the implementation in
`vllm_ascend/cpu_binding.py`.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: c00818886 <chenchuwei@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
## What this PR does / why we need it?
Fixes the broken URL for chunked-prefill in the supported features
documentation page.
The chunked prefill documentation URL was moved from
`performance/optimization.html` to `configuration/optimization.html` in
upstream vLLM docs. This PR updates the link to point to the correct
location.
**Before**:
https://docs.vllm.ai/en/stable/performance/optimization.html#chunked-prefill
(404)
**After**:
https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill
(working)
## Does this PR introduce _any_ user-facing change?
Yes - fixes a broken documentation link that users encounter when
clicking 'Chunked Prefill' in the supported features page.
## How was this patch tested?
- Verified the new URL resolves correctly
- Documentation change only
Closes#4217
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: NJX-njx <3771829673@qq.com>
### What this PR does / why we need it?
- move llms.txt under docs/source and publish it at /llms.txt via
html_extra_path
- rewrite llms.txt to an LLM-friendly link index
- use _sources markdown links and include missing entry points such as
FAQs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR updates the GLM4.x documentation by adding multi-node like 2 ×
Atlas 800 A2 (64G × 8) deployment tutorial.
- **What changed**: Added instructions for deploying GLM-4.X models
across multiple nodes, including environment variables and example
commands.
- **Why needed**: Although the previous tutorial stated that multi-node
deployment on Atlas 800 A2 (64GB × 8) is **not recommended**, but we
still face some situation that must deploy GLM-4.7 on 2 × Atlas 800 A2
(64G × 8). And we successfully run GLM-4.7 on 2 nodes and it works fine,
so we think it might be the time to update this part.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Verified that the new documentation renders correctly in Markdown
format.
- Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8)
to ensure the commands work as described.
- Confirmed that existing GLM4.x documentation links and structure
remain intact.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: ZKSU <zksu@outlook.com>
### What this PR does / why we need it?
This PR updates the documentation for running vLLM on Atlas 300I series
(310p) hardware. It adds a warning to explicitly set `--max-model-len`
to prevent potential Out-of-Memory (OOM) errors that can occur with the
default configuration.
The example commands and Python scripts for online and offline inference
have been updated to:
- Include `--max-model-len 4096` (or `max_model_len=4096`).
- Remove the `compilation-config` parameter, which is no longer
necessary for 310p devices.
These changes ensure users have a clearer and more stable experience
when using vLLM on Atlas 300I hardware.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
The changes are to documentation and do not require testing.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
This PR refactors the `tools/collect_user_first_contribution.sh` script
to improve how we track and update our contributors list.
Key changes include:
- **Incremental Updates**: The script can now perform incremental
updates by storing and reading the last processed commit hash from
`docs/source/community/contributors.md`. This is much more efficient
than re-processing all commits every time.
- **Full Refresh Option**: A `--full` flag is added to allow forcing a
full recalculation of all contributors, useful for correcting errors or
initial setup.
- **Improved Usage**: Replaced positional arguments with command-line
flags (`--repo`, `--file`, `--full`) for better usability and clarity.
- **Robust Contributor-ID detection**: Improved logic to find a
contributor's GitHub login, including a fallback to parse it from
`noreply` email addresses.
- **In-place File Updates**: The script now directly updates the
`contributors.md` file with new contributors and correct numbering,
automating the entire process.
These changes make the process of maintaining the contributors list more
automated, reliable, and efficient.
### Does this PR introduce _any_ user-facing change?
No, this only changes a developer tool and does not affect the vLLM
library's public API or behavior.
### How was this patch tested?
The script can be tested locally by running it against the repository.
For an incremental update:
`GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh`
For a full refresh:
`GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh
--full`
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Update Memcache local service config example: increase default world
size to 256 and update the description for better clarity.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
### What this PR does / why we need it?
To intuitively show the effect of the eplb algorithm, we print the
expert heat before and after eplb.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: fems14 <1804143737@qq.com>
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.
This is the first step. In the future we should remove the class
register and just add few Ascend specified method once the arch in vLLM
is ready.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR add docs of batch invariance and make some extra operators
according to validation result.
please see https://github.com/vllm-project/vllm-ascend/issues/5487 to
track progress.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
## What this PR does / why we need it?
Fixes several documentation issues in the msprobe debugging guide as
reported in #6065:
1. **Remove unnecessary `cat` heredoc wrapper**: The example
configuration section used a `cat <<'JSON'` bash wrapper around the JSON
config. Simplified to a plain JSON code block.
2. **Fix duplicate chapter numbering**: Two sections were both numbered
'2'. Renumbered sections sequentially (0-6).
3. **Fix msprobe command**: Changed `msprobe graph_visualize` to
`msprobe -f pytorch graph` in section 5.2 Visualization.
4. **Remove backward-related content**: Since vllm is inference-only (no
training), removed all backward pass references including backward
tensor examples, parameter gradient examples, and backward descriptions
from dump.json explanations.
## Does this PR introduce _any_ user-facing change?
Documentation improvement only. No code changes.
## How was this patch tested?
Manual review of the markdown file to verify all 4 issues from #6065 are
addressed.
Closes#6065
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: NJX-njx <3771829673@qq.com>
### What this PR does / why we need it?
Add Experimental supported model/feature for supported_models.md
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with new added test.
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
add 310P3 guidance of PaddleOCR-VL model, refresh PaddleOCR-VL.md in the
docs/source/tutorials/
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: zouyizhou <zouyizhou@huawei.com>
### What this PR does / why we need it?
This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
N/A (documentation change).
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
## What this PR does / why we need it?
This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.
### Changes
- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links
## Does this PR introduce _any_ user-facing change?
Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.
## How was this patch tested?
Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.
---
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>