Commit Graph

2869 Commits

Author SHA1 Message Date
aipaes
45f75b4178 [Doc][v0.18.0]Fix minimax2.5 readme (#8638)
### What this PR does / why we need it?
Correct the descriptive errors in the document.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
doc test

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-04-23 22:56:24 +08:00
wangxiaoteng888
c47371c1af [BugFix]Backport validate pd mode feature gates no fused mc2 v0.18.0 clean (#8583)
### What this PR does / why we need it?
Clean backport of the PD-mode feature-gate validation (no fused MC2) to
v0.18.0; backport of #8582.
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-04-23 19:44:07 +08:00
zzzzwwjj
1fc7bc056d [0.18.0][Doc] Add NPU soft partitioning + cudagraph.piecewise limitation (#8595)
### What this PR does / why we need it?

Documented the NPU soft partitioning + cudagraph.piecewise limitation in
the graph mode user guide.

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-04-23 19:09:55 +08:00
linfeng-yuan
5c048a9b71 [Doc][releases/v0.18.0] fix documentation error or non-standard description (#8626)
### What this PR does / why we need it?
Fix documentation errors and non-standard descriptions in the
releases/v0.18.0 branch.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation check.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-04-23 18:55:44 +08:00
sunshine202600
786eaf8b07 [Doc][Misc] Improve readability and fix typos in documentation (#8633)
### What this PR does / why we need it?

This PR improves the readability of the documentation by fixing typos,
correcting command extensions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation changes only.

Signed-off-by: sunshine202600 <sunshine202600@163.com>
2026-04-23 18:45:17 +08:00
jack
d81101acdd [releases/v0.18.0][Platform][BugFix] Guard forced tool choice with empty content (#8400)
### What this PR does / why we need it?

This backports the forced-tool-choice `content=None` guard to the
`releases/v0.18.0` compatibility layer.

Upstream vLLM still has forced named tool-choice branches that assert
`content is not None` after reasoning extraction. Some reasoning parsers
can legally consume the full output and return `(reasoning, None)`,
which makes the assert reachable and can surface as a server-side
failure.

This PR follows the same compatibility-patch pattern used by:
- `7314bbe2` fix(platform): reimplement MiniMax usage accounting patch
(#7835)
- `f83cb0e6` [Bugfix][Platform] Fix GLM47 tool-call finish backfill
(#7710)

The patch is intentionally narrow:
- normalize `content=None` to `""` only for forced named tool choice
- patch both chat-completions and responses parser entry points
- keep the rest of upstream behavior unchanged
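
A minimal sketch of that normalization, with hypothetical names (the
actual patch entry points differ):

```python
# Hypothetical guard, illustrative only: normalize a None content to ""
# for forced named tool choice, so the upstream
# `assert content is not None` branch cannot fire.
def guard_forced_tool_choice_content(content, forced_named_tool_choice):
    if forced_named_tool_choice and content is None:
        return ""  # reasoning parser consumed the whole output as reasoning
    return content
```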

Upstream tracking:
- issue: vllm-project/vllm#40147
- PR: vllm-project/vllm#40148

### Does this PR introduce _any_ user-facing change?

Yes.

Forced named tool choice becomes robust when the reasoning parser
returns no post-reasoning content, avoiding an internal assertion
failure and emitting an empty-argument function call instead.

### How was this patch tested?

Unit tests:
```bash
pytest -sv tests/ut/patch/platform/test_patch_tool_choice_none_content.py \
  tests/ut/patch/platform/test_patch_glm_tool_call_parser.py \
  tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
```

Result: 22 passed.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-04-23 16:46:10 +08:00
herizhen
ff76c6780e [releases/v0.18.0][Doc][Misc] Modifying Configuration Parameters (#8618)
### What this PR does / why we need it?
This PR renames the environment variable `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`
to `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT` to align with the Mooncake
connector naming convention. It also updates the documentation and test
configurations to reflect this change and adjusts the suggested timeout
value in the documentation to 480 seconds for consistency.
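
As a hedged sketch, a consumer of the renamed variable might read it like
this (the fallback to the old name is illustrative, not necessarily what
the connector does):

```python
import os

DEFAULT_ABORT_TIMEOUT_S = 480  # suggested value from the updated docs

# Prefer the new Mooncake name; the old NIXL name is checked only as an
# illustrative backward-compatibility fallback.
timeout_s = int(
    os.environ.get(
        "VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT",
        os.environ.get("VLLM_NIXL_ABORT_REQUEST_TIMEOUT",
                       str(DEFAULT_ABORT_TIMEOUT_S))))
```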

### Does this PR introduce _any_ user-facing change?
Yes. The environment variable for configuring the abort request timeout
has been renamed. Users should update their environment settings from
`VLLM_NIXL_ABORT_REQUEST_TIMEOUT` to `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`.

### How was this patch tested?
The changes were verified by updating the corresponding test
configuration files and ensuring consistency across the documentation.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
2026-04-23 16:23:31 +08:00
Frank Chen
ce92be29d2 [Doc] Clarify irqbalance service management (#8614)
### What this PR does / why we need it?

This PR clarifies the CPU binding documentation for managing the
`irqbalance` service.

The previous wording only mentioned Ubuntu while the command shown is
specific to systemd-based Linux distributions. This update describes the
command as applicable to Ubuntu and other systemd-based distributions,
and adds a note for non-systemd systems to use the distribution-specific
service-management command.

### Does this PR introduce _any_ user-facing change?

No. This is a documentation-only update and does not change vLLM or
vllm-ascend runtime behavior.

### How was this patch tested?

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-04-23 16:10:07 +08:00
Frank Chen
a4ba82e138 [BugFix] Require kv producer for layer sharding (#8563)
### What this PR does / why we need it?

This PR introduces stricter Ascend `additional_config.layer_sharding`
validation on the 0.18 release branch so it is only accepted on
PD-disaggregated P nodes with `kv_role="kv_producer"`.
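
A minimal sketch of the stricter check, assuming hypothetical argument
names (the real validation lives in vllm-ascend's config handling):

```python
def validate_layer_sharding(layer_sharding_enabled: bool,
                            kv_role: str | None) -> None:
    # Reject additional_config.layer_sharding anywhere but a
    # PD-disaggregated P node (kv_role="kv_producer").
    if layer_sharding_enabled and kv_role != "kv_producer":
        raise ValueError(
            "additional_config.layer_sharding requires a PD-disaggregated "
            "P node with kv_role='kv_producer'")
```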


### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

E2E test

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-04-23 16:06:53 +08:00
aipaes
4a254ba59a [Doc] [v0.18.0]Fix glm4.7 readme v18 (#8460)
### What this PR does / why we need it?
Update the GLM4.7 doc and fix configuration issues involving
`VLLM_ASCEND_ENABLE_FLASHCOMM1`, `VLLM_ASCEND_BALANCE_SCHEDULING`,
`VLLM_NIXL_ABORT_REQUEST_TIMEOUT`, etc.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
doc test

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-04-23 14:42:28 +08:00
liziyu
58c87bd15b [BugFix][0.18.0] Remove unused layers assignment in mooncake connector (#8602)
### What this PR does / why we need it?
Remove unused layers assignment in mooncake connector

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
by nightly

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-04-23 14:08:53 +08:00
pz1116
30d08ced2d [Doc][0.18.0] Fix kv pool CLI flag typo and formatting (#8608)
### What this PR does / why we need it?
Fix kv pool CLI flag typo and formatting

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-04-23 12:07:47 +08:00
vllm-ascend-ci
0c458aa6dc [v0.18.0][Doc] Translated Doc files 2026-04-22 (#8565)
## Auto-Translation Summary

Translated **43** file(s):

- `docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/faqs.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/installation.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po`

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24767290887)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
2026-04-23 11:06:05 +08:00
Yang Yuxi
9e31e4f234 [Doc]change format (#8592)
### What this PR does / why we need it?
Change `--compilation_config` to `--compilation-config`.
Change `--max-model-len 133008` to `--max-model-len 131072` to match 128k.

### Does this PR introduce _any_ user-facing change?
No

Signed-off-by: Yang Yuxi <907276627@qq.com>
2026-04-23 10:46:09 +08:00
wangx700
4020b3df60 [BugFix] fix tl.extract_slice and tl.insert_slice. (#8567)
### What this PR does / why we need it?
Replace `tl.extract_slice` and `tl.insert_slice` with `extract_slice` and
`insert_slice` from `torch_utils`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

---------

Signed-off-by: wangx700 <wangxin700@huawei.com>
2026-04-23 09:50:29 +08:00
liziyu
c3b1d409a9 [BugFix] [P/D] [CherryPick] 8540 In scenarios where TP is not equal, the KV cache at the MTP layer is not handled. (#8541)
### What this PR does / why we need it?
Fix the issue where the Mooncake connector does not handle the MTP-layer
KV cache when TP is unbalanced.
Backport of #8540.
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?
by nightly

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-04-23 09:16:37 +08:00
LQLlulu
fcf4d477a7 [BugFix][0.18.0]dispatch_ffn_combine kernal rollback combine 、unpermute part and scale part (#8534)
cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8539

### What this PR does / why we need it?
Based on end-to-end test results, three decode-scenario optimization
points have been reverted in the `dispatch_ffn_combine` kernel.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: l00893928 <liuquanlu@huawei.com>
Co-authored-by: l00893928 <liuquanlu@huawei.com>
2026-04-22 23:27:02 +08:00
pz1116
69a57bc9ec [Doc][0.18.0]Fix typo in KV Cache Pool developr guide (#8575)
### What this PR does / why we need it?
Fix a typo in the KV Cache Pool developer guide.

Signed-off-by: pz1116 <47019764+Pz1116@users.noreply.github.com>
2026-04-22 17:59:16 +08:00
Li Wang
c335710b82 [Misc] Bump mooncake version to v0.3.9 (#8445)
### What this PR does / why we need it?
This PR updates the `MOONCAKE_TAG` version from `v0.3.8.post1` to
`v0.3.9` across all Dockerfiles.

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-04-22 15:41:06 +08:00
liziyu
9e15ce7074 [Bugfix] [P/D] Fix connector with pcp dcp (#8538)
### What this PR does / why we need it?
Fix an issue where a request never returns because a specific NPU on
node D has no transmission tasks when DCP is enabled on node D.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
by nightly

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-04-21 22:23:21 +08:00
starmountain1997
2cb9f76a0f [CI] Ds32 ep aime2025 (#8496)
Backport of #7882 to releases/v0.18.0. Adds aime2025 benchmark test for
DeepSeek-V3.2-W8A8 EP with disaggregated prefill on A3 (4-node, 16 NPUs
per node, accuracy benchmark baseline 66.67%).

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-04-21 19:49:06 +08:00
herizhen
862bd8143c [Releases/v0.18.0][Doc][Misc] Added version description in the writing template (#8451)
### What this PR does / why we need it?
This PR updates the model deployment tutorial template to include a
requirement for authors to add a comment when code examples contain
version numbers. This ensures that users are prompted to use the version
appropriate for their specific environment.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A (Documentation change)

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
2026-04-21 16:33:43 +08:00
1kzk
7850264324 [v0.18.0][BugFix] PIECEWISE mode also requires synchronization (#8469)
### What this PR does / why we need it?

This PR enables synchronization for the `PIECEWISE` runtime mode in ACL
graph replay. Previously, synchronization was only performed in `FULL`
mode. However, `PIECEWISE` mode also requires this barrier to ensure
that parameter updates are completed before the graph is replayed,
preventing accuracy loss.

The logic is also corrected to skip synchronization specifically for
EAGLE draft models, as intended.

Fixes #

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

CI passed.

---------

Signed-off-by: 1zzk <785396250@qq.com>
2026-04-21 16:22:32 +08:00
yangjiuhua
b717dc17a3 [v0.18.0][Test][Misc] Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch (#8322)
### What this PR does / why we need it?
Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch
Test GLM5-W4A8 on v0.18.0.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: yangjiuhua <y00845194@china.huawei.com>
Co-authored-by: yangjiuhua <y00845194@china.huawei.com>
2026-04-21 14:10:11 +08:00
Li Wang
36a0470de1 [Doc] Upgrade env VLLM_ASCEND_ENABLE_FUSED_MC2 used in nightly test and tutorials (#8441)
### What this PR does / why we need it?
The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only be enabled on the
decoder node in the Prefill-Decode disaggregation scenario.
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-04-20 22:39:23 +08:00
ZT-AIA
3db5048d74 [CI]repair ci for custom op (#8455)
### What this PR does / why we need it?
Repair the nightly CI for custom ops.

Signed-off-by: ZT-AIA <1028681969@qq.com>
2026-04-20 17:51:37 +08:00
zyz111222
dd7e08c6db [Performance] Use forward_native for Conv3dLayer and add UT (#8375)
### What this PR does / why we need it?
Switch Ascend conv3d `forward_oot` to use `forward_native` and add a unit test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By CI

---------

Signed-off-by: zouyizhou <zouyizhou@huawei.com>
2026-04-20 17:20:40 +08:00
LQLlulu
c124e8df07 Revert "[BugFix] dispatch_ffn_combine kernal rollback combinev2 part … (#8439)
…(#8405)"

This reverts commit b992b11545.

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: l00893928 <liuquanlu@huawei.com>
Co-authored-by: l00893928 <liuquanlu@huawei.com>
2026-04-20 14:26:55 +08:00
zhangxinyuehfad
7a706fb197 [v0.18.0][CI] fix report_template.md (#8429)
### What this PR does / why we need it?

Fix `report_template.md`.

The error was caused by
https://github.com/vllm-project/vllm-ascend/pull/8340


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-04-20 11:13:14 +08:00
wangbj127
e6ba5a88f7 [v0.18.0][BugFix] Fix Qwen3.5 MoE FC1 error under high concurrency when dp>1 (#8395)
### What this PR does / why we need it?
GDN Attention uses FIA's padded query_start_loc, which may cause conv1d
update errors under high concurrency when dp > 1; this PR makes GDN use
its own unpadded query_start_loc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- vLLM version: v0.18.0

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-04-20 10:26:19 +08:00
LQLlulu
b992b11545 [BugFix] dispatch_ffn_combine kernal rollback combinev2 part (#8405)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: l00893928 <liuquanlu@huawei.com>
Co-authored-by: l00893928 <liuquanlu@huawei.com>
2026-04-18 22:45:08 +08:00
wangbj127
6bdc72949b Revert "[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded" (#8413)
Reverts vllm-project/vllm-ascend#8133
- Reversion of Logic: This pull request reverts the changes introduced
in a previous commit that attempted to handle dimension mismatches
during SP padding.

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-04-18 20:43:42 +08:00
wangxiaoteng888
363febb6cb [BugFix][v0.18.0] Gate recompute/balance/fused_mc2 by PD mode (#8374)
### What this PR does / why we need it?
- Enforce recompute scheduler only in PD-disaggregated mode.
- Enforce balance scheduling only in PD-mixed mode.
- Enforce fused MC2 only on PD-disaggregated D-side (kv_consumer).
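
A sketch of the gating described above; the mode/role spellings and the
helper name are assumptions, not the actual vllm-ascend code:

```python
def validate_pd_feature_gates(pd_mode: str, kv_role: str | None, *,
                              recompute: bool, balance: bool,
                              fused_mc2: bool) -> None:
    if recompute and pd_mode != "disaggregated":
        raise ValueError("recompute scheduler requires PD-disaggregated mode")
    if balance and pd_mode != "mixed":
        raise ValueError("balance scheduling requires PD-mixed mode")
    if fused_mc2 and not (pd_mode == "disaggregated"
                          and kv_role == "kv_consumer"):
        raise ValueError("fused MC2 requires the PD-disaggregated D side "
                         "(kv_role='kv_consumer')")
```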

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-04-18 18:06:42 +08:00
1kzk
c995a959e6 [BugFix] fix hang in async scheduling while open ENPU (#8354)
### What this PR does / why we need it?
1. There is no synchronization between steps. However, in async
scheduling with aclgraph, the CPU's record event for the current
iteration can complete before the previous iteration's graph execution
has finished. If the CPU is fast enough, the device will hang on
event_wait in iteration i+1 (assuming that event_record is executed
immediately on the device's update stream).
2. Under ENPU, EAGLE proposers also need to follow event.record first,
and then event.wait.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?

---------

Signed-off-by: 1zzk <785396250@qq.com>
2026-04-18 00:07:15 +08:00
shaopeng-666
f81f9a3c89 [Doc] Add Qwen3.5 fused MC2 known issue for 0.18.0 release (#8378)
### What this PR does / why we need it?
Show known issues for Qwen3.5-397B.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2026-04-17 22:54:21 +08:00
wangbj127
f2956ce944 [v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133)
Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858

### What this PR does / why we need it?
This PR fixes a `RuntimeError` (dimension mismatch) that occurs when
Sequence Parallelism (SP) is enabled and the padding added for SP causes
`num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases,
`_pad_query_start_loc_for_fia` adds a dummy request, increasing
`num_reqs_padded`. This mismatch between the actual number of requests
and the padded number of requests leads to errors in downstream token
count computations (e.g., `compute_num_computed_tokens`).

The fix modifies the restrictive condition `num_tokens_padded ==
num_tokens_unpadded` when reverting the dummy request padding if SP is
enabled, as SP padding is handled by stripping it after communication
and should not be treated as an additional request in the attention
metadata.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
vLLM version: v0.18.0
vLLM-Ascend version: releases/v0.18.0

Signed-off-by: Wangbj127 <wangbj1207@126.com>
2026-04-17 22:50:22 +08:00
aipaes
0954fd0912 [BugFix][0.18.0] Fix quant_bias missing in w8a8_static when flashcomm1 is enabled for GLM-5 (#8304)
### What this PR does / why we need it?
PR #8220 in v0.18.0

In a previous PR #7843, the o_proj layer of GLM-5 was reverted to TP
(Tensor Parallel) splitting when flashcomm1 was enabled. However, this
was a temporary workaround and did not address the root cause of the
precision issues observed in the o_proj layer under flashcomm1.

I am working on a definitive fix for this issue. Currently, a clear bug
has been identified in
880e20fdde/vllm_ascend/quantization/methods/w8a8_static.py (L124):
during quantized matrix multiplication, quant_bias is not added if
tp_rank > 0. In the flashcomm1 scenario, all ranks actually require the
addition of quant_bias, meaning tp_rank=0 should be passed to ensure the
bias is applied correctly.
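
Schematically (this is a reconstruction of the described logic, not the
actual `w8a8_static.py` code):

```python
import torch

def quant_matmul(x, weight, quant_bias, tp_rank, flashcomm1_enabled):
    # Bug: the bias was skipped whenever tp_rank > 0. Under flashcomm1
    # every rank needs the bias, so the fix is equivalent to passing
    # tp_rank=0 on all ranks.
    effective_rank = 0 if flashcomm1_enabled else tp_rank
    out = x @ weight
    if effective_rank == 0:
        out = out + quant_bias
    return out
```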

This PR aims to resolve this logic error and fix the underlying
precision issue.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
glm5 e2e test

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: triomino <15924998+triomino@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-04-17 22:46:36 +08:00
Zetong Li
b72ade9acd [0.18.0][BugFix] Update capture sizes after rounding operations (#8380)
### What this PR does / why we need it?
This PR is partially cherry-picked from #8172.

This PR aims to fix mismatched capture sizes after rounding operations
when using SP or speculative decoding. The root cause is that the
original `self.cudagraph_capture_sizes` is no longer updated and remains
at the initial sizes. Now we use
`self.cudagraph_dispatcher.get_capture_descs` to get the up-to-date sizes.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

Signed-off-by: Zetong Li <slippersss@126.com>
2026-04-17 22:46:16 +08:00
herizhen
76cc2204bd [Doc] Sensitive word modification (#8303)
### What this PR does / why we need it?
This PR updates the documentation to replace specific hardware terms
(e.g., HBM, 910B, 310P) with more generic or branded terms (e.g.,
on-chip memory, Atlas inference products) to comply with sensitive word
requirements.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
2026-04-17 16:30:00 +08:00
vllm-ascend-ci
9c1d58f4d2 [v0.18.0][Doc] Translated Doc files 2026-04-15 (#8309)
## Auto-Translation Summary

Translated **19** file(s):

- `docs/source/locale/zh_CN/LC_MESSAGES/community/contributors.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ModelRunner_prepare_inputs.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_single_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2.5.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-Omni.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Dense.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po`

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24447109402)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
2026-04-17 16:29:30 +08:00
pz1116
ceb1e49661 [BugFix][v0.18.0] fix remote KV waiting promotion in balance scheduler (#8280)
### What this PR does / why we need it?
## Problem

In PD-disaggregated serving with `mooncake_connector` and
`VLLM_ASCEND_BALANCE_SCHEDULING=1`, requests may enter
`WAITING_FOR_REMOTE_KVS` and never be promoted back to runnable state
after remote KV transfer finishes.

The issue is in `BalanceScheduler`'s handling of
`WAITING_FOR_REMOTE_KVS` requests. The current code treats
`_update_waiting_for_remote_kv()` as if it returns a boolean readiness
flag:

```python
is_ready = self._update_waiting_for_remote_kv(request)
if is_ready:
    ...
else:
    ...
```
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-04-17 10:06:36 +08:00
Frank Chen
f85144cc57 [BugFix][DSv32] Fix DSA-CP PD role gating for deepseek v3.2 (v0.18.0) (#8291)
### What this PR does / why we need it?
This PR backports the DSA-CP PD role gating fix to `releases/v0.18.0`.

The existing helper logic on the release branch does not handle the PD
mixed-role case correctly when deciding whether layer sharding or TP
`o_proj` handling should be enabled. Layer sharding should only run on
the P-side instance, while TP `o_proj` handling should stay enabled for
normal non-PD deployments and for the PD mixed-role (`kv_both`)
instance. This patch makes those conditions explicit and adds unit
coverage for the allowed and disallowed combinations, including the
DSA-CP-disabled path.

Such a wrong condition leads to **vllm serve failures** in the case
**FC1 + PD-colocated KV pooling + no layer_sharding**, specifically causing:
1. insufficient Available KV cache memory
2. o_proj shape error in sfa_v1 attention module

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both) + no layer_sharding

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-04-17 10:05:40 +08:00
sunshine202600
1dd1de8153 [Doc][Misc] Improve readability and fix typos in documentation (#8340)
### What this PR does / why we need it?

This PR improves the readability of the documentation by fixing typos,
correcting command extensions, and fixing broken links in the Chinese
README.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>
2026-04-17 08:54:38 +08:00
csoulnd
8952fddc7e [BugFix][310p][Cherry-pick] Handle null quantization config in ShardedStateLoader310&[Feature][310P] Support W8A8 dynamic linear method (#8296)
### What this PR does / why we need it?
This PR implements the `AscendW8A8DynamicLinearMethod310` quantization
scheme specifically for 310P hardware. It includes the logic for weight
retrieval, per-channel parameter generation, and the application of
dynamic quantization using NPU-specific kernels. Additionally, it
updates `ShardedStateLoader310` to handle quantization configurations
more robustly when generating parameter type maps.

Feedback from the review identified two critical issues in the
implementation:
1. The tensor squeezing logic in the `apply` method incorrectly handles
2D inputs, which may lead to shape mismatches in subsequent layers.
2. The weight tensor in `process_weights_after_loading` is transposed
after being converted to the private NZ format; the transpose operation
should be performed on the ND tensor before conversion to ensure correct
physical layout.
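
A sketch of the layout fix from review point 2, assuming
`torch_npu.npu_format_cast` and ACL format id 29 for FRACTAL_NZ (both are
assumptions here, and an NPU device is required to run this):

```python
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumed format id

def prepare_weight(weight: torch.Tensor) -> torch.Tensor:
    # Transpose while the weight is still a plain ND tensor, then cast;
    # transposing after the NZ cast would scramble the physical layout.
    weight_nd = weight.transpose(0, 1).contiguous()
    return torch_npu.npu_format_cast(weight_nd, ACL_FORMAT_FRACTAL_NZ)
```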

Cherry-picked from #7546 and #7725.
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests were added in
`tests/ut/_310p/quantization/test_w8a8_dynamic_310.py` to verify the
quantization method, and
`tests/ut/_310p/test_sharded_state_loader_310p.py` was updated to test
the state loader changes.

---------

Signed-off-by: csoulnd <daidaicurry@foxmail.com>
2026-04-16 16:53:39 +08:00
1kzk
52f0f9b5e4 [0.18.0][BugFix]: order acl graph updates before model forward for ENPU (#8317)
### What this PR does / why we need it?
For the ENPU scenario, it is required that device events follow the
principle of "record first, wait later", otherwise the inference process
may become stuck. However, in the current model_forward function,
event.wait precedes event.record. Therefore, for the ENPU scenario,
graph parameter updates should be performed before model execution.
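
The ordering principle, sketched with `torch.cuda` stream/event APIs as a
stand-in for the NPU equivalents (helpers are stubs; a CUDA/NPU device is
needed to run this):

```python
import torch

def apply_graph_param_updates():  # stand-in for the real update step
    pass

def replay_graph():  # stand-in for ACL graph replay
    pass

update_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
update_done = torch.cuda.Event()

with torch.cuda.stream(update_stream):
    apply_graph_param_updates()
    update_done.record()                    # record first...

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(update_done)  # ...then wait, never the reverse
    replay_graph()
```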

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?

---------

Signed-off-by: 1zzk <785396250@qq.com>
Signed-off-by: 1kzk <785396250@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-16 16:26:59 +08:00
Frank Chen
2ac8bfb4cb [BugFix] Enforce C locale for CPU binding subprocess parsing (#8261)
### What this PR does / why we need it?
This PR backports the CPU binding locale normalization fix from #7274 to
`releases/v0.18.0`, including the follow-up review fixes already applied
on `main`.

The change forces `LC_ALL`, `LANG`, and `LC_MESSAGES` to `C` before
spawning subprocesses in `vllm_ascend.cpu_binding.execute_command()`, so
parser-dependent command output stays stable on localized systems. It
also handles `subprocess.TimeoutExpired` by killing the child process
before collecting output, and updates the existing unit tests to keep
command-argument coverage while adding timeout-path coverage.
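
A minimal sketch of the pattern (the real `execute_command` in
`vllm_ascend.cpu_binding` may differ in signature and error handling):

```python
import os
import subprocess

def execute_command(args, timeout=10):
    # Pin the locale so parser-dependent command output stays in English
    # regardless of the inherited environment.
    env = os.environ.copy()
    env.update({"LC_ALL": "C", "LANG": "C", "LC_MESSAGES": "C"})
    proc = subprocess.Popen(args, env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()            # reap the child before collecting output
        out, err = proc.communicate()
    return out, err
```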

Fixes #6992

### Does this PR introduce _any_ user-facing change?
Yes.

Users running CPU binding on non-English OS environments should now get
consistent English subprocess output for parser-dependent commands,
avoiding failures caused by inherited locale settings.

### How was this patch tested?
- Updated the existing unit tests in
`tests/ut/device_allocator/test_cpu_binding.py` to assert the locale
environment, retain command argument coverage, and cover the timeout
cleanup path.
- Attempted to run targeted pytest cases locally, but the pytest
invocation did not complete normally in this environment, so I could not
record a clean passing run here.

Attribution:
- Co-authored-by: stdjhs <1601599324@qq.com>
- Signed-off-by: chenchuw886 <chenchuw@huawei.com>

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: stdjhs <1601599324@qq.com>
2026-04-16 16:17:10 +08:00
pz1116
8bc72a807a [BugFix][v0.18.0] require piecewise cudagraph for layerwise AscendSto… (#8282)
### What this PR does / why we need it?
Ref: https://github.com/vllm-project/vllm-ascend/issues/8184.
Following https://github.com/vllm-project/vllm/pull/31057, add
`requires_piecewise_for_cudagraph` to `AscendStoreConnector`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-04-16 10:40:14 +08:00
chenweiqiang11
028b8cabc4 [BugFix][Platform] Fix extra function name in final chunk of streaming tool calls (#8178)
### What this PR does / why we need it?

Fix a bug in the GLM tool call parser where the `function.name` field
was incorrectly included in the final (non-first) chunks of streaming
tool calls.

Per OpenAI streaming semantics, `id`, `type`, and `function.name` must
only appear in the **first** chunk for a given tool call index. When
`_create_remaining_args_delta` was called for continuing/finishing
chunks, it was incorrectly reading the function name from
`delta_message.tool_calls` and re-emitting it, causing clients to see a
duplicate/extra function name in the final chunk.

**Root cause**: The original code always looked up the tool call in
`delta_message.tool_calls` to get the name, id, and type — even when
this was not the first chunk being streamed. This caused the function
name to appear again in the final argument-completion chunk.

**Fix**:
- Track whether arguments have already been streamed
(`already_streamed_args`) for each tool call index.
- Only populate `fallback_tool_call_id`, `fallback_tool_call_type`, and
`fallback_tool_call_name` when `already_streamed_args` is empty (i.e.,
this is genuinely the first chunk).
- Refactored `_create_remaining_args_delta` to omit header fields
entirely when all fallback values are `None`, which is the correct
behavior for continuing/finishing chunks.
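
Schematically, the intended OpenAI streaming shape (payloads are
illustrative, not the parser's internal objects):

```python
# First chunk for a tool-call index carries the header fields exactly once.
first_delta = {
    "index": 0,
    "id": "call_abc123",  # example id
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"cit'},
}
# Continuing/finishing chunks for the same index carry arguments only.
later_delta = {
    "index": 0,
    "function": {"arguments": 'y": "Paris"}'},
}
```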

### Does this PR introduce _any_ user-facing change?

Yes. Clients consuming the streaming tool call response will no longer
receive a duplicate `function.name` in the final chunk. This fixes
incorrect behavior visible in the OpenAI-compatible streaming API output
for GLM models using tool calls.

### How was this patch tested?

- Code review and logic analysis of the streaming tool call path in
`patch_glm_tool_call_parser.py`.
- Existing unit tests in
`tests/ut/platform/test_patch_glm_tool_call_parser.py`.

---------

Signed-off-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
Signed-off-by: chenweiqiang11 <chenweiqiang11@noreply.github.com>
Co-authored-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
2026-04-15 17:50:10 +08:00
zhangxinyuehfad
808d00406f [v0.18.0][CI]Add rank0 process count check for DeepSeek-R1-W8A8-HBM test (#8072)
### What this PR does / why we need it?
Adds a `check_rank0_process_count` validation step to the
DeepSeek-R1-W8A8-HBM nightly single-node test.

The check verifies that after the server starts, there is **exactly 1**
`vllm serve` process running on rank0. This guards against the
regression fixed in #8041 (extra NPU context leaking on device 0),
ensuring it does not silently reappear in future releases.

#### Changes

-
**`tests/e2e/nightly/single_node/models/scripts/test_single_node.py`**:
Add `run_check_rank0_process_count` async handler. It calls `npu-smi
info` for diagnostics, then uses `psutil` to assert exactly one `vllm
serve` process exists on rank0.
-
**`tests/e2e/nightly/single_node/models/configs/DeepSeek-R1-W8A8-HBM.yaml`**:
Register `check_rank0_process_count` in the `test_content` list for the
HBM test case.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-04-15 17:16:27 +08:00
herizhen
95726d20eb [Doc][Misc] Correcting the document and uploading the model deployment template (#8287)
### What this PR does / why we need it?
Correcting the document and uploading the model deployment template

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
2026-04-15 16:03:11 +08:00