### What this PR does / why we need it?
Introduce a check that disables asynchronous communication under the
`enable_dsa_cp_with_layer_shard` branch in capturing mode. This change
prevents potential stream and event issues when operating in
graph/capturing mode.
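The guard described above can be pictured roughly as follows. This is only a minimal sketch: the function name `async_comm_allowed` and the `is_capturing` argument are hypothetical and do not come from the patch itself; only `enable_dsa_cp_with_layer_shard` is named in this PR.

```python
def async_comm_allowed(enable_dsa_cp_with_layer_shard: bool, is_capturing: bool) -> bool:
    """Hypothetical guard: forbid async communication while capturing.

    Only the layer-shard branch is affected; other branches are left
    untouched by this check.
    """
    if enable_dsa_cp_with_layer_shard and is_capturing:
        # Graph capture is in progress, so fall back to synchronous communication.
        return False
    return True
```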
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both)
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
### What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.
### Does this PR introduce _any_ user-facing change?
No, this PR contains documentation-only updates.
### How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
### What this PR does / why we need it?
1. Add version notes for GLM5.
2. Add parameter modification for GLM5.
3. Add GLM5 to supported model list.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
35141a7eed
---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>
Co-authored-by: Zhu Jiyang <zhujiyang2@huawei.com>
### What this PR does / why we need it?
This PR updates the GLM-5 documentation to include:
- Information about the first supported version
(`vllm-ascend:v0.17.0rc1`).
- Updated `--additional-config` parameters to use the new nested
`ascend_compilation_config` structure.
- Added `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable to
deployment scripts.
- Improved formatting of deployment steps.
- A new "Notice" section explaining optimization environment variables
(`VLLM_ASCEND_ENABLE_FLASHCOMM1`, `VLLM_ASCEND_ENABLE_FUSED_MC2`,
`VLLM_ASCEND_ENABLE_MLAPO`).
- A "Best Practices" section for prefill-decode disaggregation.
- An "FAQ" section addressing common tokenizer issues and function
calling configuration.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
Documentation changes were verified for correctness and formatting.
---------
Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>
### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace `SamplingParams` argument usage from `max_completion_tokens`
to `max_tokens` (**offline** inference currently **does not support**
`max_completion_tokens`)
``` bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
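For scripts that reuse serving-style arguments, the rename can be applied before constructing `SamplingParams`. The helper below is a hypothetical illustration, not part of vLLM or of this PR; it only renames the key in a plain dictionary and assumes an explicit `max_tokens` should win if both keys are present.

```python
def to_offline_sampling_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: rename max_completion_tokens to max_tokens."""
    normalized = dict(kwargs)  # copy so the caller's dict is not mutated
    if "max_completion_tokens" in normalized:
        value = normalized.pop("max_completion_tokens")
        # Keep an explicit max_tokens if the caller already set one.
        normalized.setdefault("max_tokens", value)
    return normalized


params = to_offline_sampling_kwargs(
    {"temperature": 0.6, "top_p": 0.95, "top_k": 40, "max_completion_tokens": 32}
)
print(params)
# {'temperature': 0.6, 'top_p': 0.95, 'top_k': 40, 'max_tokens': 32}
```

The resulting dictionary could then be passed as `SamplingParams(**params)` in an offline script.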
- refresh the recommended environment variables for
**Qwen3-Omni-30B-A3B-Thinking**
``` bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
``` bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
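The size check in the log above can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions not stated in the log: that the formula yields bytes, and that "MB" means 2^20 bytes. The function name is hypothetical; the parameter names mirror those printed in the error message.

```python
def needed_hccl_buffsize_bytes(max_bs, token_need_size_dispatch, ep_world_size,
                               local_moe_expert_num, token_need_size_combine,
                               k, shared_expert_num):
    """Evaluate the NEEDED_HCCL_BUFFSIZE expression quoted in the error log."""
    dispatch = max_bs * token_need_size_dispatch * ep_world_size * local_moe_expert_num
    combine = max_bs * token_need_size_combine * (k + shared_expert_num)
    return (dispatch + combine) * 2


# Values taken from the log message.
needed = needed_hccl_buffsize_bytes(
    max_bs=256, token_need_size_dispatch=4608, ep_world_size=2,
    local_moe_expert_num=64, token_need_size_combine=4096,
    k=8, shared_expert_num=0,
)
print(needed)          # 318767104
print(needed / 2**20)  # 304.0
```

Under these assumptions the expression evaluates to 304 MB, one less than the 305MB the log reports; the extra 1 MB is presumably a rounding margin added by the check, which is not confirmed here. Either way, the requirement exceeds the 200MB setting in the log and fits within the recommended `HCCL_BUFFSIZE=512`.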
- fix **Qwen3-reranker** example usage to match the current **pooling
runner** interface and score output access
``` python
model = LLM(
model=model_name,
task="score", # need fix
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
--->
``` python
model = LLM(
model=model_name,
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
- modify **PaddleOCR-VL** parameter `TASK_QUEUE_ENABLE` from `2` to `1`
``` bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users when following the tutorials
directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
1. Add a description of another version of the glm5-w4a8 weights.
2. Update the installation introduction.
3. Introduce a script to enable bf16 MTP.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
N/A
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.15.0
vLLM main:
9562912cea
This pull request refines the GLM-5 deployment documentation by updating
the Docker run command to include a more comprehensive set of device
mappings and by removing an extraneous quantization flag from the `vllm
serve` commands. These changes aim to correct and clarify the deployment
instructions, ensuring users can successfully set up and run the GLM-5
model as intended.
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: Canlin Guo <961750412@qq.com>
### What this PR does / why we need it?
Add GLM5 doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: nakairika <982275964@qq.com>