[Doc][Misc] Improve readability and fix typos in documentation (#8340)

### What this PR does / why we need it?

This PR improves the readability of the documentation by fixing typos,
correcting command extensions, and fixing broken links in the Chinese
README.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>

@@ -257,7 +257,7 @@ Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 nodes)
 - `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` require 4 Atlas 800 A3 (64G × 16).
-To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests.
+To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_online_dp.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests.
 1. `launch_online_dp.py` to launch external dp vllm servers.
 [launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)

@@ -22,7 +22,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `GLM-4.6`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6).
 - `GLM-4.7`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7).
 - `GLM-4.5-w8a8-with-float-mtp`(Quantized version with mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8).
-- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm do not support GLM4.6 mtp in October, so we do not provide mtp version. And last month, it supported, you can use the following quantization scheme to add mtp weights to Quantized weights.
+- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm does not support GLM4.6 mtp in October, we do not provide an mtp version. Last month, it was supported; you can use the following quantization scheme to add mtp weights to the quantized weights.
 - `GLM-4.7-w8a8-with-float-mtp`(Quantized version without mtp): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8-floatmtp).
 - `Method of Quantify`: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use these methods to quantify the model.
@@ -38,7 +38,7 @@ You can use our official docker image to run `GLM-4.x` directly.
 ::::{tab-item} A3 series
 :sync: A3
-Start the docker image on your each node.
+Start the docker image on each node.
 ```{code-block} bash
 :substitutions:

@@ -2,7 +2,7 @@
 ## Introduction
-[GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts (MoE) architecture and targeting at complex systems engineering and long-horizon agentic tasks.
+[GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts (MoE) architecture and targets at complex systems engineering and long-horizon agentic tasks.
 The `GLM-5` model is first supported in `vllm-ascend:v0.17.0rc1`. In `vllm-ascend:v0.17.0rc1` and `vllm-ascend:v0.18.0rc1` , the version of transformers need to be upgraded to 5.2.0.
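The transformers requirement mentioned above can be sanity-checked before launching the server. The snippet below is a minimal sketch; reading "upgraded to 5.2.0" as `>= 5.2.0` is an assumption, so adjust the comparison if an exact pin is required.

```python
# Minimal check for the transformers requirement named in the doc above.
# Assumption: "upgraded to 5.2.0" is interpreted as ">= 5.2.0".
from packaging import version
import transformers

required = version.parse("5.2.0")
installed = version.parse(transformers.__version__)
assert installed >= required, f"transformers {installed} < required {required}"
```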

@@ -2,7 +2,7 @@
 ## Introduction
-Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output.
+Qwen3-Omni is a native end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B, which contains the thinker component, is equipped with chain-of-thought reasoning and supports audio, video, and text input, with text output.
 This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation.

@@ -130,14 +130,14 @@ model_name = "Qwen/Qwen3-VL-Reranker-8B"
 # What is the difference between the official original version and one
 # that has been converted into a sequence classification model?
-# Qwen3-Reranker is a language model that doing reranker by using the
+# Qwen3-VL-Reranker is a language model that doing reranker by using the
 # logits of "no" and "yes" tokens.
-# It needs to computing 151669 tokens logits, making this method extremely
-# inefficient, not to mention incompatible with the vllm score API.
+# It needs to compute 151669 tokens logits, making this method extremely
+# inefficient, not to mention incompatible with the vLLM score API.
 # A method for converting the original model into a sequence classification
 # model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
 # Models converted offline using this method can not only be more efficient
-# and support the vllm score API, but also make the init parameters more
+# and support the vLLM score API, but also make the init parameters more
 # concise, for example.
 # model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")
@@ -163,9 +163,9 @@ model = LLM(
 )
 # Why do we need hf_overrides for the official original version:
-# vllm converts it to Qwen3VLForSequenceClassification when loaded for
+# vLLM converts it to Qwen3VLForSequenceClassification when loaded for
 # better performance.
-# - Firstly, we need using `"architectures": ["Qwen3VLForSequenceClassification"],`
+# - Firstly, we need to use `"architectures": ["Qwen3VLForSequenceClassification"],`
 # to manually route to Qwen3VLForSequenceClassification.
 # - Then, we will extract the vector corresponding to classifier_from_token
 # from lm_head using `"classifier_from_token": ["no", "yes"]`.

@@ -514,7 +514,7 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl
 - `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency.
 - `cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`. And the min is `n = 1` and the max is `n = max-num-seqs`. For other values, it is recommended to set them to the number of frequently occurring requests on the Decode (D) node.
 - `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.
-- `no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)
+- `--no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)
 4. Run the `proxy.sh` script on the prefill master node
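The `cudagraph_capture_sizes` recommendation above can be made concrete with a small sketch. The values of `mtp` and `max-num-seqs` below are hypothetical placeholders; substitute those of your own deployment.

```python
# Worked example of the recommended capture sizes: n * (mtp + 1),
# with n running from 1 up to max-num-seqs. Values here are hypothetical.
mtp = 1            # number of speculative (MTP) tokens
max_num_seqs = 16  # --max-num-seqs of the decode node
cudagraph_capture_sizes = [n * (mtp + 1) for n in range(1, max_num_seqs + 1)]
print(cudagraph_capture_sizes)  # [2, 4, 6, ..., 30, 32]
```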

@@ -95,7 +95,7 @@ Processed prompts: 100%|██████████████████
 ## Performance
-Run performance of `Qwen3-Reranker-8B` as an example.
+Run performance of `Qwen3-Embedding-8B` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
 Take the `serve` as an example. Run the code as follows.
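The benchmark command that follows in the original document is not part of this hunk. As a rough stand-in, the sketch below times a batch of embedding requests against an already running `Qwen3-Embedding-8B` server; it assumes the default OpenAI-compatible endpoint at `localhost:8000` and is no substitute for the vllm benchmark tooling referenced above.

```python
# Rough latency/throughput sanity check against a running
# `vllm serve Qwen/Qwen3-Embedding-8B` endpoint. Illustration only;
# use the vllm benchmark suite for real measurements.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
texts = ["The capital of China is Beijing."] * 32

start = time.perf_counter()
resp = client.embeddings.create(model="Qwen/Qwen3-Embedding-8B", input=texts)
elapsed = time.perf_counter() - start

print(f"{len(resp.data)} embeddings in {elapsed:.2f}s "
      f"({len(texts) / elapsed:.1f} req/s)")
```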

@@ -95,14 +95,14 @@ model_name = "Qwen/Qwen3-Reranker-8B"
 # What is the difference between the official original version and one
 # that has been converted into a sequence classification model?
-# Qwen3-Reranker is a language model that doing reranker by using the
+# Qwen3-Reranker is a language model that does reranker by using the
 # logits of "no" and "yes" tokens.
-# It needs to computing 151669 tokens logits, making this method extremely
-# inefficient, not to mention incompatible with the vllm score API.
+# It needs to compute 151669 tokens logits, making this method extremely
+# inefficient, not to mention incompatible with the vLLM score API.
 # A method for converting the original model into a sequence classification
 # model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
 # Models converted offline using this method can not only be more efficient
-# and support the vllm score API, but also make the init parameters more
+# and support the vLLM score API, but also make the init parameters more
 # concise, for example.
 # model = LLM(model="Qwen/Qwen3-Reranker-8B", task="score")
@@ -120,7 +120,7 @@ model = LLM(
 )
 # Why do we need hf_overrides for the official original version:
-# vllm converts it to Qwen3ForSequenceClassification when loaded for
+# vLLM converts it to Qwen3ForSequenceClassification when loaded for
 # better performance.
 # - Firstly, we need using `"architectures": ["Qwen3ForSequenceClassification"],`
 # to manually route to Qwen3ForSequenceClassification.
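A minimal offline sketch of the `hf_overrides` route described in these comments is shown below. The `architectures` and `classifier_from_token` keys are the ones named above; the `is_original_qwen3_reranker` flag is taken from the upstream Qwen3-Reranker example and is an assumption here, and the model's query/document prompt template is omitted for brevity.

```python
# Minimal sketch: load the official original Qwen3-Reranker weights, route them
# to Qwen3ForSequenceClassification via hf_overrides, then score with the vLLM
# score API. For real use, also apply the query/document template from the
# Qwen3-Reranker model card (omitted here for brevity).
from vllm import LLM

model = LLM(
    model="Qwen/Qwen3-Reranker-8B",
    task="score",
    hf_overrides={
        # Keys named in the comments above.
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        # Taken from the upstream Qwen3-Reranker example; an assumption here.
        "is_original_qwen3_reranker": True,
    },
)

queries = ["What is the capital of China?"]
documents = ["The capital of China is Beijing."]
outputs = model.score(queries, documents)
print([output.outputs.score for output in outputs])
```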