[Doc][releases/v0.18.0] fix documentation error or non-standard description (#8626)

### What this PR does / why we need it?
Fix documentation errors and non-standard descriptions in the releases/v0.18.0 branch.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation check.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Author: linfeng-yuan
Date: 2026-04-23 18:55:44 +08:00
Committed by: GitHub
Parent: 786eaf8b07
Commit: 5c048a9b71
15 changed files with 39 additions and 40 deletions


@@ -132,12 +132,11 @@ Once the script completes, you can find the results in the benchmarks/results fo
 ```shell
 .
-|-- serving_qwen2_5_7B_tp1_qps_1.json
-|-- serving_qwen2_5_7B_tp1_qps_16.json
-|-- serving_qwen2_5_7B_tp1_qps_4.json
-|-- serving_qwen2_5_7B_tp1_qps_inf.json
-|-- latency_qwen2_5_7B_tp1.json
-|-- throughput_qwen2_5_7B_tp1.json
+|-- serving_qwen2_5_7Bvl_tp1_qps_1.json
+|-- serving_qwen2_5_7Bvl_tp1_qps_16.json
+|-- serving_qwen2_5_7Bvl_tp1_qps_4.json
+|-- serving_qwen2_5_7Bvl_tp1_qps_inf.json
+|-- throughput_qwen2_5_7Bvl_tp1.json
 ```
 These files contain detailed benchmarking results for further analysis.
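The corrected file names above follow a fixed `_qps_<rate>` convention. As a small sketch only (assuming that convention holds, as the tree above suggests), the request rate can be recovered from a serving-result name like this:

```python
def qps_from_name(name: str) -> str:
    """Extract the request rate from names like serving_..._qps_<rate>.json."""
    # Split on the last "_qps_" and drop the ".json" suffix (Python 3.9+).
    return name.rsplit("_qps_", 1)[1].removesuffix(".json")

print(qps_from_name("serving_qwen2_5_7Bvl_tp1_qps_16.json"))   # 16
print(qps_from_name("serving_qwen2_5_7Bvl_tp1_qps_inf.json"))  # inf
```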


@@ -138,23 +138,23 @@ msgstr "**注意:**"
 #: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:108
 #, python-brace-format
 msgid ""
-"for vllm version below `v0.12.0` use parameter: `--rope_scaling "
+"for vllm version below `v0.12.0` use parameter: `--rope-scaling "
 "'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
 " \\`"
 msgstr ""
-"对于 vllm 版本低于 `v0.12.0`,使用参数:`--rope_scaling "
+"对于 vllm 版本低于 `v0.12.0`,使用参数:`--rope-scaling "
 "'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
 " \\`"
 #: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:109
 #, python-brace-format
 msgid ""
-"for vllm version `v0.12.0` use parameter: `--hf-overrides "
+"for vllm version same as or newer than `v0.12.0` use parameter: `--hf-overrides "
 "'{\"rope_parameters\": "
 "{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
 " \\`"
 msgstr ""
-"对于 vllm 版本 `v0.12.0`,使用参数:`--hf-overrides '{\"rope_parameters\": "
+"对于 vllm 版本 `v0.12.0`及以上,使用参数:`--hf-overrides '{\"rope_parameters\": "
 "{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
 " \\`"


@@ -210,8 +210,8 @@ msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_
 #: ../../source/tutorials/models/Qwen2.5-Omni.md:181
 msgid ""
 "After execution, you can get the result, here is the result of `Qwen2.5"
-"-Omni-7B` with `vllm-ascend:0.11.0rc0` for reference only."
-msgstr "执行后,您可以获得结果,以下是 `Qwen2.5-Omni-7B` 在 `vllm-ascend:0.11.0rc0` 上的结果,仅供参考。"
+"-Omni-7B` with `vllm-ascend:v0.11.0rc0` for reference only."
+msgstr "执行后,您可以获得结果,以下是 `Qwen2.5-Omni-7B` 在 `vllm-ascend:v0.11.0rc0` 上的结果,仅供参考。"
 #: ../../source/tutorials/models/Qwen2.5-Omni.md:91
 msgid "dataset"


@@ -218,14 +218,14 @@ msgstr ""
 #: ../../source/tutorials/models/Qwen3-235B-A22B.md:130
 #, python-brace-format
 msgid ""
-"For vllm version below `v0.12.0`, use parameter: `--rope_scaling "
+"For vllm version below `v0.12.0`, use parameter: `--rope-scaling "
 "'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
 " \\`. If you are using weights like [Qwen3-235B-A22B-"
 "Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)"
 " which originally supports long contexts, there is no need to add this "
 "parameter."
 msgstr ""
-"对于 `v0.12.0` 以下版本的 vLLM,使用参数:`--rope_scaling "
+"对于 `v0.12.0` 以下版本的 vLLM,使用参数:`--rope-scaling "
 "'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
 " \\`。如果您使用的是像 [Qwen3-235B-A22B-"
 "Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)"
@@ -452,8 +452,8 @@ msgstr "详情请参阅 [使用 AISBench](../../developer_guide/evaluation/using
 #: ../../source/tutorials/models/Qwen3-235B-A22B.md:285
 msgid ""
 "After execution, you can get the result, here is the result of `Qwen3"
-"-235B-A22B-w8a8` in `vllm-ascend:0.11.0rc0` for reference only."
-msgstr "执行后,您将获得结果。以下是 `vllm-ascend:0.11.0rc0` 中 `Qwen3-235B-A22B-w8a8` 的结果,仅供参考。"
+"-235B-A22B-w8a8` in `vllm-ascend:v0.11.0rc0` for reference only."
+msgstr "执行后,您将获得结果。以下是 `vllm-ascend:v0.11.0rc0` 中 `Qwen3-235B-A22B-w8a8` 的结果,仅供参考。"
 #: ../../source/tutorials/models/Qwen3-235B-A22B.md:76
 msgid "dataset"


@@ -155,8 +155,8 @@ msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_
 #: ../../source/tutorials/models/Qwen3-Coder-30B-A3B.md:95
 msgid ""
 "After execution, you can get the result, here is the result of `Qwen3"
-"-Coder-30B-A3B-Instruct` in `vllm-ascend:0.11.0rc0` for reference only."
-msgstr "执行后,您可以获得结果。以下是 `Qwen3-Coder-30B-A3B-Instruct` 在 `vllm-ascend:0.11.0rc0` 中的结果,仅供参考。"
+"-Coder-30B-A3B-Instruct` in `vllm-ascend:v0.11.0rc0` for reference only."
+msgstr "执行后,您可以获得结果。以下是 `Qwen3-Coder-30B-A3B-Instruct` 在 `vllm-ascend:v0.11.0rc0` 中的结果,仅供参考。"
 #: ../../source/tutorials/models/Qwen3-Coder-30B-A3B.md:29
 msgid "dataset"


@@ -105,8 +105,8 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
 **Notice:**
-- for vllm version below `v0.12.0` use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`
-- for vllm version `v0.12.0` use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`
+- for vllm version below `v0.12.0` use parameter: `--rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`
+- for vllm version same as or newer than `v0.12.0` use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`
 The parameters are explained as follows:
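Both corrected payloads in the notice above are plain JSON, with the rope settings flat before `v0.12.0` and nested under `rope_parameters` from `v0.12.0` on. A quick sanity check with Python's stdlib `json` (a sketch for illustration, not part of the patched docs):

```python
import json

# Payload for the pre-v0.12.0 `--rope-scaling` flag: flat keys.
pre_v012 = json.loads(
    '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}'
)
# Payload for the v0.12.0+ `--hf-overrides` flag: nested under "rope_parameters".
v012_plus = json.loads(
    '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,'
    '"factor":4,"original_max_position_embeddings":32768}}'
)

print(pre_v012["factor"])                         # 4
print(v012_plus["rope_parameters"]["rope_type"])  # yarn
```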


@@ -244,7 +244,7 @@ python load_balance_proxy_server_example.py \
 --host 192.0.0.1 \
 --port 8080 \
 --prefiller-hosts 192.0.0.1 \
---prefiller-port 13700 \
+--prefiller-ports 13700 \
 --decoder-hosts 192.0.0.1 \
 --decoder-ports 13701
 ```
@@ -252,7 +252,7 @@ python load_balance_proxy_server_example.py \
 |Parameter | Meaning |
 | --- | --- |
 | --port | Port of proxy |
-| --prefiller-port | All ports of prefill |
+| --prefiller-ports | All ports of prefill |
 | --decoder-ports | All ports of decoder |
 ## Verification
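The rename to `--prefiller-ports` brings the flag in line with the plural `--decoder-ports`. As a hypothetical sketch only (the real `load_balance_proxy_server_example.py` is not shown here and may differ), paired host/port flags like these are typically declared with `argparse` and `nargs`:

```python
import argparse

# Hypothetical flag declarations mirroring the documented command line;
# dashes in flag names become underscores in the parsed attributes.
parser = argparse.ArgumentParser()
parser.add_argument("--host")
parser.add_argument("--port", type=int)
parser.add_argument("--prefiller-hosts", nargs="+", default=[])
parser.add_argument("--prefiller-ports", nargs="+", type=int, default=[])
parser.add_argument("--decoder-hosts", nargs="+", default=[])
parser.add_argument("--decoder-ports", nargs="+", type=int, default=[])

args = parser.parse_args([
    "--host", "192.0.0.1", "--port", "8080",
    "--prefiller-hosts", "192.0.0.1", "--prefiller-ports", "13700",
    "--decoder-hosts", "192.0.0.1", "--decoder-ports", "13701",
])
print(args.prefiller_ports, args.decoder_ports)  # [13700] [13701]
```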


@@ -178,7 +178,7 @@ Qwen2.5-Omni on vllm-ascend has been tested on AISBench.
 1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen2.5-Omni-7B` with `vllm-ascend:0.11.0rc0` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen2.5-Omni-7B` with `vllm-ascend:v0.11.0rc0` for reference only.
 | dataset | platform | metric | mode | vllm-api-stream-chat |
 |----- | ----- | ----- | ----- | -----|


@@ -127,7 +127,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
 - [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) originally only supports 40960 context(max_position_embeddings). If you want to use it and its related quantization weights to run long seqs (such as 128k context), it is required to use yarn rope-scaling technique.
 - For vLLM version same as or new than `v0.12.0`, use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`.
-- For vllm version below `v0.12.0`, use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
+- For vllm version below `v0.12.0`, use parameter: `--rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
 If you are using weights like [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) which originally supports long contexts, there is no need to add this parameter.
 The parameters are explained as follows:
@@ -150,7 +150,7 @@ The parameters are explained as follows:
 ### Multi-node Deployment with MP (Recommended)
-Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
+Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to deploy the `Qwen3-235B-A22B-Instruct` model across multiple nodes.
 Node 0
@@ -282,7 +282,7 @@ Here are two accuracy evaluation methods.
 1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in `vllm-ascend:0.11.0rc0` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in `vllm-ascend:v0.11.0rc0` for reference only.
 | dataset | version | metric | mode | vllm-api-general-chat |
 |----- | ----- | ----- | ----- | -----|
@@ -310,7 +310,7 @@ Take the `serve` as an example. Run the code as follows.
 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 After about several minutes, you can get the performance evaluation result.
@@ -589,7 +589,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
 PD proxy:
 ```shell
-python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-port 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
+python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-ports 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
 ```
 Benchmark scripts:


@@ -108,10 +108,10 @@ curl http://localhost:8000/v1/completions \
 -d '{
 "model": "qwen3-32b-w4a4",
 "prompt": "what is large language model?",
-"max_completion_tokens": "128",
-"top_p": "0.95",
-"top_k": "40",
-"temperature": "0.0"
+"max_completion_tokens": 128,
+"top_p": 0.95,
+"top_k": 40,
+"temperature": 0
 }'
 ```
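The unquoting fix above is not cosmetic: in JSON, `"0.95"` is a string while `0.95` is a number, and a server validating an OpenAI-compatible request schema may reject or mishandle string-typed sampling parameters. A minimal check with Python's stdlib `json` (a sketch for illustration only):

```python
import json

# Old body: every sampling parameter quoted, so each parses as a string.
old = json.loads('{"max_completion_tokens": "128", "top_p": "0.95", '
                 '"top_k": "40", "temperature": "0.0"}')
# Corrected body: bare JSON numbers parse as int/float.
new = json.loads('{"max_completion_tokens": 128, "top_p": 0.95, '
                 '"top_k": 40, "temperature": 0}')

print(type(old["top_p"]).__name__)  # str
print(type(new["top_p"]).__name__)  # float
```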


@@ -106,10 +106,10 @@ curl http://localhost:8000/v1/completions \
 -d '{
 "model": "qwen3-8b-w4a8",
 "prompt": "what is large language model?",
-"max_completion_tokens": "128",
-"top_p": "0.95",
-"top_k": "40",
-"temperature": "0.0"
+"max_completion_tokens": 128,
+"top_p": 0.95,
+"top_k": 40,
+"temperature": 0
 }'
 ```


@@ -92,7 +92,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
 1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen3-Coder-30B-A3B-Instruct` in `vllm-ascend:0.11.0rc0` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen3-Coder-30B-A3B-Instruct` in `vllm-ascend:v0.11.0rc0` for reference only.
 | dataset | version | metric | mode | vllm-api-general-chat |
 |----- | ----- | ----- | ----- | -----|


@@ -71,7 +71,7 @@ Start the server with the following command:
 vllm serve Qwen/Qwen3-VL-Reranker-8B \
 --runner pooling \
 --max-model-len 4096 \
---hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
+--hf-overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
 --chat-template ./qwen3_vl_reranker.jinja
 ```


@@ -35,7 +35,7 @@ Using the Qwen3-Reranker-8B model as an example, first run the docker container
 ### Online Inference
 ```bash
-vllm serve Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 8888 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
+vllm serve Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 8888 --hf-overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
 ```
 Once your server is started, you can send request with follow examples.


@@ -42,7 +42,7 @@ vllm_bench_cases = {
 "random_input_len": 128,
 "max_concurrency": 40,
 "random_output_len": 100,
-"temperature": 0.0,
+"temperature": 0,
 }
 # NOTE: Any changes for the baseline throughput should be approved by team members.