[Doc][releases/v0.18.0] fix documentation error or non-standard description (#8626)
### What this PR does / why we need it?

Fix documentation error or non-standard description in releases/v0.18.0 branch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation check.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@@ -105,8 +105,8 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
**Notice:**
-- for vllm version below `v0.12.0` use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`
-- for vllm version `v0.12.0` use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`
+- for vllm version below `v0.12.0` use parameter: `--rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`
+- for vllm version same as or newer than `v0.12.0` use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`
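For the `v0.12.0`-and-newer form, a complete invocation might look like the sketch below; only the `--hf-overrides` value comes from the doc change above, while the tensor-parallel size and max model length are illustrative assumptions.

```shell
# Sketch only: --tensor-parallel-size and --max-model-len are illustrative
# assumptions; the --hf-overrides payload is the one documented above.
vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'
```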
The parameters are explained as follows:
@@ -244,7 +244,7 @@ python load_balance_proxy_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 \
---prefiller-port 13700 \
+--prefiller-ports 13700 \
--decoder-hosts 192.0.0.1 \
--decoder-ports 13701
```
@@ -252,7 +252,7 @@ python load_balance_proxy_server_example.py \
|Parameter | Meaning |
| --- | --- |
| --port | Port of proxy |
-| --prefiller-port | All ports of prefill |
+| --prefiller-ports | All ports of prefill |
| --decoder-ports | All ports of decoder |
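Since `--prefiller-ports` and `--decoder-ports` take all ports of the prefill and decode instances, a multi-instance invocation might look like the sketch below; the IP addresses and ports are placeholders, and the space-separated list syntax is an assumption about how the example script parses these flags.

```shell
# Sketch only: IPs/ports are placeholders, and the space-separated list syntax
# is an assumption about load_balance_proxy_server_example.py.
python load_balance_proxy_server_example.py \
    --host 192.0.0.1 \
    --port 8080 \
    --prefiller-hosts 192.0.0.2 192.0.0.3 \
    --prefiller-ports 13700 13700 \
    --decoder-hosts 192.0.0.4 \
    --decoder-ports 13701
```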
## Verification
@@ -178,7 +178,7 @@ Qwen2.5-Omni on vllm-ascend has been tested on AISBench.
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen2.5-Omni-7B` with `vllm-ascend:0.11.0rc0` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen2.5-Omni-7B` with `vllm-ascend:v0.11.0rc0` for reference only.
| dataset | platform | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
@@ -127,7 +127,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
- [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) originally only supports 40960 context(max_position_embeddings). If you want to use it and its related quantization weights to run long seqs (such as 128k context), it is required to use yarn rope-scaling technique.
- For vLLM version same as or new than `v0.12.0`, use parameter: `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \`.
-- For vllm version below `v0.12.0`, use parameter: `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
+- For vllm version below `v0.12.0`, use parameter: `--rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \`.
If you are using weights like [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) which originally supports long contexts, there is no need to add this parameter.
The parameters are explained as follows:
@@ -150,7 +150,7 @@ The parameters are explained as follows:
### Multi-node Deployment with MP (Recommended)
-Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes.
+Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to deploy the `Qwen3-235B-A22B-Instruct` model across multiple nodes.
Node 0
@@ -282,7 +282,7 @@ Here are two accuracy evaluation methods.
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in `vllm-ascend:0.11.0rc0` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in `vllm-ascend:v0.11.0rc0` for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
@@ -310,7 +310,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
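With `--save-result`, the benchmark writes a JSON summary into the directory given by `--result-dir`; a minimal sketch for finding and pretty-printing it could be the following (the generated file name is not fixed, so the newest JSON file is picked up).

```shell
# Sketch only: the result file name is generated by vllm bench, so pick the newest JSON file.
ls -t ./*.json
python3 -m json.tool "$(ls -t ./*.json | head -1)"
```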
@@ -589,7 +589,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
PD proxy:
```shell
-python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-port 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
+python load_balance_proxy_server_example.py --port 12347 --prefiller-hosts prefill_node_1_ip --prefiller-ports 8000 --decoder-hosts decode_node_1_ip --decoder-ports 8000
```
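Once the proxy is up, a quick smoke test might be a completion request against the proxy port; the served model name below is an assumption and should match whatever the prefill/decode instances actually serve.

```shell
# Sketch only: the served model name is an assumption; 12347 is the proxy port above.
curl http://localhost:12347/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "vllm-ascend/Qwen3-235B-A22B-w8a8", "prompt": "what is large language model?", "max_tokens": 32}'
```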
Benchmark scripts:
@@ -108,10 +108,10 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "qwen3-32b-w4a4",
"prompt": "what is large language model?",
-"max_completion_tokens": "128",
-"top_p": "0.95",
-"top_k": "40",
-"temperature": "0.0"
+"max_completion_tokens": 128,
+"top_p": 0.95,
+"top_k": 40,
+"temperature": 0
}'
```
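Putting the corrected body together, the full request might look like the following sketch; the endpoint, model name, prompt, and sampling fields are taken from the hunk above, and the `Content-Type` header is a standard addition not shown in the hunk context.

```shell
# Sketch assembling the corrected request body from the hunk above:
# numeric sampling parameters are sent as JSON numbers, not strings.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-32b-w4a4",
        "prompt": "what is large language model?",
        "max_completion_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0
    }'
```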
@@ -106,10 +106,10 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "qwen3-8b-w4a8",
"prompt": "what is large language model?",
-"max_completion_tokens": "128",
-"top_p": "0.95",
-"top_k": "40",
-"temperature": "0.0"
+"max_completion_tokens": 128,
+"top_p": 0.95,
+"top_k": 40,
+"temperature": 0
}'
```
@@ -92,7 +92,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen3-Coder-30B-A3B-Instruct` in `vllm-ascend:0.11.0rc0` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen3-Coder-30B-A3B-Instruct` in `vllm-ascend:v0.11.0rc0` for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
@@ -71,7 +71,7 @@ Start the server with the following command:
vllm serve Qwen/Qwen3-VL-Reranker-8B \
--runner pooling \
--max-model-len 4096 \
---hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
+--hf-overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
--chat-template ./qwen3_vl_reranker.jinja
```
@@ -35,7 +35,7 @@ Using the Qwen3-Reranker-8B model as an example, first run the docker container
### Online Inference
```bash
-vllm serve Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 8888 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
+vllm serve Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 8888 --hf-overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
```
Once your server is started, you can send request with follow examples.
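For illustration, one such request against the reranker could look like the sketch below; the `/rerank` endpoint path and payload shape are assumptions based on vLLM's rerank API, and the query and documents are made-up placeholders.

```shell
# Sketch only: the /rerank endpoint and payload shape are assumptions based on
# vLLM's rerank API; query and documents are placeholders.
curl http://127.0.0.1:8888/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Reranker-8B",
        "query": "What is the capital of France?",
        "documents": ["Paris is the capital of France.", "The Great Wall is in China."]
    }'
```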