[Docs] Fix typos across documentation (#6728)
## Summary
Fix typos and improve grammar consistency across 50 documentation files.
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" → "determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded", "re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of `--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" → "Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea
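
For reference, a corrected benchmark invocation from the affected guides looks like the sketch below (illustrative only; the model path is a placeholder, not a file from this commit):

```shell
# The benchmark flag is --num-prompts (plural); --num-prompt is not a recognized option.
vllm bench serve --model <your-model-path> --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```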
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@@ -39,7 +39,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
 export NAME=vllm-ascend

 # Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
 docker run --rm \
 --name $NAME \
 --net=host \
@@ -70,7 +70,7 @@ If you want to deploy multi-node environment, you need to set up environment on

 ## Deployment

-### Service-oriented Deployment
+### Service-oriented Deployment

 - `DeepSeek-R1-W8A8`: require 1 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8).

@@ -303,7 +303,7 @@ Take the `serve` as an example. Run the code as follows.

 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.

@@ -10,7 +10,7 @@ DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinkin

 - Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.

-The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`
+The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`.

 This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.

@@ -30,7 +30,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
 - `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.

-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

 ### Verify Multi-node Communication(Optional)

@@ -52,7 +52,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
 export NAME=vllm-ascend

 # Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
 docker run --rm \
 --name $NAME \
 --net=host \
@@ -580,7 +580,7 @@ The parameters are explained as follows:
 - `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
 - `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which is used to reduce the computational load of each card on the LMHead embedding layer.

-6. run server for each node
+6. run server for each node:

 ```shell
 # p0
@@ -593,7 +593,7 @@ python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank
 python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
 ```

-7. Run proxy `proxy.sh` scripts on the prefill master node
+7. Run the `proxy.sh` script on the prefill master node

 Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

@@ -716,7 +716,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.

 ```shell
-vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.

@@ -21,7 +21,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `DeepSeek-V3.2`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. Model weight in BF16 not found now.
 - `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)

-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

 ### Verify Multi-node Communication(Optional)

@@ -390,7 +390,7 @@ We'd like to show the deployment guide of `DeepSeek-V3.2` on multi-node environm

 Before you start, please

-1. prepare the script `launch_online_dp.py` on each node.
+1. prepare the script `launch_online_dp.py` on each node:

 ```python
 import argparse
@@ -905,7 +905,7 @@ Take the `serve` as an example. Run the code as follows.

 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 ## Function Call

@@ -2,9 +2,9 @@

 ## Introduction

-GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundational models specifically designed for agent applications
+GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundational models specifically designed for agent applications.

-The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`
+The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`.

 This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.

@@ -25,7 +25,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm do not support GLM4.6 mtp in October, so we do not provide mtp version. And last month, it supported, you can use the following quantization scheme to add mtp weights to Quantized weights.
 - `Method of Quantify`: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use these methods to quantify the model.

-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

 ### Installation

@@ -43,7 +43,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
 export NAME=vllm-ascend

 # Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
 docker run --rm \
 --name $NAME \
 --net=host \

@@ -78,7 +78,7 @@ Single-node deployment is recommended.

 ### Prefill-Decode Disaggregation

-Not supported yet
+Not supported yet.

 ## Functional Verification

@@ -190,7 +190,7 @@ pip install safetensors
 ```

 :::{note}
-The OpenCV component may be missing:
+The OpenCV component may be missing:

 ```bash
 apt-get update

@@ -563,7 +563,7 @@ The performance evaluation must be conducted in an online mode. Take the `serve`
 :sync: single

 ```shell
-vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 ::::
@@ -571,7 +571,7 @@ vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --rand
 :sync: multi

 ```shell
-vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 ::::

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

 ### Model Weight

-- `Qwen2.5-7B-Instruct`(BF16 version): require 1 910B4 cards(32G × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
+- `Qwen2.5-7B-Instruct`(BF16 version): require 1 910B4 cards(32G × 1) [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)

 It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.

@@ -171,7 +171,7 @@ vllm bench serve \
 --model ./Qwen2.5-7B-Instruct/ \
 --dataset-name random \
 --random-input 200 \
---num-prompt 200 \
+--num-prompts 200 \
 --request-rate 1 \
 --save-result \
 --result-dir ./perf_results/

@@ -95,7 +95,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"

 :::

-`--allowed-local-media-path` is optional, only set it if you need infer model with local media file
+`--allowed-local-media-path` is optional, only set it if you need infer model with local media file.

 `--gpu-memory-utilization` should not be set manually only if you know what this parameter aims to.

@@ -118,11 +118,11 @@ vllm serve ${MODEL_PATH}\
 --no-enable-prefix-caching
 ```

-`--tensor_parallel_size` no need to set for this 7B model, but if you really need tensor parallel, tp size can be one of `1\2\4`
+`--tensor_parallel_size` no need to set for this 7B model, but if you really need tensor parallel, tp size can be one of `1/2/4`.

 ### Prefill-Decode Disaggregation

-Not supported yet
+Not supported yet.

 ## Functional Verification

@@ -145,7 +145,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j
 "content": [
 {
 "type": "text",
-"text": "What is the text in the illustrate?"
+"text": "What is the text in the illustration?"
 },
 {
 "type": "image_url",
@@ -170,7 +170,7 @@ If you query the server successfully, you can see the info shown below (client):

 ## Accuracy Evaluation

-Qwen2.5-Omni on vllm-ascend has been test on AISBench.
+Qwen2.5-Omni on vllm-ascend has been tested on AISBench.

 ### Using AISBench

@@ -204,7 +204,7 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.

 ```shell
-vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.

@@ -18,10 +18,10 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

 ### Model Weight

-- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
-- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
+- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
+- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)

-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

 ### Verify Multi-node Communication(Optional)

@@ -46,7 +46,7 @@ Select an image based on your machine type and start the docker image on your no
 export NAME=vllm-ascend

 # Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
 docker run --rm \
 --name $NAME \
 --net=host \
@@ -87,7 +87,7 @@ If you want to deploy multi-node environment, you need to set up environment on

 ### Single-node Deployment

-`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A3(64G*16)、 1 Atlas 800 A2(64G*8).
+`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A3(64G*16), 1 Atlas 800 A2(64G*8).
 Quantized version need to start with parameter `--quantization ascend`.

 Run the following script to execute online 128k inference.
@@ -310,7 +310,7 @@ Take the `serve` as an example. Run the code as follows.

 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.
@@ -328,7 +328,7 @@ In this section, we provide simple scripts to re-produce our latest performance.
 - HDK/driver 25.3.RC1
 - triton_ascend 3.2.0

-### Single Node A3 (64G*16)
+### Single Node A3 (64G*16)

 Example server scripts:

@@ -394,7 +394,7 @@ Note:

 ### Three Node A3 -- PD disaggregation

-On three Atlas 800 A3(64G*16)server, we recommend to use one node as one prefill instance and two nodes as one decode instance. Example server scripts:
+On three Atlas 800 A3(64G*16) server, we recommend to use one node as one prefill instance and two nodes as one decode instance. Example server scripts:
 Prefill Node 1

 ```shell

@@ -304,7 +304,7 @@ Take the `serve` as an example. Run the code as follows.
 - /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.

 ```shell
-vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.
@@ -389,4 +389,4 @@ If this list is not manually specified, it will be filled with a series of evenl

 Therefore, like the above real-world scenario, when adjusting the benchmark request concurrency, we always ensure that the concurrency is actually included in the cudagraph_capture_sizes list. This way, during the decode phase, padding operations are essentially avoided, ensuring the reliability of the experimental data.

-It’s important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, I recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.
+It's important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, I recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.

@@ -164,7 +164,7 @@ Take the `serve` as an example. Run the code as follows.

 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.

@@ -285,7 +285,7 @@ python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-s
 pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 pip install -r vllm-ascend/benchmarks/requirements-bench.txt

-vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After execution, you can get the result, here is the result of `Qwen3-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only.

@@ -270,7 +270,7 @@ Take the `serve` as an example. Run the code as follows.

 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```

 After about several minutes, you can get the performance evaluation result.
