
# Qwen2.5-7B
## Introduction
Qwen2.5-7B-Instruct is the flagship instruction-tuned variant of Alibaba Cloud's Qwen2.5 LLM series. It supports a context window of up to 128K tokens, generates up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling.
This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation.
The `Qwen2.5-7B-Instruct` model has been supported since `vllm-ascend:v0.9.0`.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
## Environment Preparation
### Model Weight
- `Qwen2.5-7B-Instruct` (BF16 version): requires one Atlas 910B4 (32 GB × 1) card. [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
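If you have not downloaded the weights yet, the ModelScope CLI is one option. The command below is a minimal sketch and assumes a recent `modelscope` package; flag names may differ across versions:
```shell
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
```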
### Installation
You can use our official Docker image and install any extra operators required to support `Qwen2.5-7B-Instruct`.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
1. Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
::::{tab-item} A2 series
:sync: A2
Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
:::::
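Once inside the container, you can optionally confirm that the NPU devices mapped above are visible before proceeding:
```shell
npu-smi info
```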
## Deployment
### Single-node Deployment
Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service:
1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory.
2. Create and execute the deployment script (save as `deploy.sh`):
```shell
#!/bin/sh
export ASCEND_RT_VISIBLE_DEVICES=0
export MODEL_PATH="./Qwen2.5-7B-Instruct/"  # local weight directory prepared in step 1
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen-2.5-7b-instruct \
--trust-remote-code \
--max-model-len 32768
```
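The snippet below shows one way to launch the script and wait for the service to come up; the readiness check simply polls the OpenAI-compatible `/v1/models` endpoint and is an illustrative assumption rather than part of the deployment script:
```shell
# Launch the service in the background and poll until it is ready.
bash deploy.sh &
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  sleep 5
done
echo "vLLM service is ready"
```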
### Multi-node Deployment
Not required for this model; single-node deployment is recommended.
### Prefill-Decode Disaggregation
Not supported yet.
## Functional Verification
After starting the service, verify functionality using a `curl` request:
```shell
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b-instruct",
"prompt": "Beijing is a",
"max_completion_tokens": 5,
"temperature": 0
}'
```
A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment.
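Because `vllm serve` exposes the full OpenAI-compatible API, you can also verify the chat endpoint. The request below is a minimal sketch that reuses the served model name configured in `deploy.sh`:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b-instruct",
    "messages": [{"role": "user", "content": "Introduce Beijing in one sentence."}],
    "max_tokens": 64,
    "temperature": 0
  }'
```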
## Accuracy Evaluation
### Using AISBench
Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- |--------------|
| gsm8k | - | accuracy | gen | 75.00 |
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
### Using vLLM Benchmark
The following runs a performance evaluation using `Qwen2.5-7B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
Take `serve` as an example and run the following command.
```shell
vllm bench serve \
--model ./Qwen2.5-7B-Instruct/ \
--dataset-name random \
--random-input-len 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result \
--result-dir ./perf_results/
```
After a few minutes, the performance evaluation results are saved to `./perf_results/`.
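If you also want an offline throughput number, a `throughput` run could look like the sketch below; the flag names are taken from the upstream vLLM benchmark documentation and may vary between versions:
```shell
vllm bench throughput \
  --model ./Qwen2.5-7B-Instruct/ \
  --input-len 200 \
  --output-len 200 \
  --num-prompts 200
```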