# Using EvalScope

This document guides you through model inference stress testing and accuracy testing with [EvalScope](https://github.com/modelscope/evalscope).

## 1. Online serving

Run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
   :substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
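
Before starting the container, it can help to confirm that the device nodes passed with `--device` exist on the host. The following is a minimal sketch, not part of vllm-ascend: `check_devices` is a hypothetical helper name, and the paths are the ones used in the command above.

```shell
# Hypothetical helper: report every device node that is missing on the host.
# Returns 0 when all paths exist and 1 otherwise.
check_devices() {
    missing=0
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "missing device node: $dev" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Check the NPU selected via DEVICE plus the management devices from the docker command.
check_devices "${DEVICE:-/dev/davinci7}" /dev/davinci_manager /dev/devmm_svm /dev/hisi_hdc \
    || echo "Fix the device paths before running docker run." >&2
```

If a path is reported as missing, update `DEVICE` to one of the `/dev/davinci[0-7]` nodes that is present on your machine.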

If the service starts successfully, you will see output like the following:

```
INFO:     Started server process [6873]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Once the server is running, you can query the model with an input prompt from a new terminal:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
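
Model loading can take a while, so scripts that run an evaluation right after the container starts may want to wait until the server answers. The sketch below polls the OpenAI-compatible `/v1/models` endpoint with `curl`. The helper name `wait_for_ready`, the retry count, and the interval are illustrative assumptions rather than anything provided by vLLM or EvalScope.

```shell
# Hypothetical helper: poll a URL until it returns a successful HTTP status.
# Usage: wait_for_ready URL ATTEMPTS INTERVAL_SECONDS
wait_for_ready() {
    url="$1"
    attempts="$2"
    interval="$3"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: silent, -f: treat HTTP errors as failures, --max-time: avoid hanging forever
        if curl -sf --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Example (assumed values): wait up to 60 * 5 seconds for the local server.
# wait_for_ready "http://localhost:8000/v1/models" 60 5 && echo "server is ready"
```

The function returns a non-zero status when the server never responds, so it can gate later commands with `&&`.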

## 2. Install EvalScope using pip

Install EvalScope in a Python virtual environment:

```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```

## 3. Run gsm8k accuracy test using EvalScope

Use `evalscope eval` to run the gsm8k accuracy test:

```bash
evalscope eval \
 --model Qwen/Qwen2.5-7B-Instruct \
 --api-url http://localhost:8000/v1 \
 --api-key EMPTY \
 --eval-type service \
 --datasets gsm8k \
 --limit 10
```

After 1-2 minutes, the output looks like this:

```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```

For more details, see [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).

## 4. Run model inference stress testing using EvalScope

### Install EvalScope[perf] using pip

```shell
pip install 'evalscope[perf]' -U
```

### Basic usage

Use `evalscope perf` to run the performance test:

```bash
evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel 5 \
    --model Qwen/Qwen2.5-7B-Instruct \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream
```
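
To see how the server behaves under different loads, the same command can be repeated at several concurrency levels. The sketch below is one possible way to do that. It uses only the flags from the command above, keeps the request count at four times the concurrency (the same 5-to-20 ratio), and prints the commands instead of running them unless `DRY_RUN=0` is set. `run_or_print`, `perf_sweep`, and `DRY_RUN` are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run_or_print() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run (or print) one `evalscope perf` invocation per concurrency level.
perf_sweep() {
    for level in "$@"; do
        run_or_print evalscope perf \
            --url "http://localhost:8000/v1/chat/completions" \
            --parallel "$level" \
            --model Qwen/Qwen2.5-7B-Instruct \
            --number "$((level * 4))" \
            --api openai \
            --dataset openqa \
            --stream
    done
}

# Dry run by default: prints three commands for concurrency 1, 5, and 10.
perf_sweep 1 5 10
```

Set `DRY_RUN=0` once the printed commands look right; each run then produces its own summary like the one in the next section.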

### Output results

After 1-2 minutes, the output looks like this:

```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key                               | Value                                                         |
+===================================+===============================================================+
| Time taken for tests (s)          | 38.3744                                                       |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency             | 5                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Total requests                    | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests                  | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests                   | 0                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s)   | 132.6926                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s)    | 158.8819                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s)        | 0.5212                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s)               | 8.3612                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s)   | 0.1035                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request  | 50.25                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s)       | 0.0324                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request       | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests       | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+

Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
|    10%     |  0.0962  |  0.031  |   4.4571    |      42      |      135      |       29.9767        |
|    25%     |  0.0971  | 0.0318  |   6.3509    |      47      |      193      |       30.2157        |
|    50%     |  0.0987  | 0.0321  |   9.3387    |      49      |      285      |       30.3969        |
|    66%     |  0.1017  | 0.0324  |   9.8519    |      52      |      302      |       30.5182        |
|    75%     |  0.107   | 0.0328  |   10.2391   |      55      |      313      |       30.6124        |
|    80%     |  0.1221  | 0.0329  |   10.8257   |      58      |      330      |       30.6759        |
|    90%     |  0.1245  | 0.0333  |   13.0472   |      62      |      404      |       30.9644        |
|    95%     |  0.1247  | 0.0336  |   14.2936   |      66      |      432      |       31.6691        |
|    98%     |  0.1247  | 0.0353  |   14.2936   |      66      |      432      |       31.6691        |
|    99%     |  0.1247  | 0.0627  |   14.2936   |      66      |      432      |       31.6691        |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
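
The throughput rows in the summary are consistent with dividing totals by the elapsed time. The sketch below recomputes them from the other rows as a sanity check. It assumes, based only on the numbers above, that total tokens equal the per-request averages multiplied by the request count. `per_second` is a hypothetical helper name.

```shell
# Hypothetical helper: divide a total by the elapsed seconds, two decimals.
per_second() {
    awk -v total="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", total / seconds }'
}

elapsed=38.3744   # Time taken for tests (s)
requests=20       # Total requests
avg_in=50.25      # Average input tokens per request
avg_out=254.6     # Average output tokens per request

# Total token counts, reconstructed from the per-request averages.
out_tokens=$(awk -v r="$requests" -v o="$avg_out" 'BEGIN { print r * o }')
all_tokens=$(awk -v r="$requests" -v i="$avg_in" -v o="$avg_out" 'BEGIN { print r * (i + o) }')

echo "request throughput (req/s):      $(per_second "$requests" "$elapsed")"
echo "output token throughput (tok/s): $(per_second "$out_tokens" "$elapsed")"
echo "total token throughput (tok/s):  $(per_second "$all_tokens" "$elapsed")"
```

This prints 0.52, 132.69, and 158.88, which agrees with the reported 0.5212 req/s, 132.6926 tok/s, and 158.8819 tok/s to two decimal places.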

For more details, see [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).