2025-04-23 00:50:09 +08:00
# Using EvalScope
2026-02-27 11:50:27 +08:00
This document will guide you through model inference stress testing and accuracy testing using [EvalScope ](https://github.com/modelscope/evalscope ).
2025-04-23 00:50:09 +08:00
2025-10-29 11:03:39 +08:00
## 1. Online server
2025-04-23 00:50:09 +08:00
You can run docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
2025-10-23 11:17:26 +08:00
--shm-size=1g \
2025-04-23 00:50:09 +08:00
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
2025-10-29 11:03:39 +08:00
If the vLLM server is started successfully, you can see information shown below:
2025-04-23 00:50:09 +08:00
2026-01-15 09:06:01 +08:00
```shell
2025-04-23 00:50:09 +08:00
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
2025-10-29 11:03:39 +08:00
Once your server is started, you can query the model with input prompts in a new terminal:
2025-04-23 00:50:09 +08:00
2026-01-15 09:06:01 +08:00
```shell
2025-04-23 00:50:09 +08:00
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
2026-01-26 11:57:40 +08:00
"max_completion_tokens": 7,
2025-04-23 00:50:09 +08:00
"temperature": 0
}'
```
## 2. Install EvalScope using pip
2025-10-29 11:03:39 +08:00
You can install EvalScope as follows:
2025-04-23 00:50:09 +08:00
```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
2025-10-29 11:03:39 +08:00
## 3. Run GSM8K using EvalScope for accuracy testing
2025-04-23 00:50:09 +08:00
2025-10-29 11:03:39 +08:00
You can use `evalscope eval` to run GSM8K for accuracy testing:
2025-07-25 22:16:10 +08:00
2026-01-15 09:06:01 +08:00
```shell
2025-04-23 00:50:09 +08:00
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
--api-url http://localhost:8000/v1 \
--api-key EMPTY \
2026-01-05 11:17:39 +08:00
--eval-type server \
2025-04-23 00:50:09 +08:00
--datasets gsm8k \
--limit 10
```
2025-10-29 11:03:39 +08:00
After 1 to 2 minutes, the output is shown below:
2025-04-23 00:50:09 +08:00
```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k | AverageAccuracy | main | 10 | 0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)
What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.
Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.
How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
2026-04-09 15:37:57 +08:00
See more details in [EvalScope doc - Model API Service Evaluation ](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation ).
2025-04-23 00:50:09 +08:00
## 4. Run model inference stress testing using EvalScope
### Install EvalScope[perf] using pip
```shell
pip install evalscope[perf] -U
```
### Basic usage
2025-10-29 11:03:39 +08:00
You can use `evalscope perf` to run perf testing:
2025-07-25 22:16:10 +08:00
2026-01-15 09:06:01 +08:00
```shell
2025-04-23 00:50:09 +08:00
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
--parallel 5 \
--model Qwen/Qwen2.5-7B-Instruct \
--number 20 \
--api openai \
--dataset openqa \
--stream
```
### Output results
2025-10-29 11:03:39 +08:00
After 1 to 2 minutes, the output is shown below:
2025-04-23 00:50:09 +08:00
```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key | Value |
+===================================+===============================================================+
| Time taken for tests (s) | 38.3744 |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency | 5 |
+-----------------------------------+---------------------------------------------------------------+
| Total requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests | 0 |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s) | 132.6926 |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s) | 158.8819 |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s) | 0.5212 |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s) | 8.3612 |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s) | 0.1035 |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329 |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request | 50.25 |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6 |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s) | 0.0324 |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request | 254.6 |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+
Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| 10% | 0.0962 | 0.031 | 4.4571 | 42 | 135 | 29.9767 |
| 25% | 0.0971 | 0.0318 | 6.3509 | 47 | 193 | 30.2157 |
| 50% | 0.0987 | 0.0321 | 9.3387 | 49 | 285 | 30.3969 |
| 66% | 0.1017 | 0.0324 | 9.8519 | 52 | 302 | 30.5182 |
| 75% | 0.107 | 0.0328 | 10.2391 | 55 | 313 | 30.6124 |
| 80% | 0.1221 | 0.0329 | 10.8257 | 58 | 330 | 30.6759 |
| 90% | 0.1245 | 0.0333 | 13.0472 | 62 | 404 | 30.9644 |
| 95% | 0.1247 | 0.0336 | 14.2936 | 66 | 432 | 31.6691 |
| 98% | 0.1247 | 0.0353 | 14.2936 | 66 | 432 | 31.6691 |
| 99% | 0.1247 | 0.0627 | 14.2936 | 66 | 432 | 31.6691 |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
2025-10-29 11:03:39 +08:00
See more detail in [EvalScope doc - Model Inference Stress Testing ](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage ).