Using EvalScope evaluation (#611)

### What this PR does / why we need it?

Use EvalScope to run evaluations (including accuracy evaluation and stress testing):

- https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage
- https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested locally

---------

Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-04-23 00:50:09 +08:00
parent 4a0ce3660e
commit 848e041a54
2 changed files with 175 additions and 1 deletions
--- a/docs/source/developer_guide/evaluation/index.md
+++ b/docs/source/developer_guide/evaluation/index.md
@@ -5,4 +5,5 @@
 :maxdepth: 1
 using_opencompass
 using_lm_eval
-:::
+using_evalscope
+:::
--- a/docs/source/developer_guide/evaluation/using_evalscope.md
+++ b/docs/source/developer_guide/evaluation/using_evalscope.md
@@ -0,0 +1,173 @@
+# Using EvalScope
+
+This document guides you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
+
+## 1. Online serving
+
+You can run a Docker container to start the vLLM server on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+```
+
+If your service starts successfully, you will see output like the following:
+
+```
+INFO:     Started server process [6873]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+```
+
+Once your server is started, you can query the model with input prompts in a new terminal:
+
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "The future of AI is",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+## 2. Install EvalScope using pip
+
+You can install EvalScope in a dedicated virtual environment:
+
+```bash
+python3 -m venv .venv-evalscope
+source .venv-evalscope/bin/activate
+pip install gradio plotly evalscope
+```
+
+## 3. Run gsm8k accuracy test using EvalScope
+
+You can use `evalscope eval` to run the gsm8k accuracy test:
+```
+evalscope eval \
+ --model Qwen/Qwen2.5-7B-Instruct \
+ --api-url http://localhost:8000/v1 \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+```
+
+After 1-2 minutes, the output looks like this:
+
+```shell
++---------------------+-----------+-----------------+----------+-------+---------+---------+
+| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
++=====================+===========+=================+==========+=======+=========+=========+
+| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
++---------------------+-----------+-----------------+----------+-------+---------+---------+
+```
+
+For more details, see [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
+
+## 4. Run model inference stress testing using EvalScope
+
+### Install EvalScope[perf] using pip
+
+```shell
+pip install 'evalscope[perf]' -U
+```
+
+### Basic usage
+
+You can use `evalscope perf` to run a performance test:
+```
+evalscope perf \
+    --url "http://localhost:8000/v1/chat/completions" \
+    --parallel 5 \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --number 20 \
+    --api openai \
+    --dataset openqa \
+    --stream
+```
+
+### Output results
+
+After 1-2 minutes, the output looks like this:
+
+```shell
+Benchmarking summary:
++-----------------------------------+---------------------------------------------------------------+
+| Key                               | Value                                                         |
++===================================+===============================================================+
+| Time taken for tests (s)          | 38.3744                                                       |
++-----------------------------------+---------------------------------------------------------------+
+| Number of concurrency             | 5                                                             |
++-----------------------------------+---------------------------------------------------------------+
+| Total requests                    | 20                                                            |
++-----------------------------------+---------------------------------------------------------------+
+| Succeed requests                  | 20                                                            |
++-----------------------------------+---------------------------------------------------------------+
+| Failed requests                   | 0                                                             |
++-----------------------------------+---------------------------------------------------------------+
+| Output token throughput (tok/s)   | 132.6926                                                      |
++-----------------------------------+---------------------------------------------------------------+
+| Total token throughput (tok/s)    | 158.8819                                                      |
++-----------------------------------+---------------------------------------------------------------+
+| Request throughput (req/s)        | 0.5212                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average latency (s)               | 8.3612                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average time to first token (s)   | 0.1035                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average time per output token (s) | 0.0329                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average input tokens per request  | 50.25                                                         |
++-----------------------------------+---------------------------------------------------------------+
+| Average output tokens per request | 254.6                                                         |
++-----------------------------------+---------------------------------------------------------------+
+| Average package latency (s)       | 0.0324                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average package per request       | 254.6                                                         |
++-----------------------------------+---------------------------------------------------------------+
+| Expected number of requests       | 20                                                            |
++-----------------------------------+---------------------------------------------------------------+
+| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
++-----------------------------------+---------------------------------------------------------------+
+
+Percentile results:
++------------+----------+---------+-------------+--------------+---------------+----------------------+
+| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
++------------+----------+---------+-------------+--------------+---------------+----------------------+
+|    10%     |  0.0962  |  0.031  |   4.4571    |      42      |      135      |       29.9767        |
+|    25%     |  0.0971  | 0.0318  |   6.3509    |      47      |      193      |       30.2157        |
+|    50%     |  0.0987  | 0.0321  |   9.3387    |      49      |      285      |       30.3969        |
+|    66%     |  0.1017  | 0.0324  |   9.8519    |      52      |      302      |       30.5182        |
+|    75%     |  0.107   | 0.0328  |   10.2391   |      55      |      313      |       30.6124        |
+|    80%     |  0.1221  | 0.0329  |   10.8257   |      58      |      330      |       30.6759        |
+|    90%     |  0.1245  | 0.0333  |   13.0472   |      62      |      404      |       30.9644        |
+|    95%     |  0.1247  | 0.0336  |   14.2936   |      66      |      432      |       31.6691        |
+|    98%     |  0.1247  | 0.0353  |   14.2936   |      66      |      432      |       31.6691        |
+|    99%     |  0.1247  | 0.0627  |   14.2936   |      66      |      432      |       31.6691        |
++------------+----------+---------+-------------+--------------+---------------+----------------------+
+```
+
+For more details, see [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
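
A note on reading the benchmarking summary in the added document (this sketch is not part of the patch): the three throughput rows appear to be simple totals divided by the wall-clock test time. The Python snippet below recomputes them from the other rows of the sample table. The formulas are an assumption inferred from the sample numbers, not taken from EvalScope's source, and `per_second` is a hypothetical helper introduced only for this illustration.

```python
# Figures copied from the "Benchmarking summary" table in the diff above.
elapsed_s = 38.3744        # Time taken for tests (s)
total_requests = 20        # Total requests
avg_input_tokens = 50.25   # Average input tokens per request
avg_output_tokens = 254.6  # Average output tokens per request


def per_second(count: float, seconds: float) -> float:
    """Hypothetical helper: turn a total count into a per-second rate."""
    return count / seconds


# Assumed relationships (inferred from the sample output, not from EvalScope):
#   request throughput      = total requests / elapsed time
#   output token throughput = total output tokens / elapsed time
#   total token throughput  = (input + output tokens) / elapsed time
request_throughput = per_second(total_requests, elapsed_s)
output_token_throughput = per_second(total_requests * avg_output_tokens, elapsed_s)
total_token_throughput = per_second(
    total_requests * (avg_input_tokens + avg_output_tokens), elapsed_s
)

print(f"Request throughput (req/s):      {request_throughput:.4f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.4f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.4f}")
```

Under these assumptions the recomputed values agree with the reported 0.5212 req/s, 132.6926 tok/s, and 158.8819 tok/s to within rounding of the displayed inputs, which is a quick way to check that a run's summary is internally consistent.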