v0.10.1rc1

2025-09-09 09:40:35 +08:00
parent d6f6ef41fe
commit 9149384e03
432 changed files with 84698 additions and 1 deletions
--- a/docs/source/developer_guide/evaluation/accuracy_report/index.md
+++ b/docs/source/developer_guide/evaluation/accuracy_report/index.md
@@ -0,0 +1,6 @@
+# Accuracy Report
+
+:::{toctree}
+:caption: Accuracy Report
+:maxdepth: 1
+:::
--- a/docs/source/developer_guide/evaluation/index.md
+++ b/docs/source/developer_guide/evaluation/index.md
@@ -0,0 +1,10 @@
+# Accuracy
+
+:::{toctree}
+:caption: Accuracy
+:maxdepth: 1
+using_evalscope
+using_lm_eval
+using_opencompass
+accuracy_report/index
+:::
--- a/docs/source/developer_guide/evaluation/using_evalscope.md
+++ b/docs/source/developer_guide/evaluation/using_evalscope.md
@@ -0,0 +1,175 @@
+# Using EvalScope
+
+This document guides you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
+
+## 1. Online serving
+
+You can run a Docker container to start the vLLM server on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+```
+
+If the service starts successfully, you will see output like the following:
+
+```
+INFO:     Started server process [6873]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+```
+
+Once the server is started, you can query the model with an input prompt from a new terminal:
+
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "The future of AI is",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+## 2. Install EvalScope using pip
+
+You can install EvalScope by using:
+
+```bash
+python3 -m venv .venv-evalscope
+source .venv-evalscope/bin/activate
+pip install gradio plotly evalscope
+```
+
+## 3. Run gsm8k accuracy test using EvalScope
+
+You can use `evalscope eval` to run the gsm8k accuracy test:
+
+```
+evalscope eval \
+ --model Qwen/Qwen2.5-7B-Instruct \
+ --api-url http://localhost:8000/v1 \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+```
+
+After 1-2 minutes, the output looks like this:
+
+```shell
++---------------------+-----------+-----------------+----------+-------+---------+---------+
+| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
++=====================+===========+=================+==========+=======+=========+=========+
+| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
++---------------------+-----------+-----------------+----------+-------+---------+---------+
+```
+
+For more details, see [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
+
+## 4. Run model inference stress testing using EvalScope
+
+### Install EvalScope[perf] using pip
+
+```shell
+pip install evalscope[perf] -U
+```
+
+### Basic usage
+
+You can use `evalscope perf` to run a performance test:
+
+```
+evalscope perf \
+    --url "http://localhost:8000/v1/chat/completions" \
+    --parallel 5 \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --number 20 \
+    --api openai \
+    --dataset openqa \
+    --stream
+```
+
+### Output results
+
+After 1-2 minutes, the output looks like this:
+
+```shell
+Benchmarking summary:
++-----------------------------------+---------------------------------------------------------------+
+| Key                               | Value                                                         |
++===================================+===============================================================+
+| Time taken for tests (s)          | 38.3744                                                       |
++-----------------------------------+---------------------------------------------------------------+
+| Number of concurrency             | 5                                                             |
++-----------------------------------+---------------------------------------------------------------+
+| Total requests                    | 20                                                            |
++-----------------------------------+---------------------------------------------------------------+
+| Succeed requests                  | 20                                                            |
++-----------------------------------+---------------------------------------------------------------+
+| Failed requests                   | 0                                                             |
++-----------------------------------+---------------------------------------------------------------+
+| Output token throughput (tok/s)   | 132.6926                                                      |
++-----------------------------------+---------------------------------------------------------------+
+| Total token throughput (tok/s)    | 158.8819                                                      |
++-----------------------------------+---------------------------------------------------------------+
+| Request throughput (req/s)        | 0.5212                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average latency (s)               | 8.3612                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average time to first token (s)   | 0.1035                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average time per output token (s) | 0.0329                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average input tokens per request  | 50.25                                                         |
++-----------------------------------+---------------------------------------------------------------+
+| Average output tokens per request | 254.6                                                         |
++-----------------------------------+---------------------------------------------------------------+
+| Average package latency (s)       | 0.0324                                                        |
++-----------------------------------+---------------------------------------------------------------+
+| Average package per request       | 254.6                                                         |
++-----------------------------------+---------------------------------------------------------------+
+| Expected number of requests       | 20                                                            |
++-----------------------------------+---------------------------------------------------------------+
+| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
++-----------------------------------+---------------------------------------------------------------+
+
+Percentile results:
++------------+----------+---------+-------------+--------------+---------------+----------------------+
+| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
++------------+----------+---------+-------------+--------------+---------------+----------------------+
+|    10%     |  0.0962  |  0.031  |   4.4571    |      42      |      135      |       29.9767        |
+|    25%     |  0.0971  | 0.0318  |   6.3509    |      47      |      193      |       30.2157        |
+|    50%     |  0.0987  | 0.0321  |   9.3387    |      49      |      285      |       30.3969        |
+|    66%     |  0.1017  | 0.0324  |   9.8519    |      52      |      302      |       30.5182        |
+|    75%     |  0.107   | 0.0328  |   10.2391   |      55      |      313      |       30.6124        |
+|    80%     |  0.1221  | 0.0329  |   10.8257   |      58      |      330      |       30.6759        |
+|    90%     |  0.1245  | 0.0333  |   13.0472   |      62      |      404      |       30.9644        |
+|    95%     |  0.1247  | 0.0336  |   14.2936   |      66      |      432      |       31.6691        |
+|    98%     |  0.1247  | 0.0353  |   14.2936   |      66      |      432      |       31.6691        |
+|    99%     |  0.1247  | 0.0627  |   14.2936   |      66      |      432      |       31.6691        |
++------------+----------+---------+-------------+--------------+---------------+----------------------+
+```
+
+For more details, see [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
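
As a cross-check on the benchmarking summary in `using_evalscope.md`, the throughput rows appear to follow from the request count, the average token counts, and the wall-clock time. The sketch below recomputes them from the sample output in that file. The formulas are inferred from the numbers in the table, not taken from EvalScope's source, and the variable names are illustrative.

```python
# Values copied from the sample "Benchmarking summary" table.
time_taken_s = 38.3744
total_requests = 20
avg_input_tokens = 50.25
avg_output_tokens = 254.6

# Assumed relationships (inferred from the table, not from EvalScope's code).
total_output_tokens = avg_output_tokens * total_requests
total_tokens = (avg_input_tokens + avg_output_tokens) * total_requests

output_throughput = total_output_tokens / time_taken_s  # reported: 132.6926 tok/s
total_throughput = total_tokens / time_taken_s          # reported: 158.8819 tok/s
request_throughput = total_requests / time_taken_s      # reported: 0.5212 req/s

print(f"output tok/s: {output_throughput:.2f}")
print(f"total tok/s:  {total_throughput:.2f}")
print(f"req/s:        {request_throughput:.2f}")
```

The recomputed values agree with the reported ones to within about 0.001; the small remainder is consistent with the table's inputs being rounded.
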
--- a/docs/source/developer_guide/evaluation/using_lm_eval.md
+++ b/docs/source/developer_guide/evaluation/using_lm_eval.md
@@ -0,0 +1,300 @@
+# Using lm-eval
+This document guides you through accuracy testing using [lm-eval][1].
+
+## Online Server
+### 1. Start the vLLM server
+You can run a Docker container to start the vLLM server on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+/bin/bash
+# Inside the container, start the vLLM server in the background
+vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
+```
+
+The vLLM server has started successfully if you see the following log:
+
+```
+INFO:     Started server process [9446]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+```
+
+### 2. Run gsm8k accuracy test using lm-eval
+
+Before running the evaluation, you can check that the server responds by querying it with an input prompt:
+
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-0.5B-Instruct",
+        "prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
+"<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
+"  Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
+"  Non-current assets: Net fixed assets 12 million yuan\n"\
+"  Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
+"  Non-current liabilities: Long-term loans 9 million yuan\n"\
+"  Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
+"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
+"Options:\n"\
+"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
+"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
+"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
+"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
+"<|im_start|>assistant\n"'",
+        "max_tokens": 1,
+        "temperature": 0,
+        "stop": ["<|im_end|>"]
+    }' | python3 -m json.tool
+```
+
+The output format matches the following:
+
+```
+{
+    "id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
+    "object": "text_completion",
+    "created": 1754475138,
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "choices": [
+        {
+            "index": 0,
+            "text": "A",
+            "logprobs": null,
+            "finish_reason": "length",
+            "stop_reason": null,
+            "prompt_logprobs": null
+        }
+    ],
+    "service_tier": null,
+    "system_fingerprint": null,
+    "usage": {
+        "prompt_tokens": 252,
+        "total_tokens": 253,
+        "completion_tokens": 1,
+        "prompt_tokens_details": null
+    },
+    "kv_transfer_params": null
+}
+```
+
+Install lm-eval in the container.
+
+```bash
+export HF_ENDPOINT="https://hf-mirror.com"
+pip install lm-eval[api]
+```
+
+Run the following command:
+
+```
+# Only test gsm8k dataset in this demo
+lm_eval \
+  --model local-completions \
+  --model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
+  --tasks gsm8k \
+  --output_path ./
+```
+
+After about 30 minutes, the output looks like this:
+
+```
+The markdown format results is as below:
+
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3215|±  |0.0129|
+|     |       |strict-match    |     5|exact_match|↑  |0.2077|±  |0.0112|
+
+```
+
+## Offline Server
+### 1. Run docker container
+
+You can run a Docker container on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+/bin/bash
+```
+
+### 2. Run gsm8k accuracy test using lm-eval
+Install lm-eval in the container.
+
+```bash
+export HF_ENDPOINT="https://hf-mirror.com"
+pip install lm-eval
+```
+
+Run the following command:
+
+```
+# Only test gsm8k dataset in this demo
+lm_eval \
+  --model vllm \
+  --model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
+  --tasks gsm8k \
+  --batch_size auto
+```
+
+After 1-2 minutes, the output looks like this:
+
+```
+The markdown format results is as below:
+
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3412|±  |0.0131|
+|     |       |strict-match    |     5|exact_match|↑  |0.3139|±  |0.0128|
+
+```
+
+## Use Offline Datasets
+
+The following uses gsm8k (a single dataset) and mmlu (a multi-subject dataset) as examples. For more, see the [lm-eval guide on using local datasets][2].
+
+```bash
+# set HF_DATASETS_OFFLINE when using offline datasets
+export HF_DATASETS_OFFLINE=1
+git clone https://github.com/EleutherAI/lm-evaluation-harness.git
+cd lm-evaluation-harness
+pip install -e .
+# The task configs live at the following paths, relative to the repository root.
+# gsm8k yaml path
+ls lm_eval/tasks/gsm8k
+# mmlu yaml path
+ls lm_eval/tasks/mmlu/default
+```
+
+Set [gsm8k.yaml][3] as follows:
+
+```yaml
+tag:
+  - math_word_problems
+task: gsm8k
+
+# set dataset_path arrow or json or parquet according to the downloaded dataset
+dataset_path: arrow
+
+# set dataset_name to null
+dataset_name: null
+output_type: generate_until
+
+# add dataset_kwargs 
+dataset_kwargs:
+  data_files:
+    # train and test data download path
+    train: /root/.cache/gsm8k/gsm8k-train.arrow
+    test: /root/.cache/gsm8k/gsm8k-test.arrow
+
+training_split: train
+fewshot_split: train
+test_split: test
+doc_to_text: 'Q: {{question}}
+  A(Please follow the summarize the result at the end with the format of "The answer is xxx", where xx is the result.):'
+doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: false
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+      - "(?s).*#### "
+      - "\\.$"
+generation_kwargs:
+  until:
+    - "Question:"
+    - "</s>"
+    - "<|im_end|>"
+  do_sample: false
+  temperature: 0.0
+repeats: 1
+num_fewshot: 5
+filter_list:
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
+      - function: "take_first"
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+      - function: "take_first"
+metadata:
+  version: 3.0
+```
+
+Set [_default_template_yaml][4] as follows:
+
+```yaml
+# set dataset_path according to the downloaded dataset
+dataset_path: /root/.cache/mmlu
+test_split: test
+fewshot_split: dev
+fewshot_config:
+  sampler: first_n
+output_type: multiple_choice
+doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
+doc_to_choice: ["A", "B", "C", "D"]
+doc_to_target: answer
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
+```
+
+For more usage, see the [lm-eval docs][5].
+
+[1]: https://github.com/EleutherAI/lm-evaluation-harness
+[2]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets
+[3]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
+[4]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml
+[5]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md
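
To make the two entries under `filter_list` in the `gsm8k.yaml` above more concrete, here is a minimal sketch of how such regex filters could pick an answer out of a model response. The two patterns are copied from the YAML (with the YAML escaping removed); the helper `apply_regex_filter`, the `FALLBACK` placeholder, and the sample responses are hypothetical and only approximate what lm-eval does internally.

```python
import re

# Patterns copied from the filter_list in gsm8k.yaml (YAML escapes removed).
STRICT_PATTERN = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE_PATTERN = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")

# Hypothetical placeholder returned when nothing matches.
FALLBACK = "[invalid]"


def apply_regex_filter(pattern, response, group_select=0):
    """Return the match at index group_select, or FALLBACK if there is none."""
    matches = pattern.findall(response)
    if not matches:
        return FALLBACK
    match = matches[group_select]
    # With several capture groups, findall yields tuples; keep the non-empty group.
    if isinstance(match, tuple):
        non_empty = [group for group in match if group]
        match = non_empty[0] if non_empty else FALLBACK
    return match.strip()


# Illustrative responses, not taken from the dataset.
with_marker = "She sold 48 clips in April and 24 in May.\n#### 72"
without_marker = "The answer is 18."

print(apply_regex_filter(STRICT_PATTERN, with_marker))         # strict-match
print(apply_regex_filter(FLEXIBLE_PATTERN, with_marker, -1))   # flexible-extract
print(apply_regex_filter(STRICT_PATTERN, without_marker))
print(apply_regex_filter(FLEXIBLE_PATTERN, without_marker, -1))
```

In this sketch the strict pattern only succeeds when the response contains the `#### ` marker, while the flexible pattern with `group_select: -1` takes the last number-like span. For the second response it returns `18.` with the trailing period; the `\.$` entry under `regexes_to_ignore` in `metric_list` is presumably what removes it before the exact-match comparison.
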
--- a/docs/source/developer_guide/evaluation/using_opencompass.md
+++ b/docs/source/developer_guide/evaluation/using_opencompass.md
@@ -0,0 +1,123 @@
+# Using OpenCompass
+This document guides you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
+
+## 1. Online Serving
+
+You can run a Docker container to start the vLLM server on a single NPU:
+
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+# Update the vllm-ascend image
+export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-p 8000:8000 \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
+```
+
+If the service starts successfully, you will see output like the following:
+
+```
+INFO:     Started server process [6873]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+```
+
+Once the server is started, you can query the model with an input prompt from a new terminal:
+
+```
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "The future of AI is",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+## 2. Run ceval accuracy test using OpenCompass
+Install OpenCompass and configure the environment variables in the container.
+
+```bash
+# Pin Python 3.10 due to:
+# https://github.com/open-compass/opencompass/issues/1976
+conda create -n opencompass python=3.10
+conda activate opencompass
+pip install opencompass modelscope[framework]
+export DATASET_SOURCE=ModelScope
+git clone https://github.com/open-compass/opencompass.git
+```
+
+Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:
+
+```python
+from mmengine.config import read_base
+from opencompass.models import OpenAISDK
+
+with read_base():
+    from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
+
+# Only test ceval-computer_network dataset in this demo
+datasets = ceval_datasets[:1]
+
+api_meta_template = dict(
+    round=[
+        dict(role='HUMAN', api_role='HUMAN'),
+        dict(role='BOT', api_role='BOT', generate=True),
+    ],
+    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
+)
+
+models = [
+    dict(
+        abbr='Qwen2.5-7B-Instruct-vLLM-API',
+        type=OpenAISDK,
+        key='EMPTY', # API key
+        openai_api_base='http://127.0.0.1:8000/v1', 
+        path='Qwen/Qwen2.5-7B-Instruct', 
+        tokenizer_path='Qwen/Qwen2.5-7B-Instruct', 
+        rpm_verbose=True, 
+        meta_template=api_meta_template,
+        query_per_second=1, 
+        max_out_len=1024, 
+        max_seq_len=4096, 
+        temperature=0.01, 
+        batch_size=8,
+        retry=3,
+    )
+]
+```
+
+Run the following command from the root of the cloned `opencompass` repository:
+
+```
+python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
+```
+
+After 1-2 minutes, the output looks like this:
+
+```
+The markdown format results is as below:
+
+| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
+|----- | ----- | ----- | ----- | -----|
+| ceval-computer_network | db9ce2 | accuracy | gen | 68.42 |
+```
+
+For more usage, see the [OpenCompass docs](https://opencompass.readthedocs.io/en/latest/index.html).
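
The `curl` check against `/v1/completions` in `using_opencompass.md` can also be issued from Python, which is convenient when the sanity check is part of a script. The sketch below uses only the standard library and mirrors the request body from the `curl` example; the helper names `build_completion_request` and `send` are hypothetical, and the response shape is assumed to match the `text_completion` JSON shown in the lm-eval guide.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0):
    """Build a POST request with the same JSON body as the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request(
    base_url="http://localhost:8000/v1",
    model="Qwen/Qwen2.5-7B-Instruct",
    prompt="The future of AI is",
)
print(request.full_url)
# With the server running: print(send(request)["choices"][0]["text"])
```

Building the request separately from sending it keeps the payload easy to inspect without a running server.
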