v0.10.1rc1

2025-09-09 09:40:35 +08:00
parent d6f6ef41fe
commit 9149384e03
432 changed files with 84698 additions and 1 deletions


@@ -0,0 +1,6 @@
# Accuracy Report
:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
:::


@@ -0,0 +1,10 @@
# Accuracy
:::{toctree}
:caption: Accuracy
:maxdepth: 1
using_evalscope
using_lm_eval
using_opencompass
accuracy_report/index
:::


@@ -0,0 +1,175 @@
# Using EvalScope
This document guides you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
## 1. Online serving
You can run a Docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If the service starts successfully, you will see output like the following:
```
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
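Instead of watching the log, readiness can also be polled from a script. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on the port mapped above; the helper names `http_probe` and `wait_until_ready` are illustrative and not part of vLLM. The probe is passed in as a callable so the retry loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe, attempts: int = 60, interval: float = 5.0,
                     sleep=time.sleep) -> bool:
    """Call `probe()` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


# Example (assumed URL, matching the -p 8000:8000 mapping above):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8000/v1/models"))
```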
Once the server has started, you can query the model with an input prompt from a new terminal:
```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
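The same request can be issued from Python, which is convenient when scripting checks around the evaluation. This is a minimal sketch using only the standard library; `build_completion_request` is a hypothetical helper, and the URL and model name are simply copied from the curl command above.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 7,
                             temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request mirroring the curl command above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000/v1", "Qwen/Qwen2.5-7B-Instruct", "The future of AI is"
)
print(request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (requires the server from step 1 to be running):
# with urllib.request.urlopen(request, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Building the request separately from sending it keeps the payload easy to inspect before any network traffic happens.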
## 2. Install EvalScope using pip
You can install EvalScope by using:
```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
## 3. Run gsm8k accuracy test using EvalScope
You can use `evalscope eval` to run the gsm8k accuracy test:
```
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
--api-url http://localhost:8000/v1 \
--api-key EMPTY \
--eval-type service \
--datasets gsm8k \
--limit 10
```
After 1-2 minutes, the output looks like the following:
```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k | AverageAccuracy | main | 10 | 0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
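With `--limit 10`, a score of 0.8 corresponds to 8 correct answers out of 10, so the estimate is coarse. As a rough illustration (a standard binomial approximation, not something EvalScope reports), the standard error for such a small sample can be computed as follows; it comes out at roughly 0.13, which suggests increasing `--limit` before comparing models.

```python
import math


def binomial_stderr(accuracy: float, num_samples: int) -> float:
    """Standard error of a proportion under a simple binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


score, num = 0.8, 10  # Score and Num from the table above
stderr = binomial_stderr(score, num)
print(f"correct answers: {round(score * num)} of {num}")
print(f"standard error:  {stderr:.4f}")
```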
For more details, see [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope
### Install EvalScope[perf] using pip
```shell
pip install evalscope[perf] -U
```
### Basic usage
You can use `evalscope perf` to run a performance test:
```
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
--parallel 5 \
--model Qwen/Qwen2.5-7B-Instruct \
--number 20 \
--api openai \
--dataset openqa \
--stream
```
### Output results
After 1-2 minutes, the output looks like the following:
```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key | Value |
+===================================+===============================================================+
| Time taken for tests (s) | 38.3744 |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency | 5 |
+-----------------------------------+---------------------------------------------------------------+
| Total requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests | 0 |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s) | 132.6926 |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s) | 158.8819 |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s) | 0.5212 |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s) | 8.3612 |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s) | 0.1035 |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329 |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request | 50.25 |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6 |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s) | 0.0324 |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request | 254.6 |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests | 20 |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+
Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| 10% | 0.0962 | 0.031 | 4.4571 | 42 | 135 | 29.9767 |
| 25% | 0.0971 | 0.0318 | 6.3509 | 47 | 193 | 30.2157 |
| 50% | 0.0987 | 0.0321 | 9.3387 | 49 | 285 | 30.3969 |
| 66% | 0.1017 | 0.0324 | 9.8519 | 52 | 302 | 30.5182 |
| 75% | 0.107 | 0.0328 | 10.2391 | 55 | 313 | 30.6124 |
| 80% | 0.1221 | 0.0329 | 10.8257 | 58 | 330 | 30.6759 |
| 90% | 0.1245 | 0.0333 | 13.0472 | 62 | 404 | 30.9644 |
| 95% | 0.1247 | 0.0336 | 14.2936 | 66 | 432 | 31.6691 |
| 98% | 0.1247 | 0.0353 | 14.2936 | 66 | 432 | 31.6691 |
| 99% | 0.1247 | 0.0627 | 14.2936 | 66 | 432 | 31.6691 |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
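The summary metrics are related by simple arithmetic, which is a useful sanity check when comparing runs. The sketch below recomputes the three throughput figures from the totals in the table above. The formulas are an interpretation that reproduces the reported numbers, not a statement of how EvalScope computes them internally.

```python
# Values copied from the benchmarking summary above.
elapsed_s = 38.3744
total_requests = 20
avg_input_tokens = 50.25
avg_output_tokens = 254.6

# Requests completed per second of wall-clock time.
request_throughput = total_requests / elapsed_s
# Generated tokens per second, summed over all requests.
output_token_throughput = total_requests * avg_output_tokens / elapsed_s
# Prompt plus generated tokens per second.
total_token_throughput = (
    total_requests * (avg_input_tokens + avg_output_tokens) / elapsed_s
)

print(f"Request throughput (req/s):      {request_throughput:.4f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.4f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.4f}")
```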
For more details, see [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).


@@ -0,0 +1,300 @@
# Using lm-eval
This document guides you through accuracy testing using [lm-eval][1].
## Online Server
### 1. Start the vLLM server
You can run a Docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
# Inside the container, start the vLLM server in the background
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```
The vLLM server has started successfully if you see a log like the following:
```
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run gsm8k accuracy test using lm-eval
You can first query the server with an input prompt to confirm that it responds:
```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
"<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
" Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
" Non-current assets: Net fixed assets 12 million yuan\n"\
" Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
" Non-current liabilities: Long-term loans 9 million yuan\n"\
" Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
"Options:\n"\
"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
"<|im_start|>assistant\n"'",
"max_tokens": 1,
"temperature": 0,
"stop": ["<|im_end|>"]
}' | python3 -m json.tool
```
The output format matches the following:
```
{
"id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
"object": "text_completion",
"created": 1754475138,
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"choices": [
{
"index": 0,
"text": "A",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 252,
"total_tokens": 253,
"completion_tokens": 1,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
```
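When scripting this check, the fields of interest can be pulled out of the response with the standard `json` module. The sketch below works on an abridged copy of the response above; the variable names are illustrative only.

```python
import json

# Abridged copy of the response shown above.
raw_response = """
{
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {"index": 0, "text": "A", "finish_reason": "length", "stop_reason": null}
  ],
  "usage": {"prompt_tokens": 252, "total_tokens": 253, "completion_tokens": 1}
}
"""

response = json.loads(raw_response)
choice = response["choices"][0]
usage = response["usage"]

answer = choice["text"].strip()
# "length" means generation stopped because max_tokens (1 here) was reached.
hit_token_limit = choice["finish_reason"] == "length"

print(f"answer={answer!r} hit_token_limit={hit_token_limit}")
print(f"prompt={usage['prompt_tokens']} completion={usage['completion_tokens']}")
```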
Install lm-eval in the container.
```bash
export HF_ENDPOINT="https://hf-mirror.com"
pip install lm-eval[api]
```
Run the following command:
```
# Only test gsm8k dataset in this demo
lm_eval \
--model local-completions \
--model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
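The `--model_args` value is a comma-separated list of `key=value` pairs, which is easy to mistype by hand. As a minimal sketch, the command above could be assembled from Python as shown here; `build_model_args` and `build_lm_eval_command` are hypothetical helpers, not part of lm-eval, and the sketch assumes no value contains a comma.

```python
import shlex


def build_model_args(options: dict) -> str:
    """Join key/value pairs into the comma-separated --model_args string."""
    return ",".join(f"{key}={value}" for key, value in options.items())


def build_lm_eval_command(model: str, model_args: dict, tasks: list,
                          output_path: str) -> list:
    """Assemble the argument list for the lm_eval command shown above."""
    return [
        "lm_eval",
        "--model", model,
        "--model_args", build_model_args(model_args),
        "--tasks", ",".join(tasks),
        "--output_path", output_path,
    ]


command = build_lm_eval_command(
    model="local-completions",
    model_args={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "base_url": "http://127.0.0.1:8000/v1/completions",
        "tokenized_requests": False,
        "trust_remote_code": True,
    },
    tasks=["gsm8k"],
    output_path="./",
)
print(shlex.join(command))

# To launch the evaluation from Python instead of printing it:
# import subprocess; subprocess.run(command, check=True)
```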
After about 30 minutes, the output looks like the following:
```
The markdown format results is as below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3215|± |0.0129|
| | |strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
```
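The `Stderr` column can be turned into an approximate confidence interval, which helps when judging whether two runs really differ. The following sketch uses the normal approximation (value ± 1.96 × stderr); this is a common convention rather than something lm-eval prints.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96):
    """Approximate 95% interval using the normal approximation."""
    return value - z * stderr, value + z * stderr


# Value and Stderr columns from the table above.
results = {
    "flexible-extract": (0.3215, 0.0129),
    "strict-match": (0.2077, 0.0112),
}

for name, (value, stderr) in results.items():
    low, high = confidence_interval(value, stderr)
    print(f"{name}: {value:.4f} (95% CI {low:.4f} to {high:.4f})")
```

Two scores whose intervals overlap heavily should not be read as a real difference between configurations.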
## Offline Server
### 1. Run docker container
You can run a Docker container on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
### 2. Run gsm8k accuracy test using lm-eval
Install lm-eval in the container.
```bash
export HF_ENDPOINT="https://hf-mirror.com"
pip install lm-eval
```
Run the following command:
```
# Only test gsm8k dataset in this demo
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks gsm8k \
--batch_size auto
```
After 1-2 minutes, the output looks like the following:
```
The markdown format results is as below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3412|± |0.0131|
| | |strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
```
## Use Offline Datasets
Take gsm8k (a single dataset) and mmlu (a multi-subject dataset) as examples. For more, see [here][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# gsm8k yaml directory (relative to the repository root)
cd lm_eval/tasks/gsm8k
# mmlu yaml directory (go back to the repository root first)
cd ../../../lm_eval/tasks/mmlu/default
```
Set [gsm8k.yaml][3] as follows:
```yaml
tag:
  - math_word_problems
task: gsm8k
# set dataset_path arrow or json or parquet according to the downloaded dataset
dataset_path: arrow
# set dataset_name to null
dataset_name: null
output_type: generate_until
# add dataset_kwargs
dataset_kwargs:
  data_files:
    # train and test data download path
    train: /root/.cache/gsm8k/gsm8k-train.arrow
    test: /root/.cache/gsm8k/gsm8k-test.arrow
training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}
  A(Please follow the summarize the result at the end with the format of "The answer is xxx", where xx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - "(?s).*#### "
      - "\\.$"
generation_kwargs:
  until:
    - "Question:"
    - "</s>"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
  - name: "flexible-extract"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
      - function: "take_first"
metadata:
  version: 3.0
```
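The two entries in `filter_list` explain why `strict-match` and `flexible-extract` report different scores. The sketch below approximates the behaviour of those regex filters with Python's `re` module; it is an illustration under that assumption rather than lm-eval's own code, and the two sample responses are made up. The strict pattern only matches answers written in the `#### <number>` reference style, while the flexible pattern takes the last number-like token (`group_select: -1`), possibly including a trailing period that the `"\\.$"` entry in `regexes_to_ignore` would then remove.

```python
import re

STRICT_PATTERN = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE_PATTERN = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")
FALLBACK = "[invalid]"


def extract(pattern: re.Pattern, text: str, group_select: int = 0) -> str:
    """Pick one regex match from `text`, or FALLBACK when nothing matches."""
    matches = pattern.findall(text)
    if not matches:
        return FALLBACK
    match = matches[group_select]
    if isinstance(match, tuple):
        # With several capture groups, keep the first non-empty one.
        match = next((group for group in match if group), FALLBACK)
    return match.strip()


# Hypothetical model responses in two different styles.
reference_style = "48 + 24 = 72 clips in total.\n#### 72"
free_form = "She sold 48 + 24 = 72 clips. The answer is 72."

print(extract(STRICT_PATTERN, reference_style))              # 72
print(extract(STRICT_PATTERN, free_form))                    # [invalid]
print(extract(FLEXIBLE_PATTERN, free_form, group_select=-1)) # 72.
```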
Set [_default_template_yaml][4] as follows:
```yaml
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```
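To see what prompt this template produces, the `doc_to_text` expression can be mimicked in plain Python. lm-eval renders the template with Jinja2; the sketch below substitutes an f-string to stay dependency-free, and the sample document is made up for illustration. It also assumes `answer` is an integer index into `doc_to_choice`.

```python
def render_mmlu_prompt(doc: dict) -> str:
    """Mimic the doc_to_text template above for one multiple-choice document."""
    question = doc["question"].strip()
    choices = doc["choices"]
    return (
        f"{question}\n"
        f"A. {choices[0]}\nB. {choices[1]}\nC. {choices[2]}\nD. {choices[3]}\n"
        "Answer:"
    )


# Hypothetical document with the fields the template reads.
sample_doc = {
    "question": " What is 2 + 3? ",
    "choices": ["4", "5", "6", "7"],
    "answer": 1,  # assumed to index into doc_to_choice, i.e. "B"
}

prompt = render_mmlu_prompt(sample_doc)
target = ["A", "B", "C", "D"][sample_doc["answer"]]
print(prompt)
print("expected choice:", target)
```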
For more usage, see the [lm-eval docs][5].

[1]: https://github.com/EleutherAI/lm-evaluation-harness
[2]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets
[3]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
[4]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml
[5]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md


@@ -0,0 +1,123 @@
# Using OpenCompass
This document guides you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
## 1. Online Serving
You can run a Docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If the service starts successfully, you will see output like the following:
```
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
Once the server has started, you can query the model with an input prompt from a new terminal:
```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
## 2. Run ceval accuracy test using OpenCompass
Install OpenCompass and configure the environment variables in the container.
```bash
# Pin Python 3.10 due to:
# https://github.com/open-compass/opencompass/issues/1976
conda create -n opencompass python=3.10
conda activate opencompass
pip install opencompass modelscope[framework]
export DATASET_SOURCE=ModelScope
git clone https://github.com/open-compass/opencompass.git
```
Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following content:
```python
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets

# Only test ceval-computer_network dataset in this demo
datasets = ceval_datasets[:1]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Qwen2.5-7B-Instruct-vLLM-API',
        type=OpenAISDK,
        key='EMPTY',  # API key
        openai_api_base='http://127.0.0.1:8000/v1',
        path='Qwen/Qwen2.5-7B-Instruct',
        tokenizer_path='Qwen/Qwen2.5-7B-Instruct',
        rpm_verbose=True,
        meta_template=api_meta_template,
        query_per_second=1,
        max_out_len=1024,
        max_seq_len=4096,
        temperature=0.01,
        batch_size=8,
        retry=3,
    )
]
```
Run the following command:
```
python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
```
After 1-2 minutes, the output looks like the following:
```
The markdown format results is as below:
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|
| ceval-computer_network | db9ce2 | accuracy | gen | 68.42 |
```
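If the score needs to be consumed by another script, for example to track accuracy across image versions, the Markdown table can be parsed with a few lines of Python. This is a minimal sketch with a hypothetical `parse_markdown_table` helper; it assumes a simple pipe-delimited table with one separator row, as in the output above.

```python
def parse_markdown_table(text: str) -> list:
    """Turn a pipe-delimited Markdown table into a list of row dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, row)) for row in body]


# Table copied from the output above.
table = """
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|
| ceval-computer_network | db9ce2 | accuracy | gen | 68.42 |
"""

for row in parse_markdown_table(table):
    score = float(row["Qwen2.5-7B-Instruct-vLLM-API"])
    print(f"{row['dataset']}: {row['metric']}={score}")
```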
For more usage, see the [OpenCompass Docs](https://opencompass.readthedocs.io/en/latest/index.html).