# Using AISBench
This document describes how to run accuracy tests with [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.
## Online Server
### 1. Start the vLLM server
You can run a Docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
Run the vLLM server inside the container:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
```
:::{note}
Set `--max_model_len` to at least `35000`; this is suitable for most datasets. A smaller value may affect the accuracy evaluation results.
:::
The vLLM server has started successfully if you see logs like the following:
```shell
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
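Before kicking off an evaluation, you can also verify the endpoint programmatically. Below is a minimal readiness check using only the Python standard library; it queries `/v1/models`, part of the OpenAI-compatible API that vLLM serves (adjust the host and port to match your `-p` mapping):

```python
import urllib.error
import urllib.request

def server_ready(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print(server_ready("http://localhost:8000/v1/models"))
```

If this prints `False`, check the container logs before starting AISBench.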
### 2. Run different datasets using AISBench
#### Install AISBench
Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details.
Install AISBench from source:
```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```
Install the extra AISBench dependencies:
```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```
Run `ais_bench -h` to check the installation.
#### Download Dataset
You can choose one or more datasets for accuracy evaluation.
1. `C-Eval` dataset.
Take the `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
Download the dataset and extract it to the expected path:
```shell
cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip
```
2. `MMLU` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
unzip mmlu.zip
rm mmlu.zip
```
3. `GPQA` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
unzip gpqa.zip
rm gpqa.zip
```
4. `MATH` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
unzip math.zip
rm math.zip
```
5. `LiveCodeBench` dataset.
```shell
cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
```
6. `AIME 2024` dataset.
```shell
cd ais_bench/datasets
mkdir aime/
cd aime/
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
unzip aime.zip
rm aime.zip
```
7. `GSM8K` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
unzip gsm8k.zip
rm gsm8k.zip
```
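Each download above follows the same pattern: fetch an archive into `ais_bench/datasets`, unzip it, and remove the archive. If you script several datasets at once, a small helper can wrap that pattern. This is a hypothetical convenience function, not part of AISBench; it uses `curl` and Python's `zipfile` module so it does not depend on `wget` or `unzip` being installed:

```shell
# Hypothetical helper: download an archive into a target dir and unpack it there.
fetch_dataset() {
  local url="$1" dest="$2" archive
  mkdir -p "$dest"
  archive="$dest/$(basename "$url")"
  curl -fsSL -o "$archive" "$url"
  python3 -c "import sys, zipfile; zipfile.ZipFile(sys.argv[1]).extractall(sys.argv[2])" "$archive" "$dest"
  rm "$archive"
}

# Example:
# fetch_dataset http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip ais_bench/datasets
```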
#### Configuration
Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
There are several arguments that you should update according to your environment.
- `attr`: Identifier for the inference backend type; set to `service` (serving-based inference) or `local` (local model).
- `type`: Selects the backend API type.
- `abbr`: Unique identifier for a local task, used to distinguish between multiple tasks.
- `path`: Update to your model weight path.
- `model`: Update to your model name in vLLM.
- `host_ip` and `host_port`: Update to your vLLM server IP and port.
- `max_out_len`: The sum of `max_out_len` and the LLM input length must be less than `max_model_len` (configured on your vLLM server); `32768` is suitable for most datasets.
- `batch_size`: Update according to your dataset.
- `temperature`: Update the sampling temperature and other inference arguments as needed.
```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="xxxx",
        model="xxxx",
        request_rate=0,
        retry=2,
        host_ip="localhost",
        host_port=8000,
        max_out_len=xxx,
        batch_size=xxx,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.6,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```
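The constraint on `max_out_len` noted above can be sanity-checked before launching a long run. A small helper, using the example values from this guide:

```python
def input_budget(max_model_len: int, max_out_len: int) -> int:
    """Tokens left for the prompt once the output budget is reserved.

    AISBench requires max_out_len + input length < max_model_len,
    so this must stay positive (and comfortably large for CoT prompts).
    """
    if max_out_len >= max_model_len:
        raise ValueError("max_out_len must be smaller than max_model_len")
    return max_model_len - max_out_len

# With the values suggested in this guide:
print(input_budget(35000, 32768))  # prints 2232 (tokens available for the prompt)
```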
#### Execute Accuracy Evaluation
Run the following commands to execute accuracy evaluation on the different datasets:
```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds
# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds
# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```
After each dataset execution, the results are saved to a timestamped directory such as `outputs/default/20250628_151326`. An example of the layout:
```shell
20250628_151326/
├── configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│ └── 20250628_151326_29317.py
├── logs # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│ ├── eval
│ │ └── vllm-api-general-chat
│ │ └── demo_gsm8k.out # Logs of the accuracy evaluation process based on inference results in the predictions/ folder
│ └── infer
│ └── vllm-api-general-chat
│ └── demo_gsm8k.out # Logs of the inference process
├── predictions
│ └── vllm-api-general-chat
│ └── demo_gsm8k.json # Inference results (all outputs returned by the inference service)
├── results
│ └── vllm-api-general-chat
│ └── demo_gsm8k.json # Raw scores calculated from the accuracy evaluation
└── summary
├── summary_20250628_151326.csv # Final accuracy scores (in table format)
├── summary_20250628_151326.md # Final accuracy scores (in Markdown format)
└── summary_20250628_151326.txt # Final accuracy scores (in text format)
```
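The files under `summary/` hold the final scores. If you aggregate results across runs, the CSV variant is easy to consume with the standard library; a minimal sketch (actual column names may differ between AISBench versions):

```python
import csv

def load_summary(csv_path: str) -> list[dict]:
    """Read an AISBench summary CSV into a list of per-dataset row dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Example:
# rows = load_summary("outputs/default/20250628_151326/summary/summary_20250628_151326.csv")
# for row in rows:
#     print(row)
```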
#### Execute Performance Evaluation
Text-only benchmarks:
```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --summarizer default_perf --mode perf
# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --summarizer default_perf --mode perf
# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf
```
Multi-modal benchmarks (text + images):
```shell
# run textvqa dataset
ais_bench --models vllm_api_stream_chat --datasets textvqa_gen_base64 --summarizer default_perf --mode perf
```
After execution, the results are saved to a timestamped directory. An example of the layout:
```shell
20251031_070226/
|-- configs # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
| `-- 20251031_070226_122485.py
|-- logs
| `-- performances
| `-- vllm-api-general-chat
| `-- cevaldataset.out # Logs of the performance evaluation process
`-- performances
`-- vllm-api-general-chat
|-- cevaldataset.csv # Final performance results (in table format)
|-- cevaldataset.json # Final performance results (in json format)
|-- cevaldataset_details.h5 # Detailed performance results (HDF5 format)
|-- cevaldataset_details.json # Detailed performance results (JSON format)
|-- cevaldataset_plot.html # Final performance results (in html format)
`-- cevaldataset_rps_distribution_plot_with_actual_rps.html # Final performance results (in html format)
```
### 3. Troubleshooting
#### Invalid Image Path Error
If you download the TextVQA dataset following the AISBench documentation:
```bash
cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/maoxx241/textvqa_subset
mv textvqa_subset/ textvqa/
mkdir textvqa/textvqa_json/
mv textvqa/*.json textvqa/textvqa_json/
mv textvqa/*.jsonl textvqa/textvqa_json/
```
you may encounter the following error:
```bash
AISBench - ERROR - /vllm-workspace/benchmark/ais_bench/benchmark/clients/base_client.py - raise_error - 35 - [AisBenchClientException] Request failed: HTTP status 400. Server response: {"error":{"message":"1 validation error for ChatCompletionContentPartImageParam\nimage_url\n Input should be a valid dictionary [type=dict_type, input_value='data/textvqa/train_images/b2ae0f96dfbea5d8.jpg', input_type=str]\n For further information visit https://errors.pydantic.dev/2.12/v/dict_type None","type":"BadRequestError","param":null,"code":400}}
```
You need to manually replace the relative image paths in the dataset with absolute paths. In the command below, change `/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` to the actual absolute directory where the images are stored:
```bash
cd ais_bench/datasets/textvqa/textvqa_json
sed -i 's#data/textvqa/train_images/#/path/to/benchmark/ais_bench/datasets/textvqa/train_images/#g' textvqa_val.json
```
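To rehearse the substitution before touching the real dataset, you can run it against a scratch file first. The snippet below is self-contained and safe to run anywhere; it uses GNU `sed -i` syntax, as available in the container image:

```shell
tmp="$(mktemp -d)"
printf '%s\n' '{"image_url": "data/textvqa/train_images/b2ae0f96dfbea5d8.jpg"}' > "$tmp/textvqa_val.json"
# Swap the relative prefix for an absolute one, exactly as in the fix above.
sed -i "s#data/textvqa/train_images/#$tmp/train_images/#g" "$tmp/textvqa_val.json"
cat "$tmp/textvqa_val.json"
```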