# Using AISBench

This document guides you through accuracy testing with [AISBench](https://gitee.com/aisbench/benchmark/tree/master), which provides accuracy and performance evaluation for many datasets.

## Online Server
### 1. Start the vLLM server

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```

Run the vLLM server inside the container:

```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
```

:::{note}
`--max_model_len` should be set to at least `35000`, which is sufficient for most datasets. Otherwise, the accuracy evaluation may be affected.
:::

The vLLM server has started successfully if you see logs like the following:

```
INFO:     Started server process [9446]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
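
If you want to confirm the endpoint is reachable before benchmarking, a small probe of the OpenAI-compatible `/v1/models` route is enough. The helper below is an illustrative sketch, not part of AISBench or vLLM:

```python
import json
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the vLLM server answers on its OpenAI-compatible API."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            # A healthy server returns HTTP 200 with a JSON model list.
            return resp.status == 200 and "data" in json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print("server ready:", server_ready())
```

If the probe returns `False`, check that the container's port mapping (`-p 8000:8000`) matches the address you are testing.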

### 2. Run different datasets using AISBench

#### Install AISBench

Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details.
Install AISBench from source:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```

Install the extra AISBench dependencies:

```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```

Run `ais_bench -h` to check the installation.

#### Download Dataset

You can choose one or more datasets for accuracy evaluation.

1. `C-Eval` dataset.

    Take the `C-Eval` dataset as an example; refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Every dataset has a `README.md` describing the download and installation process in detail.

    Download the dataset and install it to the expected path:

    ```shell
    cd ais_bench/datasets
    mkdir -p ceval/formal_ceval
    cd ceval/formal_ceval
    wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
    unzip ceval-exam.zip
    rm ceval-exam.zip
    ```

2. `MMLU` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
    unzip mmlu.zip
    rm mmlu.zip
    ```

3. `GPQA` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
    unzip gpqa.zip
    rm gpqa.zip
    ```

4. `MATH` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
    unzip math.zip
    rm math.zip
    ```

5. `LiveCodeBench` dataset.

    ```shell
    cd ais_bench/datasets
    git lfs install
    git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
    ```

6. `AIME 2024` dataset.

    ```shell
    cd ais_bench/datasets
    mkdir aime/
    cd aime/
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
    unzip aime.zip
    rm aime.zip
    ```

7. `GSM8K` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
    unzip gsm8k.zip
    rm gsm8k.zip
    ```
#### Configuration

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`. Several arguments should be updated to match your environment:

- `path`: your model weight path.
- `model`: your model name in vLLM.
- `host_ip` and `host_port`: the IP and port of your vLLM server.
- `max_out_len`: note that `max_out_len` plus the LLM input length must be less than `max-model-len` (configured on your vLLM server); `32768` is suitable for most datasets.
- `batch_size`: set according to your dataset.
- `temperature`: the inference sampling temperature.

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = xxx,
        batch_size = xxx,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```
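
The length budget behind `max_out_len` can be sanity-checked with a few lines of Python. This is an illustrative helper only; `fits_context` is not part of AISBench:

```python
def fits_context(input_len: int, max_out_len: int, max_model_len: int) -> bool:
    """Check that the prompt length plus max_out_len stays within the server's max-model-len."""
    return input_len + max_out_len < max_model_len


# With the server started via --max_model_len 35000, a 2000-token prompt
# still leaves room for max_out_len = 32768 (2000 + 32768 = 34768 < 35000):
assert fits_context(2000, 32768, 35000)
# A 3000-token prompt would exceed the budget (3000 + 32768 = 35768 >= 35000):
assert not fits_context(3000, 32768, 35000)
```

If your prompts are long, either lower `max_out_len` or start the server with a larger `--max_model_len`.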

#### Execute Accuracy Evaluation

Run the following commands to execute accuracy evaluation on different datasets:

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After each run, the results are saved under a timestamped directory such as `outputs/default/20250628_151326`, for example:

```
20250628_151326/
├── configs                          # Combined configuration for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs                             # Execution logs; if --debug is added to the command, intermediate logs are printed to the screen instead of saved to disk
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out       # Logs of the accuracy evaluation based on the inference results in predictions/
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out       # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json          # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json          # Raw scores calculated by the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv  # Final accuracy scores (table format)
    ├── summary_20250628_151326.md   # Final accuracy scores (Markdown format)
    └── summary_20250628_151326.txt  # Final accuracy scores (plain-text format)
```
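
The final scores in `summary/summary_*.csv` can also be read programmatically with the standard library. The column names in this sketch are illustrative; check the header of your actual summary file:

```python
import csv
import io

# Example CSV content (illustrative; real headers depend on the AISBench version).
sample = """dataset,metric,mode,vllm-api-general-chat
demo_gsm8k,accuracy,gen,62.50
"""


def load_summary(f) -> list:
    """Read an AISBench summary CSV into a list of row dictionaries."""
    return list(csv.DictReader(f))


rows = load_summary(io.StringIO(sample))
print(rows[0]["dataset"])                 # demo_gsm8k
print(rows[0]["vllm-api-general-chat"])   # 62.50
```

To read a real run, pass an open file instead of the sample string, e.g. `load_summary(open("outputs/default/20250628_151326/summary/summary_20250628_151326.csv", newline=""))`.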

#### Execute Performance Evaluation

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --summarizer default_perf --mode perf

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --summarizer default_perf --mode perf

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, the results are saved to files; here is an example:

```
20251031_070226/
|-- configs                          # Combined configuration for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out     # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv         # Final performance results (table format)
        |-- cevaldataset.json        # Final performance results (JSON format)
        |-- cevaldataset_details.h5  # Detailed performance results (HDF5 format)
        |-- cevaldataset_details.json  # Detailed performance results (JSON format)
        |-- cevaldataset_plot.html   # Performance plots (HTML format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html  # RPS distribution plot (HTML format)
```
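
The JSON result files can be inspected with the standard `json` module. Since the exact fields depend on the AISBench version, this sketch only lists the top-level keys (`top_level_keys` is an illustrative helper, not an AISBench API):

```python
import json


def top_level_keys(path: str) -> list:
    """List the top-level keys of a results JSON file, sorted alphabetically."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Performance results are expected to be a JSON object; return [] otherwise.
    return sorted(data) if isinstance(data, dict) else []
```

For example, `top_level_keys("performances/vllm-api-general-chat/cevaldataset.json")` shows which metrics your AISBench version emits before you write any further processing.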