[Doc] Refactor the DeepSeek-V3.2-Exp tutorial. (#3871)
### What this PR does / why we need it?
Refactor the DeepSeek-V3.2-Exp tutorial.
- vLLM version: v0.11.0
- vLLM main: 83f478bb19
---------
Signed-off-by: menogrey <1299267905@qq.com>
@@ -5,6 +5,7 @@
:maxdepth: 1
using_evalscope
using_lm_eval
using_ais_bench
using_opencompass
accuracy_report/index
:::
283 docs/source/developer_guide/evaluation/using_ais_bench.md Normal file
@@ -0,0 +1,283 @@
# Using AISBench

This document shows how to run accuracy tests using [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.

## Online Server

### 1. Start the vLLM server

Run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```

Run the vLLM server inside the container:

```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
```

:::{note}
`--max_model_len` should be no less than `35000`, which is sufficient for most datasets. A smaller value may affect the accuracy evaluation.
:::

The vLLM server has started successfully if you see logs like the following:

```
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
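Before running any benchmark, it helps to confirm that the server is actually serving requests. A minimal readiness check using only the Python standard library, assuming the default port `8000` from the `docker run` command above (the helper name `server_ready` is ours, not part of vLLM or AISBench):

```python
import json
import urllib.error
import urllib.request


def server_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the OpenAI-compatible /v1/models endpoint lists a model."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
        return bool(payload.get("data"))
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print(server_ready("http://localhost:8000"))
```

If this returns `False`, check the server logs in the container before proceeding.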

### 2. Run different datasets using AISBench

#### Install AISBench

Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details. Install AISBench from source:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```

Install the extra AISBench dependencies:

```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```

Run `ais_bench -h` to verify the installation.
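If `ais_bench -h` fails, the usual cause is that the install went into a different environment or `PATH` is missing the scripts directory. A small diagnostic sketch (the helper name `cli_available` is ours, not part of AISBench):

```python
import shutil
import subprocess


def cli_available(name: str) -> bool:
    """Return True if `name` resolves on PATH and its -h help exits cleanly."""
    path = shutil.which(name)
    if path is None:
        return False
    result = subprocess.run([path, "-h"], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print(cli_available("ais_bench"))
```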

#### Download Dataset

You can choose one or more datasets for accuracy evaluation.

1. `C-Eval` dataset.

    Take the `C-Eval` dataset as an example; refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` describing the download and installation process in detail.

    Download the dataset and extract it to the expected path:

    ```shell
    cd ais_bench/datasets
    mkdir ceval/
    mkdir ceval/formal_ceval
    cd ceval/formal_ceval
    wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
    unzip ceval-exam.zip
    rm ceval-exam.zip
    ```

2. `MMLU` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
    unzip mmlu.zip
    rm mmlu.zip
    ```

3. `GPQA` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
    unzip gpqa.zip
    rm gpqa.zip
    ```

4. `MATH` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
    unzip math.zip
    rm math.zip
    ```

5. `LiveCodeBench` dataset.

    ```shell
    cd ais_bench/datasets
    git lfs install
    git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
    ```

6. `AIME 2024` dataset.

    ```shell
    cd ais_bench/datasets
    mkdir aime/
    cd aime/
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
    unzip aime.zip
    rm aime.zip
    ```

7. `GSM8K` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
    unzip gsm8k.zip
    rm gsm8k.zip
    ```
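After downloading, you can sanity-check that the expected paths exist under `ais_bench/datasets` before launching a long evaluation. A small sketch, assuming each archive extracts into a directory named after the dataset as in the steps above (the helper name `missing_datasets` is ours):

```python
from pathlib import Path

# Paths created by the download steps above (an assumption about archive layout).
EXPECTED = [
    "ceval/formal_ceval",
    "mmlu",
    "gpqa",
    "math",
    "code_generation_lite",
    "aime",
    "gsm8k",
]


def missing_datasets(root: str) -> list:
    """Return the expected dataset paths that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    print(missing_datasets("ais_bench/datasets"))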

#### Configuration

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
Several arguments should be updated according to your environment:

- `path`: your model weight path.
- `model`: your model name in vLLM.
- `host_ip` and `host_port`: the IP and port of your vLLM server.
- `max_out_len`: the maximum output length. Note that `max_out_len` plus the input length must be less than `max-model-len` (configured on your vLLM server); `32768` is suitable for most datasets.
- `batch_size`: set according to your dataset.
- `temperature`: inference sampling argument; adjust as needed.

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate=0,
        retry=2,
        host_ip="localhost",
        host_port=8000,
        max_out_len=xxx,
        batch_size=xxx,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.6,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```
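The constraint on `max_out_len` described above is simple arithmetic: output budget is the server's `max-model-len` minus the longest prompt you expect. A sketch with illustrative numbers (the helper name `safe_max_out_len` is ours):

```python
def safe_max_out_len(max_model_len: int, max_prompt_len: int, margin: int = 0) -> int:
    """Largest output length that keeps prompt + output within max-model-len."""
    budget = max_model_len - max_prompt_len - margin
    if budget <= 0:
        raise ValueError("prompt length already exceeds max-model-len")
    return budget


# Example: a 35000-token context with prompts up to 2000 tokens
# leaves room for max_out_len up to 33000.
print(safe_max_out_len(35000, 2000))
```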

#### Execute Accuracy Evaluation

Run the following commands to execute accuracy evaluation on the different datasets:

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After each run, the results are saved to a timestamped directory such as `outputs/default/20250628_151326`, for example:

```
20250628_151326/
├── configs                          # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs                             # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out       # Logs of the accuracy evaluation based on the inference results in predictions/
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out       # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json          # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json          # Raw scores calculated from the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv  # Final accuracy scores (table format)
    ├── summary_20250628_151326.md   # Final accuracy scores (Markdown format)
    └── summary_20250628_151326.txt  # Final accuracy scores (text format)
```
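For post-processing, the final scores in `summary/*.csv` can be read with the standard `csv` module. A generic loader sketch; the path and column names below are illustrative, since the exact header depends on the datasets you ran:

```python
import csv
from pathlib import Path


def read_summary(path: str) -> list:
    """Load an AISBench summary CSV into a list of per-row dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    # Illustrative path; substitute your own timestamped output directory.
    summary = Path("outputs/default/20250628_151326/summary/summary_20250628_151326.csv")
    if summary.exists():
        for row in read_summary(str(summary)):
            print(row)
```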

#### Execute Performance Evaluation

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --summarizer default_perf --mode perf

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --summarizer default_perf --mode perf

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, the results are saved to files as in the following example:

```
20251031_070226/
|-- configs                               # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out          # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv              # Final performance results (table format)
        |-- cevaldataset.json             # Final performance results (JSON format)
        |-- cevaldataset_details.h5       # Detailed performance results (HDF5 format)
        |-- cevaldataset_details.json     # Detailed performance results (JSON format)
        |-- cevaldataset_plot.html        # Performance plots (HTML format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html  # RPS distribution plot (HTML format)
```
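The machine-readable performance numbers live in `cevaldataset.json`. A generic loader sketch; the path is illustrative and the key names inside the JSON depend on the AISBench version:

```python
import json
from pathlib import Path


def load_perf(path: str) -> dict:
    """Load a performance-results JSON file produced by --mode perf."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Illustrative path; substitute your own timestamped output directory.
    perf_file = Path("outputs/default/20251031_070226/performances/vllm-api-general-chat/cevaldataset.json")
    if perf_file.exists():
        print(sorted(load_perf(str(perf_file))))
```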

@@ -122,10 +122,10 @@ After 30 minutes, the output is as shown below:
```
The markdown format results are as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3215|±  |0.0129|
|gsm8k|      3|strict-match    |     5|exact_match|↑  |0.2077|±  |0.0112|

```

@@ -187,7 +187,7 @@ The markdown format results is as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3412|±  |0.0131|
|gsm8k|      3|strict-match    |     5|exact_match|↑  |0.3139|±  |0.0128|

```