[Doc] Refactor the DeepSeek-V3.2-Exp tutorial. (#3871)
### What this PR does / why we need it?
Refactor the DeepSeek-V3.2-Exp tutorial.
- vLLM version: v0.11.0
- vLLM main: 83f478bb19
---------
Signed-off-by: menogrey <1299267905@qq.com>
@@ -5,6 +5,7 @@
:maxdepth: 1
using_evalscope
using_lm_eval
using_ais_bench
using_opencompass
accuracy_report/index
:::
283 docs/source/developer_guide/evaluation/using_ais_bench.md Normal file
@@ -0,0 +1,283 @@
# Using AISBench

This document shows how to run accuracy tests using [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.

## Online Server

### 1. Start the vLLM server

Run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```

Run the vLLM server inside the container:

```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
```

:::{note}
`--max_model_len` should be no less than `35000`, which is sufficient for most datasets. A smaller value may affect the accuracy evaluation.
:::

The vLLM server has started successfully if you see logs like the following:

```
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
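Before running any benchmark, it helps to confirm that the server is actually serving requests. A minimal readiness check using only the Python standard library, assuming the default port `8000` from the `docker run` command above (the helper name `server_ready` is ours, not part of vLLM or AISBench):

```python
import json
import urllib.error
import urllib.request


def server_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the OpenAI-compatible /v1/models endpoint lists a model."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
        return bool(payload.get("data"))
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print(server_ready("http://localhost:8000"))
```

If this returns `False`, check the server logs in the container before proceeding.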

### 2. Run different datasets using AISBench

#### Install AISBench

Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details. Install AISBench from source:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```

Install the extra AISBench dependencies:

```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```

Run `ais_bench -h` to verify the installation.
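If `ais_bench -h` fails, the usual cause is that the install went into a different environment or `PATH` is missing the scripts directory. A small diagnostic sketch (the helper name `cli_available` is ours, not part of AISBench):

```python
import shutil
import subprocess


def cli_available(name: str) -> bool:
    """Return True if `name` resolves on PATH and its -h help exits cleanly."""
    path = shutil.which(name)
    if path is None:
        return False
    result = subprocess.run([path, "-h"], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print(cli_available("ais_bench"))
```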

#### Download Dataset

You can choose one or more datasets for accuracy evaluation.

1. `C-Eval` dataset.

    Take the `C-Eval` dataset as an example; refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` describing the download and installation process in detail.

    Download the dataset and extract it to the expected path:

    ```shell
    cd ais_bench/datasets
    mkdir ceval/
    mkdir ceval/formal_ceval
    cd ceval/formal_ceval
    wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
    unzip ceval-exam.zip
    rm ceval-exam.zip
    ```

2. `MMLU` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
    unzip mmlu.zip
    rm mmlu.zip
    ```

3. `GPQA` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
    unzip gpqa.zip
    rm gpqa.zip
    ```

4. `MATH` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
    unzip math.zip
    rm math.zip
    ```

5. `LiveCodeBench` dataset.

    ```shell
    cd ais_bench/datasets
    git lfs install
    git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
    ```

6. `AIME 2024` dataset.

    ```shell
    cd ais_bench/datasets
    mkdir aime/
    cd aime/
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
    unzip aime.zip
    rm aime.zip
    ```

7. `GSM8K` dataset.

    ```shell
    cd ais_bench/datasets
    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
    unzip gsm8k.zip
    rm gsm8k.zip
    ```
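After downloading, you can sanity-check that the expected paths exist under `ais_bench/datasets` before launching a long evaluation. A small sketch, assuming each archive extracts into a directory named after the dataset as in the steps above (the helper name `missing_datasets` is ours):

```python
from pathlib import Path

# Paths created by the download steps above (an assumption about archive layout).
EXPECTED = [
    "ceval/formal_ceval",
    "mmlu",
    "gpqa",
    "math",
    "code_generation_lite",
    "aime",
    "gsm8k",
]


def missing_datasets(root: str) -> list:
    """Return the expected dataset paths that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    print(missing_datasets("ais_bench/datasets"))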

#### Configuration

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
Several arguments should be updated according to your environment:

- `path`: your model weight path.
- `model`: your model name in vLLM.
- `host_ip` and `host_port`: the IP and port of your vLLM server.
- `max_out_len`: the maximum output length. Note that `max_out_len` plus the input length must be less than `max-model-len` (configured on your vLLM server); `32768` is suitable for most datasets.
- `batch_size`: set according to your dataset.
- `temperature`: inference sampling argument; adjust as needed.

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate=0,
        retry=2,
        host_ip="localhost",
        host_port=8000,
        max_out_len=xxx,
        batch_size=xxx,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.6,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```
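The constraint on `max_out_len` described above is simple arithmetic: output budget is the server's `max-model-len` minus the longest prompt you expect. A sketch with illustrative numbers (the helper name `safe_max_out_len` is ours):

```python
def safe_max_out_len(max_model_len: int, max_prompt_len: int, margin: int = 0) -> int:
    """Largest output length that keeps prompt + output within max-model-len."""
    budget = max_model_len - max_prompt_len - margin
    if budget <= 0:
        raise ValueError("prompt length already exceeds max-model-len")
    return budget


# Example: a 35000-token context with prompts up to 2000 tokens
# leaves room for max_out_len up to 33000.
print(safe_max_out_len(35000, 2000))
```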

#### Execute Accuracy Evaluation

Run the following commands to execute accuracy evaluation on the different datasets:

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After each run, the results are saved to a timestamped directory such as `outputs/default/20250628_151326`, for example:

```
20250628_151326/
├── configs                          # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs                             # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out       # Logs of the accuracy evaluation based on the inference results in predictions/
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out       # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json          # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json          # Raw scores calculated from the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv  # Final accuracy scores (table format)
    ├── summary_20250628_151326.md   # Final accuracy scores (Markdown format)
    └── summary_20250628_151326.txt  # Final accuracy scores (text format)
```
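For post-processing, the final scores in `summary/*.csv` can be read with the standard `csv` module. A generic loader sketch; the path and column names below are illustrative, since the exact header depends on the datasets you ran:

```python
import csv
from pathlib import Path


def read_summary(path: str) -> list:
    """Load an AISBench summary CSV into a list of per-row dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    # Illustrative path; substitute your own timestamped output directory.
    summary = Path("outputs/default/20250628_151326/summary/summary_20250628_151326.csv")
    if summary.exists():
        for row in read_summary(str(summary)):
            print(row)
```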

#### Execute Performance Evaluation

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --summarizer default_perf --mode perf

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --summarizer default_perf --mode perf

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, the results are saved to files as in the following example:

```
20251031_070226/
|-- configs                               # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out          # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv              # Final performance results (table format)
        |-- cevaldataset.json             # Final performance results (JSON format)
        |-- cevaldataset_details.h5       # Detailed performance results (HDF5 format)
        |-- cevaldataset_details.json     # Detailed performance results (JSON format)
        |-- cevaldataset_plot.html        # Performance plots (HTML format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html  # RPS distribution plot (HTML format)
```
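The machine-readable performance numbers live in `cevaldataset.json`. A generic loader sketch; the path is illustrative and the key names inside the JSON depend on the AISBench version:

```python
import json
from pathlib import Path


def load_perf(path: str) -> dict:
    """Load a performance-results JSON file produced by --mode perf."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Illustrative path; substitute your own timestamped output directory.
    perf_file = Path("outputs/default/20251031_070226/performances/vllm-api-general-chat/cevaldataset.json")
    if perf_file.exists():
        print(sorted(load_perf(str(perf_file))))
```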

@@ -122,10 +122,10 @@ After 30 minutes, the output is as shown below:
```
The markdown format results are as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3215|±  |0.0129|
|gsm8k|      3|strict-match    |     5|exact_match|↑  |0.2077|±  |0.0112|

```

@@ -187,7 +187,7 @@ The markdown format results is as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3412|±  |0.0131|
|gsm8k|      3|strict-match    |     5|exact_match|↑  |0.3139|±  |0.0128|

```