
# Using lm-eval
This document guides you through accuracy testing with [lm-eval][1].

## Online Server
### 1. Start the vLLM server
Run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
   :substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
# Inside the container, start the vLLM server in the background
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```

The vLLM server has started successfully if you see logs like the following:

```
INFO:     Started server process [9446]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
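
If you script this step, you can poll the server instead of watching the log. The following is a minimal sketch, assuming the OpenAI-compatible `/v1/models` route is reachable on port 8000 (matching `-p 8000:8000` above). `wait_for_server` and its `fetch` parameter are hypothetical helper names, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=300.0, interval_s=2.0, fetch=None):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    `fetch` is a hypothetical injection point (url -> HTTP status code) so the
    loop can be exercised without a running server.
    """
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # the server is not accepting requests yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (assumes the server started above listens on port 8000):
# wait_for_server("http://localhost:8000/v1/models")
```

The function returns `True` as soon as one request succeeds and `False` once the timeout has passed, so a calling script can decide whether to proceed with the evaluation.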

### 2. Run gsm8k accuracy test using lm-eval

You can first check that the server responds by sending it an input prompt:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
"<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
"  Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
"  Non-current assets: Net fixed assets 12 million yuan\n"\
"  Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
"  Non-current liabilities: Long-term loans 9 million yuan\n"\
"  Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
"Options:\n"\
"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
"<|im_start|>assistant\n"'",
        "max_tokens": 1,
        "temperature": 0,
        "stop": ["<|im_end|>"]
    }' | python3 -m json.tool
```

The output should look similar to the following:

```
{
    "id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
    "object": "text_completion",
    "created": 1754475138,
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "choices": [
        {
            "index": 0,
            "text": "A",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 252,
        "total_tokens": 253,
        "completion_tokens": 1,
        "prompt_tokens_details": null
    },
    "kv_transfer_params": null
}
```
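
To check such a response programmatically, it is enough to read `choices[0]` and `usage`. The sketch below uses a trimmed copy of the response above. `extract_choice` is a hypothetical helper name. Note that `finish_reason` is `length` because the request caps generation at `max_tokens: 1`.

```python
import json


def extract_choice(response_text):
    """Return (answer_text, finish_reason, completion_tokens) from a completions response."""
    payload = json.loads(response_text)
    choice = payload["choices"][0]
    return (
        choice["text"].strip(),
        choice["finish_reason"],
        payload["usage"]["completion_tokens"],
    )


# Trimmed copy of the response shown above.
sample = """{
    "choices": [{"index": 0, "text": "A", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 252, "total_tokens": 253, "completion_tokens": 1}
}"""

answer, reason, n_tokens = extract_choice(sample)
print(answer, reason, n_tokens)  # A length 1
```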

Install lm-eval in the container.

```bash
export HF_ENDPOINT="https://hf-mirror.com"
pip install "lm-eval[api]"
```

Run the following command:

```
# Only test gsm8k dataset in this demo
lm_eval \
  --model local-completions \
  --model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
  --tasks gsm8k \
  --output_path ./
```

After about 30 minutes, the output looks like this:

```
The markdown format results is as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3215|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.2077|±  |0.0112|

```
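
If you want to consume these numbers in a script (for example, a regression check), the pipe-delimited table can be parsed with the standard library. This is a sketch only: `parse_results_table` is a hypothetical helper, and it assumes the column layout shown above.

```python
def parse_results_table(text):
    """Parse a pipe-delimited lm-eval results table into a list of dicts."""
    rows = []
    header = None
    last_task = ""
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip prose lines around the table
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(set(c) <= set("-:") for c in cells):
            continue  # skip the |-----|------:| separator row
        # Unnamed columns (the arrow and the plus-minus sign) are dropped.
        row = {name: cell for name, cell in zip(header, cells) if name}
        # A blank Tasks cell means "same task as the previous row".
        row["Tasks"] = row["Tasks"] or last_task
        last_task = row["Tasks"]
        row["Value"] = float(row["Value"])
        row["Stderr"] = float(row["Stderr"])
        rows.append(row)
    return rows


table = """
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3215|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.2077|±  |0.0112|
"""

rows = parse_results_table(table)
for r in rows:
    print(r["Tasks"], r["Filter"], r["Value"], r["Stderr"])
```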

## Offline Server
### 1. Run docker container

Run a Docker container on a single NPU:

```{code-block} bash
   :substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```

### 2. Run gsm8k accuracy test using lm-eval
Install lm-eval in the container.

```bash
export HF_ENDPOINT="https://hf-mirror.com"
pip install lm-eval
```

Run the following command:

```
# Only test gsm8k dataset in this demo
lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
  --tasks gsm8k \
  --batch_size auto
```

After 1-2 minutes, the output looks like this:

```
The markdown format results is as below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3412|±  |0.0131|
|     |       |strict-match    |     5|exact_match|↑  |0.3139|±  |0.0128|

```
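
The `Stderr` column gives a rough way to judge whether the online and offline scores differ by more than sampling noise. The sketch below computes a z-score for the difference, using the values from the two result tables in this document. It treats the two runs as independent, which is a simplifying assumption because both runs score the same test questions. `diff_z_score` is a hypothetical helper name.

```python
import math


def diff_z_score(value_a, stderr_a, value_b, stderr_b):
    """z-score of (b - a), treating the two runs as independent."""
    return (value_b - value_a) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)


# (value, stderr) pairs copied from the two result tables in this document.
online = {"flexible-extract": (0.3215, 0.0129), "strict-match": (0.2077, 0.0112)}
offline = {"flexible-extract": (0.3412, 0.0131), "strict-match": (0.3139, 0.0128)}

for name in online:
    z = diff_z_score(*online[name], *offline[name])
    print(f"{name}: z = {z:.2f}")
```

Under this assumption, the flexible-extract scores are about one standard error apart, while the strict-match scores are about six standard errors apart. The strict-match gap is therefore unlikely to be noise alone.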

## Use offline Datasets

The following uses gsm8k (a single dataset) and mmlu (a multi-subject dataset) as examples. For more details, see [using local datasets][2].

```bash
# set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# Task configs to edit (paths relative to the repository root):
#   gsm8k: lm_eval/tasks/gsm8k/gsm8k.yaml
#   mmlu:  lm_eval/tasks/mmlu/default/_default_template_yaml
```

Set [gsm8k.yaml][3] as follows:

```yaml
tag:
  - math_word_problems
task: gsm8k

# set dataset_path to arrow, json, or parquet according to the format of the downloaded dataset
dataset_path: arrow

# set dataset_name to null
dataset_name: null
output_type: generate_until

# add dataset_kwargs 
dataset_kwargs:
  data_files:
    # train and test data download path
    train: /root/.cache/gsm8k/gsm8k-train.arrow
    test: /root/.cache/gsm8k/gsm8k-test.arrow

training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}
  A(Please summarize the result at the end in the format "The answer is xxx", where xxx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - "(?s).*#### "
      - "\\.$"
generation_kwargs:
  until:
    - "Question:"
    - "</s>"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
  - name: "flexible-extract"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
      - function: "take_first"
metadata:
  version: 3.0
```
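
The two entries in `filter_list` explain why the tables report two scores. `strict-match` only accepts an answer that follows `#### `, while `flexible-extract` takes the last number-like token in the response. The sketch below imitates that behavior with Python's `re` module. It is a simplified illustration, not lm-eval's own code: `apply_regex_filter` is a hypothetical helper, and the `[invalid]` fallback value is an assumption about the library's default.

```python
import re

# Patterns copied from the YAML above (YAML "\\" becomes a single backslash).
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def apply_regex_filter(pattern, response, group_select=0, fallback="[invalid]"):
    """Pick the `group_select`-th regex match, or `fallback` if nothing matches."""
    matches = pattern.findall(response)
    if not matches:
        return fallback
    match = matches[group_select]
    if isinstance(match, tuple):
        # With several capture groups, keep the first non-empty one.
        non_empty = [m for m in match if m]
        match = non_empty[0] if non_empty else fallback
    return match.strip()


with_marker = "16 - 7 = 9 eggs, 9 * 2 = 18 dollars.\n#### 18"
without_marker = "The answer is 18."

print(apply_regex_filter(STRICT, with_marker))                       # 18
print(apply_regex_filter(FLEXIBLE, with_marker, group_select=-1))    # 18
print(apply_regex_filter(STRICT, without_marker))                    # [invalid]
print(apply_regex_filter(FLEXIBLE, without_marker, group_select=-1)) # 18.
```

The trailing period in the last result is later removed by the `"\\.$"` entry in `regexes_to_ignore`, so a response without the `#### ` marker can still count as correct under `flexible-extract` while failing `strict-match`.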

Set [_default_template_yaml][4] as follows:

```yaml
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```
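
To see what prompt the `doc_to_text` template above produces, the sketch below rebuilds it with plain Python string formatting. lm-eval renders these templates with Jinja2, so this is an approximation only. The sample record is hypothetical, and the sketch assumes that `answer` is a zero-based index into `choices`.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]  # mirrors doc_to_choice


def render_mmlu_prompt(doc):
    """Approximate the doc_to_text template for one dataset record."""
    lines = [doc["question"].strip()]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, doc["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


def target_label(doc):
    """Map the integer `answer` field to its choice letter (doc_to_target)."""
    return CHOICE_LABELS[doc["answer"]]


# Hypothetical record in the shape the template expects.
doc = {
    "question": " What is 2 + 2? ",
    "choices": ["3", "4", "5", "22"],
    "answer": 1,
}

print(render_mmlu_prompt(doc))
print(target_label(doc))  # B
```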

For more usage details, see the [lm-eval docs][5].

[1]: https://github.com/EleutherAI/lm-evaluation-harness
[2]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets
[3]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
[4]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml
[5]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md