# Using lm-eval

This document shows how to run accuracy tests with [lm-eval][1].

## Online Server

### 1. Start the vLLM server

You can run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash

vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
```

The vLLM server has started successfully if you see logs like the following:

```shell
INFO:     Started server process [9446]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
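
If you prefer to script the readiness check instead of watching the logs, you can poll the server until it responds. This is a minimal sketch, assuming the standard `/health` endpoint of the vLLM OpenAI-compatible server:

```bash
# Poll the vLLM server until its health endpoint responds
until curl -sf http://localhost:8000/health > /dev/null; do
    echo "Waiting for the vLLM server to start..."
    sleep 5
done
echo "vLLM server is ready."
```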

### 2. Query the vLLM server with curl, then run lm-eval for accuracy testing

You can first send an input prompt to verify that the server answers correctly:

```shell
PROMPT='<|im_start|>system
You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
<|im_start|>user
Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
Non-current assets: Net fixed assets 12 million yuan
Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
Non-current liabilities: Long-term loans 9 million yuan
Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
Options:
A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
<|im_start|>assistant
'

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg model "Qwen/Qwen2.5-0.5B-Instruct" \
--arg prompt "$PROMPT" \
'{
model: $model,
prompt: $prompt,
max_completion_tokens: 1,
temperature: 0,
stop: ["<|im_end|>"]
}')" | python3 -m json.tool
```

The output format matches the following:

```json
{
    "id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
    "object": "text_completion",
    "created": 1754475138,
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "choices": [
        {
            "index": 0,
            "text": "A",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 252,
        "total_tokens": 253,
        "completion_tokens": 1,
        "prompt_tokens_details": null
    },
    "kv_transfer_params": null
}
```
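
If you only need the generated answer rather than the full response, you can pipe the same request through `jq` instead of `python3 -m json.tool`. A small sketch, reusing the `PROMPT` variable defined above:

```bash
# Extract just the generated text from the first completion choice
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg model "Qwen/Qwen2.5-0.5B-Instruct" \
--arg prompt "$PROMPT" \
'{model: $model, prompt: $prompt, max_completion_tokens: 1, temperature: 0, stop: ["<|im_end|>"]}')" \
| jq -r '.choices[0].text'
# For this prompt, the output is a single option letter such as "A"
```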

Install lm-eval in the container:

```bash
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval[api]
```

:::{note}
The Docker container is launched with `VLLM_USE_MODELSCOPE=True`, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting `USE_MODELSCOPE_HUB=0` disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::

Run the following command:

```shell
# Only test the gsm8k dataset in this demo
lm_eval \
--model local-completions \
--model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
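
Before committing to the full run, you can optionally smoke-test the setup with lm-eval's `--limit` flag, which evaluates only the first few samples. Subset scores are for debugging only and are not comparable to full-dataset accuracy. A sketch:

```shell
# Sanity check on 20 samples only
lm_eval \
--model local-completions \
--model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--limit 20 \
--output_path ./
```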

After about 30 minutes, the output looks like the following:

```shell
The markdown format results are as below:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3215|± |0.0129|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
```
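
Because the run was started with `--output_path ./`, lm-eval also saves the results as JSON. The exact directory layout varies by lm-eval version, so a path-agnostic lookup is safest; a sketch:

```bash
# Locate and pretty-print a saved results file
find . -name "results_*.json" | head -n 1 | xargs python3 -m json.tool
```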

## Offline Server

### 1. Run a Docker container

You can run a Docker container on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```

### 2. Run GSM8K using lm-eval for accuracy testing

Install lm-eval in the container:

```bash
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval
```

:::{note}
The Docker container is launched with `VLLM_USE_MODELSCOPE=True`, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting `USE_MODELSCOPE_HUB=0` disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::

Run the following command:

```shell
# Only test the gsm8k dataset in this demo
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks gsm8k \
--batch_size auto
```
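
For larger models, additional options in `--model_args` are forwarded to the vLLM engine. The values below are illustrative assumptions; adjust them to your model size and NPU memory:

```shell
# Illustrative: tensor_parallel_size and gpu_memory_utilization are
# forwarded to the vLLM engine by lm-eval's vllm backend
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096,tensor_parallel_size=2,gpu_memory_utilization=0.9 \
--tasks gsm8k \
--batch_size auto
```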

After 1 to 2 minutes, the output looks like the following:

```shell
The markdown format results are as below:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3412|± |0.0131|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
```

## Use Offline Datasets

Take GSM8K (a single dataset) and MMLU (a multi-subject dataset) as examples; you can see more in [using-local-datasets][2].

```bash
# set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# gsm8k task yaml: lm_eval/tasks/gsm8k/gsm8k.yaml
# mmlu task yaml: lm_eval/tasks/mmlu/default/_default_template_yaml
```
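
The YAML edits below expect the dataset files to already exist on disk. One way to fetch them on a machine with network access is `huggingface-cli`; this is a sketch, and the repository id, file format, and target directory are assumptions you should align with the `dataset_path` and `data_files` entries you configure:

```bash
# Download the GSM8K dataset files for offline use (illustrative repo id and path);
# set dataset_path/data_files below to match the downloaded file format
huggingface-cli download openai/gsm8k --repo-type dataset --local-dir /root/.cache/gsm8k
```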

Set [gsm8k.yaml][3] as follows:

```yaml
tag:
  - math_word_problems
task: gsm8k

# set dataset_path arrow or json or parquet according to the downloaded dataset
dataset_path: arrow

# set dataset_name to null
dataset_name: null
output_type: generate_until

# add dataset_kwargs
dataset_kwargs:
  data_files:
    # train and test data download path
    train: /root/.cache/gsm8k/gsm8k-train.arrow
    test: /root/.cache/gsm8k/gsm8k-test.arrow

training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}

A(Please follow the summarized result at the end with the format of "The answer is xxx", where xxx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - "(?s).*#### "
      - "\\.$"
generation_kwargs:
  until:
    - "Question:"
    - "</s>"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
  - name: "flexible-extract"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
      - function: "take_first"
metadata:
  version: 3.0
```
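
With the YAML edited inside the cloned repository (installed via `pip install -e .`), the offline run uses the same command as before; `HF_DATASETS_OFFLINE=1` keeps the datasets library from reaching the network. A sketch:

```shell
export HF_DATASETS_OFFLINE=1
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks gsm8k \
--batch_size auto
```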

Set [_default_template_yaml][4] as follows:

```yaml
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```
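
The MMLU group can then be evaluated the same way; `--tasks mmlu` runs all subjects that share this template. A sketch:

```shell
export HF_DATASETS_OFFLINE=1
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks mmlu \
--batch_size auto
```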

You can find more usage in the [lm-eval docs][5].

[1]: https://github.com/EleutherAI/lm-evaluation-harness
[2]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets
[3]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
[4]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml
[5]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md