# Using lm-eval
This document describes how to run accuracy tests using [lm-eval][1].
## Online Server
### 1. Start the vLLM server
You can run a Docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```
The vLLM server has started successfully if you see logs like the following:
```shell
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run GSM8K using lm-eval for accuracy testing
You can query the server with an input prompt to verify it works:
```shell
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "'"< |im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).< |im_end|>\n"\
"< |im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
" Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
" Non-current assets: Net fixed assets 12 million yuan\n"\
" Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
" Non-current liabilities: Long-term loans 9 million yuan\n"\
" Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
"Options:\n"\
"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50< |im_end|>\n"\
"< |im_start|>assistant\n"'",
"max_completion_tokens": 1,
"temperature": 0,
"stop": ["< |im_end|>"]
}' | python3 -m json.tool
```
The output format matches the following:
```json
{
"id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
"object": "text_completion",
"created": 1754475138,
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"choices": [
{
"index": 0,
"text": "A",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 252,
"total_tokens": 253,
"completion_tokens": 1,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
```
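In scripted checks you usually only need `choices[0].text` and the token usage. A minimal sketch of pulling those out of a captured response, using a payload trimmed from the JSON above as sample data:

```python
import json

# Sample payload trimmed from the response shown above
raw = """
{
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [{"index": 0, "text": "A", "finish_reason": "length"}],
  "usage": {"prompt_tokens": 252, "total_tokens": 253, "completion_tokens": 1}
}
"""

resp = json.loads(raw)
answer = resp["choices"][0]["text"].strip()  # the option letter the model chose
used = resp["usage"]["completion_tokens"]    # should be 1 given max_completion_tokens=1

print(answer, used)  # A 1
```

Note that `finish_reason` is `"length"` here only because `max_completion_tokens` was set to 1; a normal completion would stop at the `stop` string instead.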
Install lm-eval in the container:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval[api]
```
:::{note}
The Docker container is launched with `VLLM_USE_MODELSCOPE=True`, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting `USE_MODELSCOPE_HUB=0` disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::
Run the following command:
```shell
# Only test gsm8k dataset in this demo
lm_eval \
--model local-completions \
--model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
After about 30 minutes, the output looks like the following:
```shell
The markdown format results are as below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3215|± |0.0129|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
```
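The `Stderr` column is the standard error of the accuracy estimate, so a rough 95% confidence interval is `Value ± 1.96 × Stderr`. For the flexible-extract row above:

```python
# 95% confidence interval for gsm8k flexible-extract: 0.3215 ± 1.96 * 0.0129
value, stderr = 0.3215, 0.0129
low, high = value - 1.96 * stderr, value + 1.96 * stderr
print(f"accuracy = {value:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
# accuracy = 0.3215, 95% CI [0.2962, 0.3468]
```

This interval is what makes small run-to-run differences on GSM8K unremarkable.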
## Offline Server
### 1. Run docker container
You can run a Docker container on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
### 2. Run GSM8K using lm-eval for accuracy testing
Install lm-eval in the container:
```bash
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval
```
:::{note}
The Docker container is launched with `VLLM_USE_MODELSCOPE=True`, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting `USE_MODELSCOPE_HUB=0` disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::
Run the following command:
```shell
# Only test gsm8k dataset in this demo
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks gsm8k \
--batch_size auto
```
After 1 to 2 minutes, the output is shown below:
```shell
The markdown format results are as below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3412|± |0.0131|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
```
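The online (0.3215) and offline (0.3412) flexible-extract scores look different, but the stderr columns show the gap is within noise. A quick two-sample z check (standard formula, not part of lm-eval):

```python
import math

# flexible-extract accuracy and stderr from the two runs in this document
online, se_online = 0.3215, 0.0129
offline, se_offline = 0.3412, 0.0131

# z-score for the difference between two independent estimates
z = (offline - online) / math.sqrt(se_online**2 + se_offline**2)
print(f"z = {z:.2f}")  # well below 1.96, so the gap is not statistically significant
```

Small gaps like this are expected between serving paths; only differences of several standard errors warrant investigation.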
## Use Offline Datasets
Take GSM8K (a single dataset) and MMLU (a multi-subject dataset) as examples; you can see more at [using-local-datasets][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# gsm8k yaml path
cd lm_eval/tasks/gsm8k
# mmlu yaml path
cd lm_eval/tasks/mmlu/default
```
2025-10-29 11:03:39 +08:00
Set [gsm8k.yaml][3] as follows:
```yaml
tag:
- math_word_problems
task: gsm8k
# set dataset_path arrow or json or parquet according to the downloaded dataset
dataset_path: arrow
# set dataset_name to null
dataset_name: null
output_type: generate_until
# add dataset_kwargs
dataset_kwargs:
data_files:
# train and test data download path
train: /root/.cache/gsm8k/gsm8k-train.arrow
test: /root/.cache/gsm8k/gsm8k-test.arrow
training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}

  A(Please follow the summarized result at the end with the format of "The answer is xxx", where xxx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
- "(?s).*#### "
- "\\.$"
generation_kwargs:
until:
- "Question:"
- "</ s > "
- "< |im_end|>"
do_sample: false
temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
- name: "strict-match"
filter:
- function: "regex"
regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
- function: "take_first"
- name: "flexible-extract"
filter:
- function: "regex"
group_select: -1
regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
- function: "take_first"
metadata:
version: 3.0
```
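The two `filter_list` entries above differ only in how aggressively they extract the answer from the model's completion. You can reproduce their behavior with plain `re` (the sample completion text below is made up):

```python
import re

completion = "She sold 48 + 24 = 72 clips.\n#### 72"

# strict-match: only accept the canonical "#### <number>" form
strict = re.search(r"#### (\-?[0-9\.\,]+)", completion)
print(strict.group(1))  # 72

# flexible-extract: take the last number-like token anywhere in the text
matches = re.findall(r"(-?[$0-9.,]{2,})|(-?[0-9]+)", completion)
last = next(g for g in matches[-1] if g)  # group_select: -1, first non-empty group
print(last)  # 72
```

This is why the flexible-extract score is usually higher: it credits answers that appear in the reasoning even when the model never emits the `####` marker.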
Set [_default_template_yaml][4] as follows:
```yaml
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
```
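The `doc_to_text` field above is a Jinja2 template; the prompt it produces for one MMLU-style question looks like this (rendered here with a plain f-string for illustration, with a made-up sample document):

```python
# A made-up MMLU-style document with the fields the template expects
doc = {
    "question": "Which entry records an increase in assets?  ",
    "choices": ["A debit entry", "A credit entry", "A closing entry", "A reversing entry"],
    "answer": 0,
}

# Same shape as the doc_to_text template: stripped question, lettered choices, "Answer:"
prompt = (
    f"{doc['question'].strip()}\n"
    f"A. {doc['choices'][0]}\n"
    f"B. {doc['choices'][1]}\n"
    f"C. {doc['choices'][2]}\n"
    f"D. {doc['choices'][3]}\n"
    "Answer:"
)
print(prompt)
```

With `output_type: multiple_choice`, lm-eval does not generate free text from this prompt; it scores the log-likelihood of each letter in `doc_to_choice` and picks the most likely one.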
You can find more usage details in the [lm-eval docs][5].
[1]: https://github.com/EleutherAI/lm-evaluation-harness
[2]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets
[3]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml
[4]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml
[5]: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/README.md