Using lm-eval
This document guides you through accuracy testing with lm-eval.
Online Server
1. Start the vLLM server
You can run a Docker container to start the vLLM server on a single NPU:
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
The vLLM server has started successfully if you see logs like the following:
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
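If you would rather script the readiness check than watch the logs, you can poll vLLM's `/health` endpoint. A minimal sketch using only the Python standard library:

```python
# Sketch: poll vLLM's /health endpoint until the server is ready.
import time
import urllib.request

while True:
    try:
        with urllib.request.urlopen("http://localhost:8000/health") as resp:
            if resp.status == 200:
                break
    except OSError:
        # server not up yet (connection refused) or transient error
        pass
    time.sleep(5)
print("vLLM server is ready")
```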
2. Query the vLLM server with curl, then run lm-eval on GSM8K for accuracy testing
You can first send a sample prompt to verify that the server responds correctly:
PROMPT='<|im_start|>system
You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
<|im_start|>user
Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
Non-current assets: Net fixed assets 12 million yuan
Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
Non-current liabilities: Long-term loans 9 million yuan
Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
Options:
A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
<|im_start|>assistant
'
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg model "Qwen/Qwen2.5-0.5B-Instruct" \
--arg prompt "$PROMPT" \
'{
model: $model,
prompt: $prompt,
max_tokens: 1,
temperature: 0,
stop: ["<|im_end|>"]
}')" | python3 -m json.tool
The output format matches the following:
{
"id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
"object": "text_completion",
"created": 1754475138,
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"choices": [
{
"index": 0,
"text": "A",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 252,
"total_tokens": 253,
"completion_tokens": 1,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
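As a cross-check, the same query can be sent with the OpenAI Python client (`pip install openai`). This sketch assumes you exported the `PROMPT` variable from the shell example above (`export PROMPT`); the API key is a placeholder, since the local server does not validate it:

```python
# Sketch: query the local vLLM server with the OpenAI Python client.
import os

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    prompt=os.environ["PROMPT"],  # reuse the exported shell variable above
    max_tokens=1,
    temperature=0,
    stop=["<|im_end|>"],
)
print(completion.choices[0].text)  # expected: a single option letter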
Install lm-eval in the container:
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install "lm-eval[api]"
:::{note}
The Docker container is launched with VLLM_USE_MODELSCOPE=True, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting USE_MODELSCOPE_HUB=0 disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::
Run the following command:
# Only test gsm8k dataset in this demo
lm_eval \
--model local-completions \
--model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
After about 30 minutes, the output is shown below.
The results in markdown format are as follows:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3215|± |0.0129|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
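The same numbers are also written as JSON under `--output_path`. A sketch for loading them programmatically; the file layout assumed here (a `results_<timestamp>.json` file under a per-model subdirectory) reflects recent lm-eval versions:

```python
# Sketch: load the most recent results JSON written by lm-eval.
import glob
import json

paths = sorted(glob.glob("./**/results_*.json", recursive=True))
with open(paths[-1]) as f:
    results = json.load(f)

for task, metrics in results["results"].items():
    print(task, metrics)
```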
Offline Server
1. Run docker container
You can run a Docker container on a single NPU:
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
2. Run GSM8K using lm-eval for accuracy testing
Install lm-eval in the container:
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval
:::{note}
The Docker container is launched with VLLM_USE_MODELSCOPE=True, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting USE_MODELSCOPE_HUB=0 disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::
Run the following command:
# Only test gsm8k dataset in this demo
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks gsm8k \
--batch_size auto
After 1 to 2 minutes, the output is shown below.
The results in markdown format are as follows:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3412|± |0.0131|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
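If you prefer driving the evaluation from Python instead of the CLI, lm-eval exposes a `simple_evaluate` API. A sketch equivalent to the command above:

```python
# Sketch: offline GSM8K evaluation via lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096",
    tasks=["gsm8k"],
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```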
Use Offline Datasets
The following takes GSM8K (a single dataset) and MMLU (a multi-subject dataset) as examples; you can find more in using-local-datasets.
# set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# gsm8k yaml path
cd lm_eval/tasks/gsm8k
# mmlu yaml path
cd lm_eval/tasks/mmlu/default
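If you have not downloaded GSM8K yet, one way to fetch it and save local copies is with the `datasets` library. This sketch writes parquet files (the paths are illustrative); if you go this route, set `dataset_path: parquet` and point `data_files` at the `.parquet` files instead of the `.arrow` files shown in the YAML below:

```python
# Sketch: download GSM8K once and save local parquet copies.
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main")
ds["train"].to_parquet("/root/.cache/gsm8k/gsm8k-train.parquet")
ds["test"].to_parquet("/root/.cache/gsm8k/gsm8k-test.parquet")
```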
Set gsm8k.yaml as follows:
tag:
- math_word_problems
task: gsm8k
# set dataset_path arrow or json or parquet according to the downloaded dataset
dataset_path: arrow
# set dataset_name to null
dataset_name: null
output_type: generate_until
# add dataset_kwargs
dataset_kwargs:
data_files:
# train and test data download path
train: /root/.cache/gsm8k/gsm8k-train.arrow
test: /root/.cache/gsm8k/gsm8k-test.arrow
training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}
A(Please follow the summarized result at the end with the format of "The answer is xxx", where xxx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
- "(?s).*#### "
- "\\.$"
generation_kwargs:
until:
- "Question:"
- "</s>"
- "<|im_end|>"
do_sample: false
temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
- name: "strict-match"
filter:
- function: "regex"
regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
- function: "take_first"
- name: "flexible-extract"
filter:
- function: "regex"
group_select: -1
regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
- function: "take_first"
metadata:
version: 3.0
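You can sanity-check that the offline files load the same way lm-eval will, that is, through the `arrow` packaged loader with the `data_files` mapping from the YAML above:

```python
# Sketch: load the offline GSM8K files exactly as gsm8k.yaml describes them.
from datasets import load_dataset

ds = load_dataset(
    "arrow",
    data_files={
        "train": "/root/.cache/gsm8k/gsm8k-train.arrow",
        "test": "/root/.cache/gsm8k/gsm8k-test.arrow",
    },
)
print(ds["test"][0]["question"])
```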
Set _default_template_yaml as follows:
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
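Before running, you can check that the local MMLU copy resolves. This sketch assumes the local directory mirrors the hub dataset layout with per-subject configs; "anatomy" is an arbitrary example subject:

```python
# Sketch: verify one MMLU subject loads from the local path in the template.
from datasets import load_dataset

ds = load_dataset("/root/.cache/mmlu", "anatomy", trust_remote_code=True)
print(ds["test"][0]["question"])
```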
You can find more usage details in the lm-eval docs.