What this PR does / why we need it? This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow. Does this PR introduce any user-facing change? No, this PR contains documentation-only updates. How was this patch tested? The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
9.4 KiB
Using lm-eval
This document guides you to conduct accuracy testing using lm-eval.
Online Server
1. Start the vLLM server
You can run docker container to start the vLLM server on a single NPU:
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
The vLLM server is started successfully, if you see logs as below:
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
2. Run GSM8K using lm-eval for accuracy testing
You can query the result with input prompts:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\
"<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\
" Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\
" Non-current assets: Net fixed assets 12 million yuan\n"\
" Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\
" Non-current liabilities: Long-term loans 9 million yuan\n"\
" Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\
"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\
"Options:\n"\
"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\
"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
"<|im_start|>assistant\n"'",
"max_completion_tokens": 1,
"temperature": 0,
"stop": ["<|im_end|>"]
}' | python3 -m json.tool
The output format matches the following:
{
"id": "cmpl-2f678e8bdf5a4b209a3f2c1fa5832e25",
"object": "text_completion",
"created": 1754475138,
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"choices": [
{
"index": 0,
"text": "A",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 252,
"total_tokens": 253,
"completion_tokens": 1,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
Install lm-eval in the container:
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval[api]
:::{note}
The Docker container is launched with VLLM_USE_MODELSCOPE=True, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting USE_MODELSCOPE_HUB=0 disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::
Run the following command:
# Only test gsm8k dataset in this demo
lm_eval \
--model local-completions \
--model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
After 30 minutes, the output is as shown below:
The markdown format results is as below:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3215|± |0.0129|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
Offline Server
1. Run docker container
You can run docker container on a single NPU:
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
2. Run GSM8K using lm-eval for accuracy testing
Install lm-eval in the container:
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=0
pip install lm-eval
:::{note}
The Docker container is launched with VLLM_USE_MODELSCOPE=True, which may
cause lm-eval to download datasets from ModelScope instead of HuggingFace.
Setting USE_MODELSCOPE_HUB=0 disables this behavior so that lm-eval can
fetch datasets from HuggingFace correctly.
:::
Run the following command:
# Only test gsm8k dataset in this demo
lm_eval \
--model vllm \
--model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct,max_model_len=4096 \
--tasks gsm8k \
--batch_size auto
After 1 to 2 minutes, the output is shown below:
The markdown format results is as below:
Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3412|± |0.0131|
|gsm8k| 3|strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
Use Offline Datasets
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from using-local-datasets.
# set HF_DATASETS_OFFLINE when using offline datasets
export HF_DATASETS_OFFLINE=1
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
# gsm8k yaml path
cd lm_eval/tasks/gsm8k
# mmlu yaml path
cd lm_eval/tasks/mmlu/default
Set gsm8k.yaml as follows:
tag:
- math_word_problems
task: gsm8k
# set dataset_path arrow or json or parquet according to the downloaded dataset
dataset_path: arrow
# set dataset_name to null
dataset_name: null
output_type: generate_until
# add dataset_kwargs
dataset_kwargs:
data_files:
# train and test data download path
train: /root/.cache/gsm8k/gsm8k-train.arrow
test: /root/.cache/gsm8k/gsm8k-test.arrow
training_split: train
fewshot_split: train
test_split: test
doc_to_text: 'Q: {{question}}
A(Please follow the summarized result at the end with the format of "The answer is xxx", where xx is the result.):'
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
- "(?s).*#### "
- "\\.$"
generation_kwargs:
until:
- "Question:"
- "</s>"
- "<|im_end|>"
do_sample: false
temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
- name: "strict-match"
filter:
- function: "regex"
regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
- function: "take_first"
- name: "flexible-extract"
filter:
- function: "regex"
group_select: -1
regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
- function: "take_first"
metadata:
version: 3.0
Set _default_template_yaml as follows:
# set dataset_path according to the downloaded dataset
dataset_path: /root/.cache/mmlu
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
You can see more usage on Lm-eval Docs.