# Overall accuracy test

## EvalScope

### 1. Download and install

EvalScope is used from a Python environment and can be installed either via pip or from source:

```shell
# Install via pip
pip install evalscope[perf] -U

# Install from source
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'
```

### 2. Dataset preparation script

```python
from evalscope.collections import CollectionSchema, DatasetInfo, WeightedSampler
from evalscope.utils.io_utils import dump_jsonl_data
import os

schema = CollectionSchema(
    name="VL-Test",
    datasets=[
        CollectionSchema(
            name="PureText",
            weight=1,
            datasets=[
                DatasetInfo(
                    name="mmlu_pro",
                    weight=1,
                    task_type="exam",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="ifeval",
                    weight=1,
                    task_type="instruction",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="gsm8k",
                    weight=1,
                    task_type="math",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
            ],
        ),
        CollectionSchema(
            name="Vision",
            weight=2,
            datasets=[
                DatasetInfo(
                    name="math_vista",
                    weight=1,
                    task_type="math",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="mmmu_pro",
                    weight=1,
                    task_type="exam",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
            ],
        ),
    ],
)

# Sample 1000 items from the schema according to the dataset weights
mixed_data = WeightedSampler(schema).sample(1000)

# Make sure the output directory exists, then dump the mixed data to a JSONL file
output_path = "outputs/vl_test.jsonl"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
dump_jsonl_data(mixed_data, output_path)
```

Dataset composition visualization:

```text
┌───────────────────────────────────────┐
│       VL-Test (1000 samples)          │
├─────────────────┬─────────────────────┤
│   PureText      │      Vision         │
│  (333 samples)  │    (667 samples)    │
├─────────────────┼─────────────────────┤
│ • mmlu_pro      │ • math_vista        │
│ • ifeval        │ • mmmu_pro          │
│ • gsm8k         │                     │
└─────────────────┴─────────────────────┘
```
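The 333/667 split above follows directly from the weights: they are normalized at each level, so `Vision` (weight 2) gets roughly twice as many samples as `PureText` (weight 1), and each child dataset splits its group's share evenly. A minimal sketch of that arithmetic (plain Python with floor rounding; `WeightedSampler`'s actual rounding may distribute the remainder slightly differently, e.g. 667 rather than 666):

```python
def allocate(total, weights):
    """Split `total` samples proportionally to `weights` (rounding down)."""
    s = sum(weights)
    return [int(total * w / s) for w in weights]

# Top level: PureText (weight 1) vs Vision (weight 2)
groups = allocate(1000, [1, 2])
# Within each group, the children all have weight 1
pure_text = allocate(groups[0], [1, 1, 1])  # mmlu_pro, ifeval, gsm8k
vision = allocate(groups[1], [1, 1])        # math_vista, mmmu_pro

print(groups)     # [333, 666]
print(pure_text)  # [111, 111, 111]
print(vision)     # [333, 333]
```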

### 3. Test

```python
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

task_cfg = TaskConfig(
    model="Qwen2.5-VL-7B-Instruct",
    api_url="http://localhost:8804/v1",
    api_key="EMPTY",
    eval_type=EvalType.SERVICE,
    datasets=[
        "data_collection",
    ],
    dataset_args={
        "data_collection": {
            "local_path": "../outputs/vl_test.jsonl",
        }
    },
    eval_batch_size=5,
    generation_config={
        "max_tokens": 30000,  # Set large enough to avoid output truncation
        "temperature": 0.6,  # Sampling temperature (recommended in the Qwen report)
        "top_p": 0.95,  # Top-p sampling (recommended in the Qwen report)
        "top_k": 20,  # Top-k sampling (recommended in the Qwen report)
        "n": 1,  # Number of responses generated per request
        "repetition_penalty": 1.0,  # 1.0 = disabled; >1.0 penalizes repetition
    },
)

run_task(task_cfg=task_cfg)
```

Parameter tuning guide:

| Parameter | Current value | Effect | Adjustment suggestions |
|---|---|---|---|
| `temperature` | 0.6 | Controls output diversity | Lower to ~0.3 for math problems; raise to ~0.9 for creative writing |
| `top_p` | 0.95 | Filters out low-probability tokens | Lower it to reduce incoherent output |
| `eval_batch_size` | 5 | Number of requests processed in parallel | Can be increased to 10 if GPU memory allows |
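Temperature works by rescaling the model's logits before the softmax that produces sampling probabilities, which is why lowering it helps tasks with a single right answer. A small stand-alone illustration (plain Python, not part of EvalScope; the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate tokens
low = softmax_with_temperature(logits, 0.3)   # near-greedy: top token dominates
mid = softmax_with_temperature(logits, 0.6)   # the guide's default
high = softmax_with_temperature(logits, 0.9)  # flatter, more diverse sampling

print([round(p, 3) for p in low])
print([round(p, 3) for p in mid])
print([round(p, 3) for p in high])
```

The probability mass on the top token shrinks as temperature rises, trading determinism for diversity.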

Run the test:

```shell
#!/bin/bash
# Step 1: Set the log file path
LOG_FILE="accuracy_$(date +%Y%m%d_%H%M).log"

# Step 2: Run the Python script and capture all output.
# "2>&1" redirects standard error (fd 2) into standard output (fd 1),
# so error messages end up in the log as well.
python accuracy.py 2>&1 | tee "$LOG_FILE"

# Step 3: Check the execution status.
# ${PIPESTATUS[0]} holds the exit code of the first command in the
# pipeline (python); $? alone would report tee's exit code.
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -eq 0 ]; then
    echo "✅ Evaluation finished! Log saved to: $LOG_FILE"
else
    echo "❌ Evaluation failed! Exit code: $EXIT_CODE. See the log: $LOG_FILE"
fi
```
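The `${PIPESTATUS[0]}` detail matters because in a pipeline `$?` only reflects the last command, so a failing `python` run masked by a successful `tee` would look like success. A quick demonstration (bash-specific; `PIPESTATUS` is not POSIX):

```shell
#!/bin/bash
# `false` fails (exit 1), but `tee` succeeds, so $? is misleading.
false | tee /dev/null
echo "last command status: $?"                  # tee's status: 0
false | tee /dev/null
echo "first command status: ${PIPESTATUS[0]}"   # false's status: 1
```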

### 4. Common problem fixes

#### 4.1 Missing NLTK resource

Error message: `Resource punkt_tab not found.`

Solution (save as `fix_nltk.py`):

```python
import nltk
import os

# Step 1: Choose a writable download directory
download_dir = "/workspace/myenv/nltk_data"
os.makedirs(download_dir, exist_ok=True)

# Step 2: Add it to the NLTK data search path
nltk.data.path.append(download_dir)

# Step 3: Download the required resource
print("🔽 Downloading the punkt_tab resource...")
try:
    nltk.download("punkt_tab", download_dir=download_dir)
    print("✅ Download succeeded!")
except Exception as e:
    print(f"❌ Download failed: {e}")
    print("💡 Fallback: download manually from GitHub")
    print(
        "   URL: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip"
    )
```

Apply the fix:

```shell
# Activate the environment
source /workspace/myenv/bin/activate

# Run the repair script
python fix_nltk.py

# Rerun the test
bash run_accuracy_test.sh
```

### 5. Results display

```text
+-------------+---------------------+--------------+---------------+-------+
|  task_type  |       metric        | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
|    exam     |         acc         |   mmmu_pro   |     0.521     |  334  |
|    math     |         acc         |  math_vista  |    0.6066     |  333  |
|    exam     |         acc         |   mmlu_pro   |    0.5405     |  111  |
| instruction | prompt_level_strict |    ifeval    |    0.6937     |  111  |
|    math     |         acc         |    gsm8k     |    0.8288     |  111  |
+-------------+---------------------+--------------+---------------+-------+
```
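If a single headline number is wanted, the per-dataset scores can be rolled up into a count-weighted average. This is only an illustration of the arithmetic, not an official EvalScope metric (and it mixes `acc` with ifeval's `prompt_level_strict`):

```python
# (score, sample count) taken from the results table above
results = {
    "mmmu_pro":   (0.5210, 334),
    "math_vista": (0.6066, 333),
    "mmlu_pro":   (0.5405, 111),
    "ifeval":     (0.6937, 111),
    "gsm8k":      (0.8288, 111),
}

total = sum(count for _, count in results.values())
overall = sum(score * count for score, count in results.values()) / total
print(f"overall: {overall:.4f} over {total} samples")  # overall: 0.6050 over 1000 samples
```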