Files

Xinyu Dong 7be26ca617 [Bugs] Fix Docs Build Problem (#97 )

* [Bugs] Docs fixed

* Update contributing.md

* Update index.md

* fix lua to text

* fix title size

2026-01-10 05:55:40 +08:00

7.8 KiB

Raw Permalink Blame History

Overall accuracy test

EvalScope

1.Download and install

EvalScope supports use in Python environments. Users can install EvalScope via pip or from source code. Here are examples of both installation methods:

#pip
pip install evalscope[perf] -U
#git
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'

2.Dataset preparation script

from evalscope.collections import CollectionSchema, DatasetInfo, WeightedSampler
from evalscope.utils.io_utils import dump_jsonl_data
import os  # Step 1: Import the os module

schema = CollectionSchema(
    name="VL-Test",
    datasets=[
        CollectionSchema(
            name="PureText",
            weight=1,
            datasets=[
                DatasetInfo(
                    name="mmlu_pro",
                    weight=1,
                    task_type="exam",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="ifeval",
                    weight=1,
                    task_type="instruction",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="gsm8k",
                    weight=1,
                    task_type="math",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
            ],
        ),
        CollectionSchema(
            name="Vision",
            weight=2,
            datasets=[
                DatasetInfo(
                    name="math_vista",
                    weight=1,
                    task_type="math",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="mmmu_pro",
                    weight=1,
                    task_type="exam",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
            ],
        ),
    ],
)


# get the mixed data
mixed_data = WeightedSampler(schema).sample(1000)

output_path = "outputs/vl_test.jsonl"  # Step 2: Define the output file path
output_dir = os.path.dirname(output_path)  # Step 3: Obtain the directory name
if not os.path.exists(output_dir):  # Step 4: Check if the directory exists
    os.makedirs(output_dir, exist_ok=True)  # Step 5: Automatically create directories


# dump the mixed data to a jsonl file
dump_jsonl_data(mixed_data, output_path)  # Step 6: Securely write to the file

Dataset composition visualization:

┌───────────────────────────────────────┐
│       VL-Test (1000 samples)          │
├─────────────────┬─────────────────────┤
│   PureText      │      Vision         │
│   (333 samples) │    (667 samples)    │
├─────────────────┼─────────────────────┤
│ • mmlu_pro      │ • math_vista        │
│ • ifeval        │ • mmmu_pro          │
│ • gsm8k         │                     │
└─────────────────┴─────────────────────┘

3.Test

from dotenv import dotenv_values

from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

task_cfg = TaskConfig(
    model="Qwen2.5-VL-7B-Instruct",
    api_url="http://localhost:8804/v1",
    api_key="EMPTY",
    eval_type=EvalType.SERVICE,
    datasets=[
        "data_collection",
    ],
    dataset_args={
        "data_collection": {
            "local_path": "../outputs/vl_test.jsonl",
        }
    },
    eval_batch_size=5,
    generation_config={
        "max_tokens": 30000,  # The maximum number of tokens that can be generated should be set to a large value to avoid output truncation.
        "temperature": 0.6,  # Sampling temperature (recommended value from qwen report)
        "top_p": 0.95,  # top-p sampling (recommended value from qwen report)
        "top_k": 20,  # Top-k sampling (recommended value from qwen report)
        "n": 1,  # Number of responses generated per request
        "repetition_penalty": 1.0,  # 1.0 = Penalty disabled, >1.0 = Penalty repeated.
    },
)

run_task(task_cfg=task_cfg)

Parameter Tuning Guide:

Parameter	Current value	Effect	Adjustment suggestions
`temperature`	0.6	Control output diversity	Math problems ↓ 0.3 / Creative writing ↑ 0.9
`top_p`	0.95	Filtering low-probability tokens	Reduce "nonsense"
`eval_batch_size`	5	Number of requests processed in parallel	With sufficient video memory, it can be increased to 10.

Run the test:

#!/bin/bash
# ========================================
# Step 1: Set the log file path
# ========================================
LOG_FILE="accuracy_$(date +%Y%m%d_%H%M).log"

# ========================================
# Step 2: Execute the Python script and capture all output
# Meaning of 2>&1:
# - 2 represents standard error output (stderr)
# ->& represents redirection and merging
# - 1 represents standard output (stdout)
# Function: Merges error messages into standard output as well.
# ========================================
python accuracy.py 2>&1 | tee "$LOG_FILE"

# ========================================
# Step 3: Check Execution Status
# ${PIPESTATUS[0]} Get the exit code of the first command (Python) in the pipeline
# ========================================
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -eq 0 ]; then
    echo "✅ Evaluation completed! Log saved to: $LOG_FILE"
else
    echo "❌ Evaluation failed! Exit code: $EXIT_CODE Please check the log: $LOG_FILE"
fi

4.Common problem fixes

4.1 NLTK resource missing fix

Resource punkt_tab not found.

Solution：

import nltk
import os

# Step 1: Set the download path (select a writable directory)
download_dir = "/workspace/myenv/nltk_data"
os.makedirs(download_dir, exist_ok=True)

# Step 2: Configure NLTK data path
nltk.data.path.append(download_dir)

# Step 3: Download necessary resources
print("🔽 Start downloading punkt_tab resource...")
try:
    nltk.download("punkt_tab", download_dir=download_dir)
    print("✅ Download successful!")
except Exception as e:
    print(f"❌ Download failed: {e}")
    print("💡 Alternative: Download manually from GitHub")
    print(
        "   URL: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip"
    )

repair:

# Activate environment
source /workspace/myenv/bin/activate

# Run the repair script
python fix_nltk.py

# Rerun the test
bash run_accuracy_test.sh

5.Results Display

+-------------+---------------------+--------------+---------------+-------+
|  task_type  |       metric        | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
|    exam     |         acc         |   mmmu_pro   |     0.521     |  334  |
|    math     |         acc         |  math_vista  |    0.6066     |  333  |
|    exam     |         acc         |   mmlu_pro   |    0.5405     |  111  |
| instruction | prompt_level_strict |    ifeval    |    0.6937     |  111  |
|    math     |         acc         |    gsm8k     |    0.8288     |  111  |
+-------------+---------------------+--------------+---------------+-------+

7.8 KiB Raw Permalink Blame History Unescape Escape