Initialize the project; model provided by the ModelHub XC community
Model: OdaxAI/DANTE-Mosaic-3.5B | Source: Original Platform

evaluation/README.md (new file, 163 lines)
@@ -0,0 +1,163 @@
# DANTE-Mosaic-3.5B — Canonical Evaluation Suite

This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.

---

## Evaluation types

| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode harness on the full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | **No** |

These two types must **never be mixed** in the same table or chart.

---

## Canonical benchmarks (lm-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |

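Each table row corresponds to one `lm_eval` call inside `slurm_lm_eval.slurm`. As a minimal sketch of what a single canonical run looks like outside SLURM (assuming `lm-eval` 0.4.5 is installed and the checkpoint is reachable; the output filename here is only an example), the GSM8K row maps to:

```bash
# Single canonical task run (GSM8K row above), mirroring the run_task helper
# in slurm_lm_eval.slurm. The output path is an example; the SLURM script
# appends a timestamp instead.
lm_eval \
  --model hf \
  --model_args "pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16,trust_remote_code=True" \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8 \
  --seed 42 \
  --output_path evaluation/results/canonical/gsm8k_manual.json \
  --log_samples
```
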
## Canonical benchmarks (bigcode-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |

---

## How to run on Leonardo Booster

### 1. Copy the repo to Leonardo

```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
    nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```

### 2. Submit all jobs with one command

```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```

This submits two SLURM jobs:

- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)

### 3. Monitor jobs

```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```

### 4. Parse results after completion

```bash
python3 evaluation/parse_canonical_results.py
```

This produces:

- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON (a sketch for reading this file follows below)
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- A printed summary table in the terminal

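To pull a single score out of `CANONICAL_SUMMARY.json` (for a plot or a model card), here is a minimal sketch, assuming the schema written by `write_summary()` in `parse_canonical_results.py`: a `results` dict keyed by task name, with `score` already expressed as a percentage. The `gsm8k` key is only an example task:

```bash
python3 - <<'PY'
import json

# Read one benchmark score from the parsed summary.
# The results.<task>.{label,score,metric,fewshot} layout matches
# write_summary() in parse_canonical_results.py; gsm8k is an example task key.
with open("evaluation/results/canonical/CANONICAL_SUMMARY.json") as f:
    summary = json.load(f)

entry = summary["results"]["gsm8k"]
print(f"{entry['label']}: {entry['score']}% ({entry['metric']}, {entry['fewshot']}-shot)")
PY
```
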
### 5. Transfer results to your Mac

```bash
rsync -av nsavioli@login.leonardo.cineca.it:\
/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
    evaluation/results/canonical/
```

---

## Decoding and grading setup

All lm-eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |

All code eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |

---

## Environment variables

Both SLURM scripts set the following:

```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```

---

## Results folder structure

```
evaluation/results/canonical/
    mmlu_<timestamp>.json            # lm-eval output (raw)
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json           # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md          # provenance table (generated by parse_canonical_results.py)
```

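Because every run is timestamped, several result files for the same task can accumulate. `parse_canonical_results.py` always takes the newest one (the lexicographically last match, which works because of the `%Y%m%d_%H%M%S` stamp). The same selection can be sketched in the shell, with `gsm8k` as an example task:

```bash
# Latest canonical result file for one task (gsm8k used as an example).
# Timestamps sort lexicographically, so the last sorted match is the newest run.
latest=$(ls evaluation/results/canonical/gsm8k_*.json 2>/dev/null | sort | tail -n 1)
echo "Latest gsm8k result: ${latest:-none found}"
```
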
---

## Provenance table template

Once canonical results are available, `parse_canonical_results.py` will generate this table automatically.
The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot, output JSON path, date, and hardware.

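As an illustration only, one row of that table would look like the following, with placeholders where real run outputs would go (the column layout matches `write_provenance()` in `parse_canonical_results.py`):

```
| Benchmark | Harness | Task name | N | Few-shot | Metric | Score | Date | Source JSON |
|-----------|---------|-----------|---|----------|--------|-------|------|-------------|
| GSM8K | lm-evaluation-harness 0.4.5 | `gsm8k` | 1319 | 8-shot | `exact_match,strict-match` | **<score>%** | <YYYY-MM-DD> | `gsm8k_<timestamp>.json` |
```
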
---

## Important: do not mix result types

Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25)
and canonical scores (from lm-eval / bigcode harness, full splits) use different
protocols and must never appear in the same table without an explicit separator and
label explaining the difference.

The parser script enforces this by saving canonical results only to `results/canonical/`
and printing a clear header.
evaluation/parse_canonical_results.py (new file, 212 lines)
@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""
parse_canonical_results.py
==========================
Reads all lm-eval and bigcode JSON outputs from evaluation/results/canonical/
and produces:
  1. A clean summary table printed to stdout
  2. evaluation/results/canonical/CANONICAL_SUMMARY.json
  3. evaluation/results/canonical/CANONICAL_PROVENANCE.md

Usage:
    python3 evaluation/parse_canonical_results.py
    python3 evaluation/parse_canonical_results.py --results-dir evaluation/results/canonical
"""

from __future__ import annotations

import argparse
import json
import glob
import os
import sys
from datetime import datetime

# ─── Metric keys per task ────────────────────────────────────────────────────
TASK_META = {
    "mmlu": {"label": "MMLU", "metric": "acc,none", "fewshot": 5, "n": 14042},
    "mmlu_pro": {"label": "MMLU-Pro", "metric": "acc,none", "fewshot": 5, "n": 4500},
    "gsm8k": {"label": "GSM8K", "metric": "exact_match,strict-match", "fewshot": 8, "n": 1319},
    "arc_challenge": {"label": "ARC-Challenge", "metric": "acc_norm,none", "fewshot": 25, "n": 1172},
    "hellaswag": {"label": "HellaSwag", "metric": "acc_norm,none", "fewshot": 10, "n": 10042},
    "truthfulqa_mc2": {"label": "TruthfulQA MC2", "metric": "mc2,none", "fewshot": 0, "n": 817},
    "winogrande": {"label": "Winogrande", "metric": "acc,none", "fewshot": 5, "n": 1267},
    "ifeval": {"label": "IFEval", "metric": "prompt_level_strict_acc,none", "fewshot": 0, "n": 541},
}

CODE_TASKS = {
    "humaneval": {"label": "HumanEval", "metric": "pass@1", "fewshot": 0, "n": 164},
    "mbpp": {"label": "MBPP", "metric": "pass@1", "fewshot": 0, "n": 374},
}


def find_latest(pattern: str) -> str | None:
    files = sorted(glob.glob(pattern))
    return files[-1] if files else None


def parse_lm_eval(results_dir: str) -> dict:
    scores = {}
    for task_key, meta in TASK_META.items():
        # Timestamps start with a digit; the [0-9] keeps the "mmlu" pattern
        # from also matching "mmlu_pro_<timestamp>.json" files.
        pattern = f"{results_dir}/{task_key}_[0-9]*.json"
        latest = find_latest(pattern)
        if not latest:
            continue
        try:
            with open(latest) as f:
                data = json.load(f)
            res = data.get("results", {})
            if task_key in res:
                raw = res[task_key].get(meta["metric"])
                if raw is not None:
                    mtime = os.path.getmtime(latest)
                    scores[task_key] = {
                        "label": meta["label"],
                        "score": round(raw * 100, 2),
                        "metric": meta["metric"],
                        "fewshot": meta["fewshot"],
                        "n": meta["n"],
                        "source": latest,
                        "date": datetime.fromtimestamp(mtime).strftime("%Y-%m-%d"),
                        "harness": "lm-evaluation-harness 0.4.5",
                        "model": data.get("config", {}).get("model_args", "OdaxAI/DANTE-Mosaic-3.5B"),
                        "dtype": "bfloat16",
                        "device": "NVIDIA A100-40GB",
                        "seed": 42,
                    }
        except Exception as e:
            print(f" [WARNING] parse error {latest}: {e}", file=sys.stderr)
    return scores


def parse_code_eval(results_dir: str) -> dict:
    scores = {}
    for task_key, meta in CODE_TASKS.items():
        # bigcode saves to subdir
        pattern = f"{results_dir}/{task_key}_*/{task_key}_metrics.json"
        latest = find_latest(pattern)
        if not latest:
            continue
        try:
            with open(latest) as f:
                data = json.load(f)
            raw = data.get("pass@1")
            if raw is not None:
                mtime = os.path.getmtime(latest)
                scores[task_key] = {
                    "label": meta["label"],
                    "score": round(raw * 100, 2),
                    "metric": "pass@1",
                    "fewshot": 0,
                    "n": meta["n"],
                    "source": latest,
                    "date": datetime.fromtimestamp(mtime).strftime("%Y-%m-%d"),
                    "harness": "bigcode-evaluation-harness",
                    "model": "OdaxAI/DANTE-Mosaic-3.5B",
                    "dtype": "bfloat16",
                    "device": "NVIDIA A100-40GB",
                    "seed": 0,
                }
        except Exception as e:
            print(f" [WARNING] parse error {latest}: {e}", file=sys.stderr)
    return scores


def print_table(all_scores: dict) -> None:
    SEP = "=" * 78
    print(f"\n{SEP}")
    print(" CANONICAL BENCHMARK RESULTS — OdaxAI/DANTE-Mosaic-3.5B")
    print(" All results produced by official evaluation harnesses on Leonardo HPC")
    print(f"{SEP}")
    print(f" {'Benchmark':<20} {'N':>6} {'Few-shot':>8} {'Metric':<20} {'Score':>8}")
    print(" " + "-" * 72)
    for k, v in sorted(all_scores.items(), key=lambda x: x[1]["label"]):
        print(f" {v['label']:<20} {v['n']:>6} {v['fewshot']:>8}-shot "
              f"{v['metric']:<20} {v['score']:>7.2f}%")
    print(f"{SEP}\n")


def write_summary(all_scores: dict, results_dir: str) -> None:
    out = {
        "model": "OdaxAI/DANTE-Mosaic-3.5B",
        "type": "REAL_CANONICAL_RUN",
        "harness": "lm-evaluation-harness 0.4.5 + bigcode-evaluation-harness",
        "hardware": "NVIDIA A100-40GB, BF16",
        "cluster": "CINECA Leonardo Booster",
        "seed": 42,
        "results": all_scores,
    }
    path = f"{results_dir}/CANONICAL_SUMMARY.json"
    with open(path, "w") as f:
        json.dump(out, f, indent=2)
    print(f" Summary JSON -> {path}")


def write_provenance(all_scores: dict, results_dir: str) -> None:
    lines = [
        "# Canonical Benchmark Provenance — DANTE-Mosaic-3.5B",
        "",
        "All results in this table are **REAL_CANONICAL_RUN** — produced by official",
        "evaluation harnesses on the `OdaxAI/DANTE-Mosaic-3.5B` checkpoint.",
        "They are directly comparable to published leaderboard scores.",
        "",
        "| Benchmark | Harness | Task name | N | Few-shot | Metric | Score | Date | Source JSON |",
        "|-----------|---------|-----------|---|----------|--------|-------|------|-------------|",
    ]
    for k, v in sorted(all_scores.items(), key=lambda x: x[1]["label"]):
        src = os.path.basename(v["source"])
        lines.append(
            f"| {v['label']} | {v['harness']} | `{k}` | {v['n']} | "
            f"{v['fewshot']}-shot | `{v['metric']}` | **{v['score']:.2f}%** | "
            f"{v['date']} | `{src}` |"
        )
    lines += [
        "",
        "## Hardware & Software",
        "",
        "| Property | Value |",
        "|----------|-------|",
        "| GPU | NVIDIA A100-SXM-40GB |",
        "| Precision | BF16 |",
        "| Cluster | CINECA Leonardo Booster |",
        "| lm-eval version | 0.4.5 |",
        "| Seed | 42 |",
        "",
        "## Comparability Note",
        "",
        "These canonical scores are produced under standard protocols and are directly",
        "comparable to published scores from the same harness versions.",
        "Internal offline subset scores (30/40/25 problems from `_benchmark_dante_offline.py`)",
        "are **separate** and must not be mixed with these canonical results.",
    ]
    path = f"{results_dir}/CANONICAL_PROVENANCE.md"
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    print(f" Provenance -> {path}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", default="evaluation/results/canonical")
    args = parser.parse_args()

    rd = args.results_dir
    print(f"\nParsing results from: {rd}")

    lm = parse_lm_eval(rd)
    code = parse_code_eval(rd)
    all_scores = {**lm, **code}

    if not all_scores:
        print("No canonical results found yet.")
        print("Run evaluation/slurm_lm_eval.slurm and slurm_code_eval.slurm on Leonardo first.")
        sys.exit(0)

    print_table(all_scores)
    write_summary(all_scores, rd)
    write_provenance(all_scores, rd)
    print("\nDone.\n")


if __name__ == "__main__":
    main()
evaluation/slurm_code_eval.slurm (new file, 131 lines)
@@ -0,0 +1,131 @@
#!/bin/bash
#SBATCH --job-name=dante_code_eval
#SBATCH --account=AIFAC_F02_254_0
#SBATCH --partition=boost_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=evaluation/slurm_logs/code_eval_%j.out
#SBATCH --error=evaluation/slurm_logs/code_eval_%j.err

# ============================================================================
# DANTE-Mosaic-3.5B — Canonical code evaluation
# Leonardo Booster, 1x A100-40GB
#
# Tasks: HumanEval (pass@1), MBPP (pass@1)
# Harness: bigcode-evaluation-harness
# All outputs -> evaluation/results/canonical/
#
# WARNING: HumanEval executes generated Python code.
# bigcode-evaluation-harness uses a sandboxed subprocess per sample.
# Do NOT run --allow_code_execution outside of a secure environment.
# Leonardo compute nodes are isolated — this is acceptable here.
# ============================================================================

set -euo pipefail

# ─── Environment ─────────────────────────────────────────────────────────────
module purge
module load cuda/12.4 python/3.11.7

export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false

# ─── Config ──────────────────────────────────────────────────────────────────
MODEL="OdaxAI/DANTE-Mosaic-3.5B"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS="evaluation/results/canonical"
BIGCODE_DIR="/leonardo_scratch/large/userexternal/nsavioli/bigcode-evaluation-harness"
mkdir -p "${RESULTS}"

echo "======================================="
echo "DANTE-Mosaic-3.5B — Canonical code eval"
echo "Model: ${MODEL}"
echo "Job ID: ${SLURM_JOB_ID}"
echo "Timestamp: ${TIMESTAMP}"
echo "======================================="

# ─── Install bigcode harness if not present ──────────────────────────────────
if [ ! -d "${BIGCODE_DIR}" ]; then
    echo "Cloning bigcode-evaluation-harness..."
    git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git \
        "${BIGCODE_DIR}"
    pip install --quiet -e "${BIGCODE_DIR}"
else
    echo "bigcode-evaluation-harness found at ${BIGCODE_DIR}"
    pip install --quiet -e "${BIGCODE_DIR}"
fi

# ─── HumanEval (pass@1, 164 problems, greedy, 0-shot) ────────────────────────
echo ""
echo ">>> HumanEval pass@1 (164 problems, greedy decoding)..."
HE_OUT="${RESULTS}/humaneval_${TIMESTAMP}"
mkdir -p "${HE_OUT}"

accelerate launch "${BIGCODE_DIR}/main.py" \
    --model "${MODEL}" \
    --tasks humaneval \
    --do_sample False \
    --temperature 0.0 \
    --n_samples 1 \
    --batch_size 8 \
    --allow_code_execution \
    --precision bf16 \
    --trust_remote_code \
    --save_generations \
    --save_generations_path "${HE_OUT}/humaneval_generations.json" \
    --metric_output_path "${HE_OUT}/humaneval_metrics.json" \
    2>&1 | tee "${HE_OUT}/humaneval.log"

echo ">>> HumanEval done -> ${HE_OUT}/humaneval_metrics.json"

# ─── MBPP (pass@1, 374 problems, greedy, 0-shot) ─────────────────────────────
echo ""
echo ">>> MBPP pass@1 (374 problems, greedy decoding)..."
MBPP_OUT="${RESULTS}/mbpp_${TIMESTAMP}"
mkdir -p "${MBPP_OUT}"

accelerate launch "${BIGCODE_DIR}/main.py" \
    --model "${MODEL}" \
    --tasks mbpp \
    --do_sample False \
    --temperature 0.0 \
    --n_samples 1 \
    --batch_size 8 \
    --allow_code_execution \
    --precision bf16 \
    --trust_remote_code \
    --save_generations \
    --save_generations_path "${MBPP_OUT}/mbpp_generations.json" \
    --metric_output_path "${MBPP_OUT}/mbpp_metrics.json" \
    2>&1 | tee "${MBPP_OUT}/mbpp.log"

echo ">>> MBPP done -> ${MBPP_OUT}/mbpp_metrics.json"

# ─── Summary ─────────────────────────────────────────────────────────────────
echo ""
echo "======================================="
echo "CODE EVAL COMPLETE"
python3 - <<'PYEOF'
import json, glob

for label, pat in [
    ("HumanEval", "evaluation/results/canonical/humaneval_*/humaneval_metrics.json"),
    ("MBPP", "evaluation/results/canonical/mbpp_*/mbpp_metrics.json"),
]:
    files = sorted(glob.glob(pat))
    if files:
        with open(files[-1]) as f:
            d = json.load(f)
        score = d.get("pass@1", d)
        print(f" {label}: pass@1 = {score}")
    else:
        print(f" {label}: no result file found")
PYEOF
echo "======================================="
evaluation/slurm_lm_eval.slurm (new file, 156 lines)
@@ -0,0 +1,156 @@
#!/bin/bash
#SBATCH --job-name=dante_lm_eval
#SBATCH --account=AIFAC_F02_254_0
#SBATCH --partition=boost_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=evaluation/slurm_logs/lm_eval_%j.out
#SBATCH --error=evaluation/slurm_logs/lm_eval_%j.err

# ============================================================================
# DANTE-Mosaic-3.5B — Canonical lm-evaluation-harness benchmark
# Leonardo Booster, 1x A100-40GB
#
# Tasks: MMLU, MMLU-Pro, GSM8K, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, IFEval
# Harness: EleutherAI lm-evaluation-harness 0.4.5
# All outputs -> evaluation/results/canonical/
# ============================================================================

set -euo pipefail

# ─── Environment ─────────────────────────────────────────────────────────────
module purge
module load cuda/12.4 python/3.11.7

export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false

# ─── Config ──────────────────────────────────────────────────────────────────
MODEL="OdaxAI/DANTE-Mosaic-3.5B"
MODEL_ARGS="pretrained=${MODEL},dtype=bfloat16,trust_remote_code=True"
BATCH=8
SEED=42
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS="evaluation/results/canonical"
mkdir -p "${RESULTS}"

echo "======================================="
echo "DANTE-Mosaic-3.5B — Canonical lm-eval"
echo "Model: ${MODEL}"
echo "Job ID: ${SLURM_JOB_ID}"
echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)"
echo "Timestamp: ${TIMESTAMP}"
echo "======================================="

# ─── Install harness if needed ────────────────────────────────────────────────
pip install --quiet lm-eval==0.4.5

# ─── Helper: run one task ─────────────────────────────────────────────────────
run_task() {
    local TASK=$1
    local FEWSHOT=$2
    local EXTRA="${3:-}"
    local OUT="${RESULTS}/${TASK}_${TIMESTAMP}.json"

    echo ""
    echo ">>> ${TASK} (${FEWSHOT}-shot) ..."
    lm_eval \
        --model hf \
        --model_args "${MODEL_ARGS}" \
        --tasks "${TASK}" \
        --num_fewshot "${FEWSHOT}" \
        --batch_size "${BATCH}" \
        --seed "${SEED}" \
        --output_path "${OUT}" \
        --log_samples \
        ${EXTRA} \
        2>&1 | tee "${RESULTS}/${TASK}_${TIMESTAMP}.log"
    echo ">>> DONE ${TASK} -> ${OUT}"
}

# ─── Run all canonical tasks ──────────────────────────────────────────────────
# MMLU: 57 subjects, 5-shot, accuracy
run_task "mmlu" 5

# MMLU-Pro: harder 10-option MCQ, 5-shot
run_task "mmlu_pro" 5

# GSM8K: 8-shot chain-of-thought, exact match on final answer
run_task "gsm8k" 8

# ARC-Challenge: 25-shot, normalised accuracy
run_task "arc_challenge" 25

# HellaSwag: 10-shot, normalised accuracy
run_task "hellaswag" 10

# TruthfulQA MC2: 0-shot, mc2 multiple true answers
run_task "truthfulqa_mc2" 0

# Winogrande: 5-shot, accuracy
run_task "winogrande" 5

# IFEval: 0-shot, prompt-level strict accuracy
run_task "ifeval" 0

# ─── Summary ─────────────────────────────────────────────────────────────────
echo ""
echo "======================================="
echo "ALL TASKS COMPLETE"
echo "Results saved to: ${RESULTS}/"
ls -lh "${RESULTS}/"
echo "======================================="

# ─── Parse and print summary ─────────────────────────────────────────────────
python3 - <<'PYEOF'
import json, glob, os, sys

results_dir = "evaluation/results/canonical"
files = sorted(glob.glob(f"{results_dir}/*.json"))
if not files:
    print("No result JSON files found.")
    sys.exit(0)

# One score per task
METRIC_MAP = {
    "mmlu": ("acc,none", "MMLU"),
    "mmlu_pro": ("acc,none", "MMLU-Pro"),
    "gsm8k": ("exact_match,strict-match", "GSM8K"),
    "arc_challenge": ("acc_norm,none", "ARC-Challenge"),
    "hellaswag": ("acc_norm,none", "HellaSwag"),
    "truthfulqa_mc2": ("mc2,none", "TruthfulQA"),
    "winogrande": ("acc,none", "Winogrande"),
    "ifeval": ("prompt_level_strict_acc,none", "IFEval"),
}

print("\n" + "="*60)
print(" CANONICAL BENCHMARK RESULTS — DANTE-Mosaic-3.5B")
print(" lm-evaluation-harness 0.4.5 | BF16 | A100-40GB | seed=42")
print("="*60)
print(f" {'Benchmark':<20} {'Metric':<35} {'Score':>8}")
print(" " + "-"*58)

for f in files:
    try:
        with open(f) as fh:
            data = json.load(fh)
        results = data.get("results", {})
        for task_key, (metric_key, label) in METRIC_MAP.items():
            if task_key in results:
                score = results[task_key].get(metric_key, None)
                if score is not None:
                    print(f" {label:<20} {metric_key:<35} {score*100:>7.2f}%")
    except Exception as e:
        print(f" [parse error: {f}: {e}]")

print("="*60)
print(f" Source: {results_dir}/")
print("="*60 + "\n")
PYEOF
evaluation/submit_all.sh (new file, 33 lines)
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# ============================================================================
# submit_all.sh — Submit all canonical evaluation jobs on Leonardo
#
# Usage (from the repo root on Leonardo):
#   bash evaluation/submit_all.sh
# ============================================================================

set -euo pipefail

cd "$(dirname "$0")/.."   # repo root

mkdir -p evaluation/slurm_logs evaluation/results/canonical

echo "=== Submitting DANTE-Mosaic-3.5B canonical evaluations ==="
echo ""

LM_JOB=$(sbatch --parsable evaluation/slurm_lm_eval.slurm)
echo " lm-eval job submitted: ${LM_JOB}"

CODE_JOB=$(sbatch --parsable evaluation/slurm_code_eval.slurm)
echo " code-eval job submitted: ${CODE_JOB}"

echo ""
echo "Monitor:"
echo " squeue -u \$USER"
echo " tail -f evaluation/slurm_logs/lm_eval_${LM_JOB}.out"
echo " tail -f evaluation/slurm_logs/code_eval_${CODE_JOB}.out"
echo ""
echo "After completion, parse results:"
echo " python3 evaluation/parse_canonical_results.py"
echo ""
echo "Results will be saved to: evaluation/results/canonical/"