Initialize project; model provided by the ModelHub XC community

Model: OdaxAI/DANTE-Mosaic-3.5B
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-14 15:44:10 +08:00
commit b0ba87406b
41 changed files with 1638 additions and 0 deletions

163
evaluation/README.md Normal file

@@ -0,0 +1,163 @@
# DANTE-Mosaic-3.5B — Canonical Evaluation Suite
This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.
---
## Evaluation types
| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode harness on full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on the hand-curated problem subsets (N=30/40/25) in `scripts/_benchmark_dante_offline.py` | **No** |
These two types must **never be mixed** in the same table or chart.
---
## Canonical benchmarks (lm-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |
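Each row corresponds to one `lm_eval` invocation inside `slurm_lm_eval.slurm`. As a reference sketch, the MMLU row can be reproduced manually with the same flags (the output path shown here is illustrative):
```bash
# Sketch: one canonical lm-eval task, using the same flags as slurm_lm_eval.slurm
lm_eval \
  --model hf \
  --model_args "pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16,trust_remote_code=True" \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --seed 42 \
  --log_samples \
  --output_path evaluation/results/canonical/mmlu_manual.json
```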
## Canonical benchmarks (bigcode-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |
---
## How to run on Leonardo Booster
### 1. Copy the repo to Leonardo
```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```
### 2. Submit all jobs with one command
```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```
This submits two SLURM jobs:
- `slurm_lm_eval.slurm` — all lm-eval tasks (8 h walltime limit on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (4 h walltime limit on 1× A100-40GB)
### 3. Monitor jobs
```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```
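Once a job has left the queue, `sacct` can confirm whether it finished within its walltime (standard SLURM accounting; the field list is just an example):
```bash
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,MaxRSS
```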
### 4. Parse results after completion
```bash
python3 evaluation/parse_canonical_results.py
```
This produces:
- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- Printed summary table in terminal
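Individual scores can then be pulled straight out of the summary JSON, for example with `jq` (assuming it is available; the key layout follows `parse_canonical_results.py`):
```bash
jq '.results.gsm8k.score' evaluation/results/canonical/CANONICAL_SUMMARY.json
```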
### 5. Transfer results to your Mac
```bash
rsync -av nsavioli@login.leonardo.cineca.it:\
/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
evaluation/results/canonical/
```
---
## Decoding and grading setup
All lm-eval runs use:
| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |
All code eval runs use:
| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |
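These parameters map onto the flags passed to the bigcode harness `main.py` in `slurm_code_eval.slurm`. A trimmed sketch of the HumanEval call (paths are illustrative):
```bash
accelerate launch bigcode-evaluation-harness/main.py \
  --model OdaxAI/DANTE-Mosaic-3.5B \
  --tasks humaneval \
  --temperature 0.0 \
  --n_samples 1 \
  --batch_size 8 \
  --precision bf16 \
  --allow_code_execution \
  --metric_output_path humaneval_metrics.json
```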
---
## Environment variables
Both SLURM scripts set the following:
```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```
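If the compute nodes cannot reach the internet, the model can be pre-fetched into this cache from a login node beforehand. A hedged sketch using `huggingface-cli` (assuming `huggingface_hub` is installed):
```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
huggingface-cli download OdaxAI/DANTE-Mosaic-3.5B
```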
---
## Results folder structure
```
evaluation/results/canonical/
    mmlu_<timestamp>.json              # lm-eval output (raw)
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json             # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md            # provenance table (generated by parse_canonical_results.py)
```
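Because every run embeds a `%Y%m%d_%H%M%S` timestamp in the filename, the most recent result for a task can be picked with a plain sort, which is the same idea `find_latest()` in `parse_canonical_results.py` relies on:
```bash
ls evaluation/results/canonical/gsm8k_*.json | sort | tail -1
```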
---
## Provenance table template
Once canonical results are available, `parse_canonical_results.py` will generate this automatically.
The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot, output JSON path, date, and hardware.
---
## Important: do not mix result types
Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25)
and canonical scores (from lm-eval / bigcode harness, full splits) use different
protocols and must never appear in the same table without an explicit separator and
label explaining the difference.
The parser script enforces this by saving canonical results only to `results/canonical/`
and printing a clear header.
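As an extra safeguard before copying numbers into any external table, the `type` field written by the parser can be checked explicitly. A minimal sketch, assuming `jq` is available:
```bash
SUMMARY=evaluation/results/canonical/CANONICAL_SUMMARY.json
[ "$(jq -r '.type' "$SUMMARY")" = "REAL_CANONICAL_RUN" ] \
  || { echo "Refusing to use non-canonical results"; exit 1; }
```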

212
evaluation/parse_canonical_results.py Normal file

@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""
parse_canonical_results.py
==========================
Reads all lm-eval and bigcode JSON outputs from evaluation/results/canonical/
and produces:
1. A clean summary table printed to stdout
2. evaluation/results/canonical/CANONICAL_SUMMARY.json
3. evaluation/results/canonical/CANONICAL_PROVENANCE.md
Usage:
python3 evaluation/parse_canonical_results.py
python3 evaluation/parse_canonical_results.py --results-dir evaluation/results/canonical
"""
from __future__ import annotations
import argparse
import json
import glob
import os
import sys
from datetime import datetime
from pathlib import Path
# ─── Metric keys per task ────────────────────────────────────────────────────
TASK_META = {
    "mmlu": {"label": "MMLU", "metric": "acc,none", "fewshot": 5, "n": 14042},
    "mmlu_pro": {"label": "MMLU-Pro", "metric": "acc,none", "fewshot": 5, "n": 4500},
    "gsm8k": {"label": "GSM8K", "metric": "exact_match,strict-match", "fewshot": 8, "n": 1319},
    "arc_challenge": {"label": "ARC-Challenge", "metric": "acc_norm,none", "fewshot": 25, "n": 1172},
    "hellaswag": {"label": "HellaSwag", "metric": "acc_norm,none", "fewshot": 10, "n": 10042},
    "truthfulqa_mc2": {"label": "TruthfulQA MC2", "metric": "mc2,none", "fewshot": 0, "n": 817},
    "winogrande": {"label": "Winogrande", "metric": "acc,none", "fewshot": 5, "n": 1267},
    "ifeval": {"label": "IFEval", "metric": "prompt_level_strict_acc,none", "fewshot": 0, "n": 541},
}
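# Note (added for clarity): the metric keys above follow the "<metric>,<filter>"
# convention used in the results JSON emitted by lm-eval 0.4.x, where "none" is
# the default (unfiltered) post-processing filter.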
CODE_TASKS = {
    "humaneval": {"label": "HumanEval", "metric": "pass@1", "fewshot": 0, "n": 164},
    "mbpp": {"label": "MBPP", "metric": "pass@1", "fewshot": 0, "n": 374},
}
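# find_latest() below relies on the %Y%m%d_%H%M%S timestamps in the filenames,
# which sort lexicographically in chronological order.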
def find_latest(pattern: str) -> str | None:
    files = sorted(glob.glob(pattern))
    return files[-1] if files else None
def parse_lm_eval(results_dir: str) -> dict:
    scores = {}
    for task_key, meta in TASK_META.items():
        pattern = f"{results_dir}/{task_key}_*.json"
        latest = find_latest(pattern)
        if not latest:
            continue
        try:
            with open(latest) as f:
                data = json.load(f)
            res = data.get("results", {})
            if task_key in res:
                raw = res[task_key].get(meta["metric"])
                if raw is not None:
                    mtime = os.path.getmtime(latest)
                    scores[task_key] = {
                        "label": meta["label"],
                        "score": round(raw * 100, 2),
                        "metric": meta["metric"],
                        "fewshot": meta["fewshot"],
                        "n": meta["n"],
                        "source": latest,
                        "date": datetime.fromtimestamp(mtime).strftime("%Y-%m-%d"),
                        "harness": "lm-evaluation-harness 0.4.5",
                        "model": data.get("config", {}).get("model_args", "OdaxAI/DANTE-Mosaic-3.5B"),
                        "dtype": "bfloat16",
                        "device": "NVIDIA A100-40GB",
                        "seed": 42,
                    }
        except Exception as e:
            print(f" [WARNING] parse error {latest}: {e}", file=sys.stderr)
    return scores
def parse_code_eval(results_dir: str) -> dict:
    scores = {}
    for task_key, meta in CODE_TASKS.items():
        # bigcode saves to subdir
        pattern = f"{results_dir}/{task_key}_*/{task_key}_metrics.json"
        latest = find_latest(pattern)
        if not latest:
            continue
        try:
            with open(latest) as f:
                data = json.load(f)
            # bigcode metric JSON is typically {"<task>": {"pass@1": ...}, "config": {...}};
            # fall back to a top-level "pass@1" key just in case.
            raw = data.get(task_key, {}).get("pass@1", data.get("pass@1"))
            if raw is not None:
                mtime = os.path.getmtime(latest)
                scores[task_key] = {
                    "label": meta["label"],
                    "score": round(raw * 100, 2),
                    "metric": "pass@1",
                    "fewshot": 0,
                    "n": meta["n"],
                    "source": latest,
                    "date": datetime.fromtimestamp(mtime).strftime("%Y-%m-%d"),
                    "harness": "bigcode-evaluation-harness",
                    "model": "OdaxAI/DANTE-Mosaic-3.5B",
                    "dtype": "bfloat16",
                    "device": "NVIDIA A100-40GB",
                    "seed": 0,
                }
        except Exception as e:
            print(f" [WARNING] parse error {latest}: {e}", file=sys.stderr)
    return scores
def print_table(all_scores: dict) -> None:
    SEP = "=" * 78
    print(f"\n{SEP}")
    print(" CANONICAL BENCHMARK RESULTS — OdaxAI/DANTE-Mosaic-3.5B")
    print(" All results produced by official evaluation harnesses on Leonardo HPC")
    print(f"{SEP}")
    print(f" {'Benchmark':<20} {'N':>6} {'Few-shot':>8} {'Metric':<20} {'Score':>8}")
    print(" " + "-" * 72)
    for k, v in sorted(all_scores.items(), key=lambda x: x[1]["label"]):
        print(f" {v['label']:<20} {v['n']:>6} {v['fewshot']:>8}-shot "
              f"{v['metric']:<20} {v['score']:>7.2f}%")
    print(f"{SEP}\n")
def write_summary(all_scores: dict, results_dir: str) -> None:
    out = {
        "model": "OdaxAI/DANTE-Mosaic-3.5B",
        "type": "REAL_CANONICAL_RUN",
        "harness": "lm-evaluation-harness 0.4.5 + bigcode-evaluation-harness",
        "hardware": "NVIDIA A100-40GB, BF16",
        "cluster": "CINECA Leonardo Booster",
        "seed": 42,
        "results": all_scores,
    }
    path = f"{results_dir}/CANONICAL_SUMMARY.json"
    with open(path, "w") as f:
        json.dump(out, f, indent=2)
    print(f" Summary JSON -> {path}")
def write_provenance(all_scores: dict, results_dir: str) -> None:
    lines = [
        "# Canonical Benchmark Provenance — DANTE-Mosaic-3.5B",
        "",
        "All results in this table are **REAL_CANONICAL_RUN** — produced by official",
        "evaluation harnesses on the `OdaxAI/DANTE-Mosaic-3.5B` checkpoint.",
        "They are directly comparable to published leaderboard scores.",
        "",
        "| Benchmark | Harness | Task name | N | Few-shot | Metric | Score | Date | Source JSON |",
        "|-----------|---------|-----------|---|----------|--------|-------|------|-------------|",
    ]
    for k, v in sorted(all_scores.items(), key=lambda x: x[1]["label"]):
        src = os.path.basename(v["source"])
        lines.append(
            f"| {v['label']} | {v['harness']} | `{k}` | {v['n']} | "
            f"{v['fewshot']}-shot | `{v['metric']}` | **{v['score']:.2f}%** | "
            f"{v['date']} | `{src}` |"
        )
    lines += [
        "",
        "## Hardware & Software",
        "",
        "| Property | Value |",
        "|----------|-------|",
        "| GPU | NVIDIA A100-SXM-40GB |",
        "| Precision | BF16 |",
        "| Cluster | CINECA Leonardo Booster |",
        "| lm-eval version | 0.4.5 |",
        "| Seed | 42 |",
        "",
        "## Comparability Note",
        "",
        "These canonical scores are produced under standard protocols and are directly",
        "comparable to published scores from the same harness versions.",
        "Internal offline subset scores (30/40/25 problems from `_benchmark_dante_offline.py`)",
        "are **separate** and must not be mixed with these canonical results.",
    ]
    path = f"{results_dir}/CANONICAL_PROVENANCE.md"
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    print(f" Provenance -> {path}")
def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", default="evaluation/results/canonical")
    args = parser.parse_args()
    rd = args.results_dir
    print(f"\nParsing results from: {rd}")
    lm = parse_lm_eval(rd)
    code = parse_code_eval(rd)
    all_scores = {**lm, **code}
    if not all_scores:
        print("No canonical results found yet.")
        print("Run evaluation/slurm_lm_eval.slurm and slurm_code_eval.slurm on Leonardo first.")
        sys.exit(0)
    print_table(all_scores)
    write_summary(all_scores, rd)
    write_provenance(all_scores, rd)
    print("\nDone.\n")

if __name__ == "__main__":
    main()

131
evaluation/slurm_code_eval.slurm Normal file

@@ -0,0 +1,131 @@
#!/bin/bash
#SBATCH --job-name=dante_code_eval
#SBATCH --account=AIFAC_F02_254_0
#SBATCH --partition=boost_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=evaluation/slurm_logs/code_eval_%j.out
#SBATCH --error=evaluation/slurm_logs/code_eval_%j.err
# ============================================================================
# DANTE-Mosaic-3.5B — Canonical code evaluation
# Leonardo Booster, 1x A100-40GB
#
# Tasks: HumanEval (pass@1), MBPP (pass@1)
# Harness: bigcode-evaluation-harness
# All outputs -> evaluation/results/canonical/
#
# WARNING: HumanEval and MBPP execute model-generated Python code.
# bigcode-evaluation-harness runs each sample in a separate subprocess,
# but this is not a hardened sandbox.
# Do NOT pass --allow_code_execution outside of a controlled environment.
# Leonardo compute nodes are isolated — this is acceptable here.
# ============================================================================
set -euo pipefail
# ─── Environment ─────────────────────────────────────────────────────────────
module purge
module load cuda/12.4 python/3.11.7
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
# ─── Config ──────────────────────────────────────────────────────────────────
MODEL="OdaxAI/DANTE-Mosaic-3.5B"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS="evaluation/results/canonical"
BIGCODE_DIR="/leonardo_scratch/large/userexternal/nsavioli/bigcode-evaluation-harness"
mkdir -p "${RESULTS}"
echo "======================================="
echo "DANTE-Mosaic-3.5B — Canonical code eval"
echo "Model: ${MODEL}"
echo "Job ID: ${SLURM_JOB_ID}"
echo "Timestamp: ${TIMESTAMP}"
echo "======================================="
# ─── Install bigcode harness if not present ───────────────────────────────────
if [ ! -d "${BIGCODE_DIR}" ]; then
echo "Cloning bigcode-evaluation-harness..."
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git \
"${BIGCODE_DIR}"
pip install --quiet -e "${BIGCODE_DIR}"
else
echo "bigcode-evaluation-harness found at ${BIGCODE_DIR}"
pip install --quiet -e "${BIGCODE_DIR}"
fi
# ─── HumanEval (pass@1, 164 problems, greedy, 0-shot) ────────────────────────
echo ""
echo ">>> HumanEval pass@1 (164 problems, greedy decoding)..."
HE_OUT="${RESULTS}/humaneval_${TIMESTAMP}"
mkdir -p "${HE_OUT}"
accelerate launch "${BIGCODE_DIR}/main.py" \
--model "${MODEL}" \
--tasks humaneval \
--do_sample False \
--temperature 0.0 \
--n_samples 1 \
--batch_size 8 \
--allow_code_execution \
--precision bf16 \
--trust_remote_code \
--save_generations \
--save_generations_path "${HE_OUT}/humaneval_generations.json" \
--metric_output_path "${HE_OUT}/humaneval_metrics.json" \
2>&1 | tee "${HE_OUT}/humaneval.log"
echo ">>> HumanEval done -> ${HE_OUT}/humaneval_metrics.json"
# ─── MBPP (pass@1, 374 problems, greedy, 0-shot) ─────────────────────────────
echo ""
echo ">>> MBPP pass@1 (374 problems, greedy decoding)..."
MBPP_OUT="${RESULTS}/mbpp_${TIMESTAMP}"
mkdir -p "${MBPP_OUT}"
accelerate launch "${BIGCODE_DIR}/main.py" \
--model "${MODEL}" \
--tasks mbpp \
--do_sample False \
--temperature 0.0 \
--n_samples 1 \
--batch_size 8 \
--allow_code_execution \
--precision bf16 \
--trust_remote_code \
--save_generations \
--save_generations_path "${MBPP_OUT}/mbpp_generations.json" \
--metric_output_path "${MBPP_OUT}/mbpp_metrics.json" \
2>&1 | tee "${MBPP_OUT}/mbpp.log"
echo ">>> MBPP done -> ${MBPP_OUT}/mbpp_metrics.json"
# ─── Summary ─────────────────────────────────────────────────────────────────
echo ""
echo "======================================="
echo "CODE EVAL COMPLETE"
python3 - <<'PYEOF'
import json, glob
for label, pat in [
    ("HumanEval", "evaluation/results/canonical/humaneval_*/humaneval_metrics.json"),
    ("MBPP", "evaluation/results/canonical/mbpp_*/mbpp_metrics.json"),
]:
    files = sorted(glob.glob(pat))
    if files:
        with open(files[-1]) as f:
            d = json.load(f)
        # bigcode metric JSON is typically {"<task>": {"pass@1": ...}, "config": {...}}
        score = d.get(label.lower(), {}).get("pass@1", d.get("pass@1", d))
        print(f" {label}: pass@1 = {score}")
    else:
        print(f" {label}: no result file found")
PYEOF
echo "======================================="

156
evaluation/slurm_lm_eval.slurm Normal file

@@ -0,0 +1,156 @@
#!/bin/bash
#SBATCH --job-name=dante_lm_eval
#SBATCH --account=AIFAC_F02_254_0
#SBATCH --partition=boost_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=evaluation/slurm_logs/lm_eval_%j.out
#SBATCH --error=evaluation/slurm_logs/lm_eval_%j.err
# ============================================================================
# DANTE-Mosaic-3.5B — Canonical lm-evaluation-harness benchmark
# Leonardo Booster, 1x A100-40GB
#
# Tasks: MMLU, MMLU-Pro, GSM8K, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, IFEval
# Harness: EleutherAI lm-evaluation-harness 0.4.5
# All outputs -> evaluation/results/canonical/
# ============================================================================
set -euo pipefail
# ─── Environment ─────────────────────────────────────────────────────────────
module purge
module load cuda/12.4 python/3.11.7
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
# ─── Config ──────────────────────────────────────────────────────────────────
MODEL="OdaxAI/DANTE-Mosaic-3.5B"
MODEL_ARGS="pretrained=${MODEL},dtype=bfloat16,trust_remote_code=True"
BATCH=8
SEED=42
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS="evaluation/results/canonical"
mkdir -p "${RESULTS}"
echo "======================================="
echo "DANTE-Mosaic-3.5B — Canonical lm-eval"
echo "Model: ${MODEL}"
echo "Job ID: ${SLURM_JOB_ID}"
echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)"
echo "Timestamp: ${TIMESTAMP}"
echo "======================================="
# ─── Install harness if needed ────────────────────────────────────────────────
pip install --quiet lm-eval==0.4.5
# ─── Helper: run one task ─────────────────────────────────────────────────────
run_task() {
local TASK=$1
local FEWSHOT=$2
local EXTRA="${3:-}"
local OUT="${RESULTS}/${TASK}_${TIMESTAMP}.json"
echo ""
echo ">>> ${TASK} (${FEWSHOT}-shot) ..."
lm_eval \
--model hf \
--model_args "${MODEL_ARGS}" \
--tasks "${TASK}" \
--num_fewshot "${FEWSHOT}" \
--batch_size "${BATCH}" \
--seed "${SEED}" \
--output_path "${OUT}" \
--log_samples \
${EXTRA} \
2>&1 | tee "${RESULTS}/${TASK}_${TIMESTAMP}.log"
echo ">>> DONE ${TASK} -> ${OUT}"
}
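# Usage: run_task <lm-eval task name> <num_fewshot> [extra lm_eval flags]
# Each call writes ${RESULTS}/<task>_${TIMESTAMP}.json plus a matching .log file.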
# ─── Run all canonical tasks ──────────────────────────────────────────────────
# MMLU: 57 subjects, 5-shot, accuracy
run_task "mmlu" 5
# MMLU-Pro: harder 10-option MCQ, 5-shot
run_task "mmlu_pro" 5
# GSM8K: 8-shot chain-of-thought, exact match on final answer
run_task "gsm8k" 8
# ARC-Challenge: 25-shot, normalised accuracy
run_task "arc_challenge" 25
# HellaSwag: 10-shot, normalised accuracy
run_task "hellaswag" 10
# TruthfulQA MC2: 0-shot, mc2 score over multiple true reference answers
run_task "truthfulqa_mc2" 0
# Winogrande: 5-shot, accuracy
run_task "winogrande" 5
# IFEval: 0-shot, prompt-level strict accuracy
run_task "ifeval" 0
# ─── Summary ─────────────────────────────────────────────────────────────────
echo ""
echo "======================================="
echo "ALL TASKS COMPLETE"
echo "Results saved to: ${RESULTS}/"
ls -lh "${RESULTS}/"
echo "======================================="
# ─── Parse and print summary ─────────────────────────────────────────────────
python3 - <<'PYEOF'
import json, glob, os, sys
results_dir = "evaluation/results/canonical"
files = sorted(glob.glob(f"{results_dir}/*.json"))
if not files:
    print("No result JSON files found.")
    sys.exit(0)
# One score per task
METRIC_MAP = {
    "mmlu": ("acc,none", "MMLU"),
    "mmlu_pro": ("acc,none", "MMLU-Pro"),
    "gsm8k": ("exact_match,strict-match", "GSM8K"),
    "arc_challenge": ("acc_norm,none", "ARC-Challenge"),
    "hellaswag": ("acc_norm,none", "HellaSwag"),
    "truthfulqa_mc2": ("mc2,none", "TruthfulQA"),
    "winogrande": ("acc,none", "Winogrande"),
    "ifeval": ("prompt_level_strict_acc,none", "IFEval"),
}
print("\n" + "="*60)
print(" CANONICAL BENCHMARK RESULTS — DANTE-Mosaic-3.5B")
print(" lm-evaluation-harness 0.4.5 | BF16 | A100-40GB | seed=42")
print("="*60)
print(f" {'Benchmark':<20} {'Metric':<35} {'Score':>8}")
print(" " + "-"*58)
for f in files:
    try:
        with open(f) as fh:
            data = json.load(fh)
        results = data.get("results", {})
        for task_key, (metric_key, label) in METRIC_MAP.items():
            if task_key in results:
                score = results[task_key].get(metric_key, None)
                if score is not None:
                    print(f" {label:<20} {metric_key:<35} {score*100:>7.2f}%")
    except Exception as e:
        print(f" [parse error: {f}: {e}]")
print("="*60)
print(f" Source: {results_dir}/")
print("="*60 + "\n")
PYEOF

33
evaluation/submit_all.sh Normal file

@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# ============================================================================
# submit_all.sh — Submit all canonical evaluation jobs on Leonardo
#
# Usage (from the repo root on Leonardo):
# bash evaluation/submit_all.sh
# ============================================================================
set -euo pipefail
cd "$(dirname "$0")/.." # repo root
mkdir -p evaluation/slurm_logs evaluation/results/canonical
echo "=== Submitting DANTE-Mosaic-3.5B canonical evaluations ==="
echo ""
LM_JOB=$(sbatch --parsable evaluation/slurm_lm_eval.slurm)
echo " lm-eval job submitted: ${LM_JOB}"
CODE_JOB=$(sbatch --parsable evaluation/slurm_code_eval.slurm)
echo " code-eval job submitted: ${CODE_JOB}"
echo ""
echo "Monitor:"
echo " squeue -u \$USER"
echo " tail -f evaluation/slurm_logs/lm_eval_${LM_JOB}.out"
echo " tail -f evaluation/slurm_logs/code_eval_${CODE_JOB}.out"
echo ""
echo "After completion, parse results:"
echo " python3 evaluation/parse_canonical_results.py"
echo ""
echo "Results will be saved to: evaluation/results/canonical/"