Initialize the project; model provided by the ModelHub XC community
Model: OdaxAI/DANTE-Mosaic-3.5B | Source: Original Platform

evaluation/README.md (new file, 163 lines)
@@ -0,0 +1,163 @@
# DANTE-Mosaic-3.5B — Canonical Evaluation Suite

This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.

---

## Evaluation types

| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode harness on the full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | **No** |

These two types must **never be mixed** in the same table or chart.

---

## Canonical benchmarks (lm-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |

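Each table row corresponds to one `lm_eval` call inside `slurm_lm_eval.slurm`. As a minimal sketch of what a single canonical run looks like outside SLURM (assuming `lm-eval` 0.4.5 is installed and the checkpoint is reachable; the output filename here is only an example), the GSM8K row maps to:

```bash
# Single canonical task run (GSM8K row above), mirroring the run_task helper
# in slurm_lm_eval.slurm. The output path is an example; the SLURM script
# appends a timestamp instead.
lm_eval \
  --model hf \
  --model_args "pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16,trust_remote_code=True" \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8 \
  --seed 42 \
  --output_path evaluation/results/canonical/gsm8k_manual.json \
  --log_samples
```
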
## Canonical benchmarks (bigcode-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |

---

## How to run on Leonardo Booster

### 1. Copy the repo to Leonardo

```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
    nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```

### 2. Submit all jobs with one command

```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```

This submits two SLURM jobs:

- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)

### 3. Monitor jobs

```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```

### 4. Parse results after completion

```bash
python3 evaluation/parse_canonical_results.py
```

This produces:

- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON (a sketch for reading this file follows below)
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- A printed summary table in the terminal

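To pull a single score out of `CANONICAL_SUMMARY.json` (for a plot or a model card), here is a minimal sketch, assuming the schema written by `write_summary()` in `parse_canonical_results.py`: a `results` dict keyed by task name, with `score` already expressed as a percentage. The `gsm8k` key is only an example task:

```bash
python3 - <<'PY'
import json

# Read one benchmark score from the parsed summary.
# The results.<task>.{label,score,metric,fewshot} layout matches
# write_summary() in parse_canonical_results.py; gsm8k is an example task key.
with open("evaluation/results/canonical/CANONICAL_SUMMARY.json") as f:
    summary = json.load(f)

entry = summary["results"]["gsm8k"]
print(f"{entry['label']}: {entry['score']}% ({entry['metric']}, {entry['fewshot']}-shot)")
PY
```
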
### 5. Transfer results to your Mac

```bash
rsync -av nsavioli@login.leonardo.cineca.it:\
/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
    evaluation/results/canonical/
```

---

## Decoding and grading setup

All lm-eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |

All code eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |

---

## Environment variables

Both SLURM scripts set the following:

```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```

---

## Results folder structure

```
evaluation/results/canonical/
    mmlu_<timestamp>.json            # lm-eval output (raw)
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json           # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md          # provenance table (generated by parse_canonical_results.py)
```

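Because every run is timestamped, several result files for the same task can accumulate. `parse_canonical_results.py` always takes the newest one (the lexicographically last match, which works because of the `%Y%m%d_%H%M%S` stamp). The same selection can be sketched in the shell, with `gsm8k` as an example task:

```bash
# Latest canonical result file for one task (gsm8k used as an example).
# Timestamps sort lexicographically, so the last sorted match is the newest run.
latest=$(ls evaluation/results/canonical/gsm8k_*.json 2>/dev/null | sort | tail -n 1)
echo "Latest gsm8k result: ${latest:-none found}"
```
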
---

## Provenance table template

Once canonical results are available, `parse_canonical_results.py` will generate this table automatically.
The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot, output JSON path, date, and hardware.

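As an illustration only, one row of that table would look like the following, with placeholders where real run outputs would go (the column layout matches `write_provenance()` in `parse_canonical_results.py`):

```
| Benchmark | Harness | Task name | N | Few-shot | Metric | Score | Date | Source JSON |
|-----------|---------|-----------|---|----------|--------|-------|------|-------------|
| GSM8K | lm-evaluation-harness 0.4.5 | `gsm8k` | 1319 | 8-shot | `exact_match,strict-match` | **<score>%** | <YYYY-MM-DD> | `gsm8k_<timestamp>.json` |
```
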
---

## Important: do not mix result types

Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25)
and canonical scores (from lm-eval / bigcode harness, full splits) use different
protocols and must never appear in the same table without an explicit separator and
label explaining the difference.

The parser script enforces this by saving canonical results only to `results/canonical/`
and printing a clear header.
evaluation/parse_canonical_results.py (new file, 212 lines)
@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""
parse_canonical_results.py
==========================
Reads all lm-eval and bigcode JSON outputs from evaluation/results/canonical/
and produces:
  1. A clean summary table printed to stdout
  2. evaluation/results/canonical/CANONICAL_SUMMARY.json
  3. evaluation/results/canonical/CANONICAL_PROVENANCE.md

Usage:
    python3 evaluation/parse_canonical_results.py
    python3 evaluation/parse_canonical_results.py --results-dir evaluation/results/canonical
"""

from __future__ import annotations

import argparse
import json
import glob
import os
import sys
from datetime import datetime

# ─── Metric keys per task ────────────────────────────────────────────────────
TASK_META = {
    "mmlu": {"label": "MMLU", "metric": "acc,none", "fewshot": 5, "n": 14042},
    "mmlu_pro": {"label": "MMLU-Pro", "metric": "acc,none", "fewshot": 5, "n": 4500},
    "gsm8k": {"label": "GSM8K", "metric": "exact_match,strict-match", "fewshot": 8, "n": 1319},
    "arc_challenge": {"label": "ARC-Challenge", "metric": "acc_norm,none", "fewshot": 25, "n": 1172},
    "hellaswag": {"label": "HellaSwag", "metric": "acc_norm,none", "fewshot": 10, "n": 10042},
    "truthfulqa_mc2": {"label": "TruthfulQA MC2", "metric": "mc2,none", "fewshot": 0, "n": 817},
    "winogrande": {"label": "Winogrande", "metric": "acc,none", "fewshot": 5, "n": 1267},
    "ifeval": {"label": "IFEval", "metric": "prompt_level_strict_acc,none", "fewshot": 0, "n": 541},
}

CODE_TASKS = {
    "humaneval": {"label": "HumanEval", "metric": "pass@1", "fewshot": 0, "n": 164},
    "mbpp": {"label": "MBPP", "metric": "pass@1", "fewshot": 0, "n": 374},
}


def find_latest(pattern: str) -> str | None:
    files = sorted(glob.glob(pattern))
    return files[-1] if files else None


def parse_lm_eval(results_dir: str) -> dict:
    scores = {}
    for task_key, meta in TASK_META.items():
        # Timestamps start with a digit; the [0-9] keeps the "mmlu" pattern
        # from also matching "mmlu_pro_<timestamp>.json" files.
        pattern = f"{results_dir}/{task_key}_[0-9]*.json"
        latest = find_latest(pattern)
        if not latest:
            continue
        try:
            with open(latest) as f:
                data = json.load(f)
            res = data.get("results", {})
            if task_key in res:
                raw = res[task_key].get(meta["metric"])
                if raw is not None:
                    mtime = os.path.getmtime(latest)
                    scores[task_key] = {
                        "label": meta["label"],
                        "score": round(raw * 100, 2),
                        "metric": meta["metric"],
                        "fewshot": meta["fewshot"],
                        "n": meta["n"],
                        "source": latest,
                        "date": datetime.fromtimestamp(mtime).strftime("%Y-%m-%d"),
                        "harness": "lm-evaluation-harness 0.4.5",
                        "model": data.get("config", {}).get("model_args", "OdaxAI/DANTE-Mosaic-3.5B"),
                        "dtype": "bfloat16",
                        "device": "NVIDIA A100-40GB",
                        "seed": 42,
                    }
        except Exception as e:
            print(f" [WARNING] parse error {latest}: {e}", file=sys.stderr)
    return scores


def parse_code_eval(results_dir: str) -> dict:
    scores = {}
    for task_key, meta in CODE_TASKS.items():
        # bigcode saves to subdir
        pattern = f"{results_dir}/{task_key}_*/{task_key}_metrics.json"
        latest = find_latest(pattern)
        if not latest:
            continue
        try:
            with open(latest) as f:
                data = json.load(f)
            raw = data.get("pass@1")
            if raw is not None:
                mtime = os.path.getmtime(latest)
                scores[task_key] = {
                    "label": meta["label"],
                    "score": round(raw * 100, 2),
                    "metric": "pass@1",
                    "fewshot": 0,
                    "n": meta["n"],
                    "source": latest,
                    "date": datetime.fromtimestamp(mtime).strftime("%Y-%m-%d"),
                    "harness": "bigcode-evaluation-harness",
                    "model": "OdaxAI/DANTE-Mosaic-3.5B",
                    "dtype": "bfloat16",
                    "device": "NVIDIA A100-40GB",
                    "seed": 0,
                }
        except Exception as e:
            print(f" [WARNING] parse error {latest}: {e}", file=sys.stderr)
    return scores


def print_table(all_scores: dict) -> None:
    SEP = "=" * 78
    print(f"\n{SEP}")
    print(" CANONICAL BENCHMARK RESULTS — OdaxAI/DANTE-Mosaic-3.5B")
    print(" All results produced by official evaluation harnesses on Leonardo HPC")
    print(f"{SEP}")
    print(f" {'Benchmark':<20} {'N':>6} {'Few-shot':>8} {'Metric':<20} {'Score':>8}")
    print(" " + "-" * 72)
    for k, v in sorted(all_scores.items(), key=lambda x: x[1]["label"]):
        print(f" {v['label']:<20} {v['n']:>6} {v['fewshot']:>8}-shot "
              f"{v['metric']:<20} {v['score']:>7.2f}%")
    print(f"{SEP}\n")


def write_summary(all_scores: dict, results_dir: str) -> None:
    out = {
        "model": "OdaxAI/DANTE-Mosaic-3.5B",
        "type": "REAL_CANONICAL_RUN",
        "harness": "lm-evaluation-harness 0.4.5 + bigcode-evaluation-harness",
        "hardware": "NVIDIA A100-40GB, BF16",
        "cluster": "CINECA Leonardo Booster",
        "seed": 42,
        "results": all_scores,
    }
    path = f"{results_dir}/CANONICAL_SUMMARY.json"
    with open(path, "w") as f:
        json.dump(out, f, indent=2)
    print(f" Summary JSON -> {path}")


def write_provenance(all_scores: dict, results_dir: str) -> None:
    lines = [
        "# Canonical Benchmark Provenance — DANTE-Mosaic-3.5B",
        "",
        "All results in this table are **REAL_CANONICAL_RUN** — produced by official",
        "evaluation harnesses on the `OdaxAI/DANTE-Mosaic-3.5B` checkpoint.",
        "They are directly comparable to published leaderboard scores.",
        "",
        "| Benchmark | Harness | Task name | N | Few-shot | Metric | Score | Date | Source JSON |",
        "|-----------|---------|-----------|---|----------|--------|-------|------|-------------|",
    ]
    for k, v in sorted(all_scores.items(), key=lambda x: x[1]["label"]):
        src = os.path.basename(v["source"])
        lines.append(
            f"| {v['label']} | {v['harness']} | `{k}` | {v['n']} | "
            f"{v['fewshot']}-shot | `{v['metric']}` | **{v['score']:.2f}%** | "
            f"{v['date']} | `{src}` |"
        )
    lines += [
        "",
        "## Hardware & Software",
        "",
        "| Property | Value |",
        "|----------|-------|",
        "| GPU | NVIDIA A100-SXM-40GB |",
        "| Precision | BF16 |",
        "| Cluster | CINECA Leonardo Booster |",
        "| lm-eval version | 0.4.5 |",
        "| Seed | 42 |",
        "",
        "## Comparability Note",
        "",
        "These canonical scores are produced under standard protocols and are directly",
        "comparable to published scores from the same harness versions.",
        "Internal offline subset scores (30/40/25 problems from `_benchmark_dante_offline.py`)",
        "are **separate** and must not be mixed with these canonical results.",
    ]
    path = f"{results_dir}/CANONICAL_PROVENANCE.md"
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    print(f" Provenance -> {path}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", default="evaluation/results/canonical")
    args = parser.parse_args()

    rd = args.results_dir
    print(f"\nParsing results from: {rd}")

    lm = parse_lm_eval(rd)
    code = parse_code_eval(rd)
    all_scores = {**lm, **code}

    if not all_scores:
        print("No canonical results found yet.")
        print("Run evaluation/slurm_lm_eval.slurm and slurm_code_eval.slurm on Leonardo first.")
        sys.exit(0)

    print_table(all_scores)
    write_summary(all_scores, rd)
    write_provenance(all_scores, rd)
    print("\nDone.\n")


if __name__ == "__main__":
    main()
evaluation/slurm_code_eval.slurm (new file, 131 lines)
@@ -0,0 +1,131 @@
#!/bin/bash
#SBATCH --job-name=dante_code_eval
#SBATCH --account=AIFAC_F02_254_0
#SBATCH --partition=boost_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=evaluation/slurm_logs/code_eval_%j.out
#SBATCH --error=evaluation/slurm_logs/code_eval_%j.err

# ============================================================================
# DANTE-Mosaic-3.5B — Canonical code evaluation
# Leonardo Booster, 1x A100-40GB
#
# Tasks: HumanEval (pass@1), MBPP (pass@1)
# Harness: bigcode-evaluation-harness
# All outputs -> evaluation/results/canonical/
#
# WARNING: HumanEval executes generated Python code.
# bigcode-evaluation-harness uses a sandboxed subprocess per sample.
# Do NOT run --allow_code_execution outside of a secure environment.
# Leonardo compute nodes are isolated — this is acceptable here.
# ============================================================================

set -euo pipefail

# ─── Environment ─────────────────────────────────────────────────────────────
module purge
module load cuda/12.4 python/3.11.7

export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false

# ─── Config ──────────────────────────────────────────────────────────────────
MODEL="OdaxAI/DANTE-Mosaic-3.5B"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS="evaluation/results/canonical"
BIGCODE_DIR="/leonardo_scratch/large/userexternal/nsavioli/bigcode-evaluation-harness"
mkdir -p "${RESULTS}"

echo "======================================="
echo "DANTE-Mosaic-3.5B — Canonical code eval"
echo "Model: ${MODEL}"
echo "Job ID: ${SLURM_JOB_ID}"
echo "Timestamp: ${TIMESTAMP}"
echo "======================================="

# ─── Install bigcode harness if not present ──────────────────────────────────
if [ ! -d "${BIGCODE_DIR}" ]; then
    echo "Cloning bigcode-evaluation-harness..."
    git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git \
        "${BIGCODE_DIR}"
    pip install --quiet -e "${BIGCODE_DIR}"
else
    echo "bigcode-evaluation-harness found at ${BIGCODE_DIR}"
    pip install --quiet -e "${BIGCODE_DIR}"
fi

# ─── HumanEval (pass@1, 164 problems, greedy, 0-shot) ────────────────────────
echo ""
echo ">>> HumanEval pass@1 (164 problems, greedy decoding)..."
HE_OUT="${RESULTS}/humaneval_${TIMESTAMP}"
mkdir -p "${HE_OUT}"

accelerate launch "${BIGCODE_DIR}/main.py" \
    --model "${MODEL}" \
    --tasks humaneval \
    --do_sample False \
    --temperature 0.0 \
    --n_samples 1 \
    --batch_size 8 \
    --allow_code_execution \
    --precision bf16 \
    --trust_remote_code \
    --save_generations \
    --save_generations_path "${HE_OUT}/humaneval_generations.json" \
    --metric_output_path "${HE_OUT}/humaneval_metrics.json" \
    2>&1 | tee "${HE_OUT}/humaneval.log"

echo ">>> HumanEval done -> ${HE_OUT}/humaneval_metrics.json"

# ─── MBPP (pass@1, 374 problems, greedy, 0-shot) ─────────────────────────────
echo ""
echo ">>> MBPP pass@1 (374 problems, greedy decoding)..."
MBPP_OUT="${RESULTS}/mbpp_${TIMESTAMP}"
mkdir -p "${MBPP_OUT}"

accelerate launch "${BIGCODE_DIR}/main.py" \
    --model "${MODEL}" \
    --tasks mbpp \
    --do_sample False \
    --temperature 0.0 \
    --n_samples 1 \
    --batch_size 8 \
    --allow_code_execution \
    --precision bf16 \
    --trust_remote_code \
    --save_generations \
    --save_generations_path "${MBPP_OUT}/mbpp_generations.json" \
    --metric_output_path "${MBPP_OUT}/mbpp_metrics.json" \
    2>&1 | tee "${MBPP_OUT}/mbpp.log"

echo ">>> MBPP done -> ${MBPP_OUT}/mbpp_metrics.json"

# ─── Summary ─────────────────────────────────────────────────────────────────
echo ""
echo "======================================="
echo "CODE EVAL COMPLETE"
python3 - <<'PYEOF'
import json, glob

for label, pat in [
    ("HumanEval", "evaluation/results/canonical/humaneval_*/humaneval_metrics.json"),
    ("MBPP", "evaluation/results/canonical/mbpp_*/mbpp_metrics.json"),
]:
    files = sorted(glob.glob(pat))
    if files:
        with open(files[-1]) as f:
            d = json.load(f)
        score = d.get("pass@1", d)
        print(f" {label}: pass@1 = {score}")
    else:
        print(f" {label}: no result file found")
PYEOF
echo "======================================="
evaluation/slurm_lm_eval.slurm (new file, 156 lines)
@@ -0,0 +1,156 @@
#!/bin/bash
#SBATCH --job-name=dante_lm_eval
#SBATCH --account=AIFAC_F02_254_0
#SBATCH --partition=boost_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=evaluation/slurm_logs/lm_eval_%j.out
#SBATCH --error=evaluation/slurm_logs/lm_eval_%j.err

# ============================================================================
# DANTE-Mosaic-3.5B — Canonical lm-evaluation-harness benchmark
# Leonardo Booster, 1x A100-40GB
#
# Tasks: MMLU, MMLU-Pro, GSM8K, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, IFEval
# Harness: EleutherAI lm-evaluation-harness 0.4.5
# All outputs -> evaluation/results/canonical/
# ============================================================================

set -euo pipefail

# ─── Environment ─────────────────────────────────────────────────────────────
module purge
module load cuda/12.4 python/3.11.7

export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false

# ─── Config ──────────────────────────────────────────────────────────────────
MODEL="OdaxAI/DANTE-Mosaic-3.5B"
MODEL_ARGS="pretrained=${MODEL},dtype=bfloat16,trust_remote_code=True"
BATCH=8
SEED=42
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS="evaluation/results/canonical"
mkdir -p "${RESULTS}"

echo "======================================="
echo "DANTE-Mosaic-3.5B — Canonical lm-eval"
echo "Model: ${MODEL}"
echo "Job ID: ${SLURM_JOB_ID}"
echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)"
echo "Timestamp: ${TIMESTAMP}"
echo "======================================="

# ─── Install harness if needed ────────────────────────────────────────────────
pip install --quiet lm-eval==0.4.5

# ─── Helper: run one task ─────────────────────────────────────────────────────
run_task() {
    local TASK=$1
    local FEWSHOT=$2
    local EXTRA="${3:-}"
    local OUT="${RESULTS}/${TASK}_${TIMESTAMP}.json"

    echo ""
    echo ">>> ${TASK} (${FEWSHOT}-shot) ..."
    lm_eval \
        --model hf \
        --model_args "${MODEL_ARGS}" \
        --tasks "${TASK}" \
        --num_fewshot "${FEWSHOT}" \
        --batch_size "${BATCH}" \
        --seed "${SEED}" \
        --output_path "${OUT}" \
        --log_samples \
        ${EXTRA} \
        2>&1 | tee "${RESULTS}/${TASK}_${TIMESTAMP}.log"
    echo ">>> DONE ${TASK} -> ${OUT}"
}

# ─── Run all canonical tasks ──────────────────────────────────────────────────
# MMLU: 57 subjects, 5-shot, accuracy
run_task "mmlu" 5

# MMLU-Pro: harder 10-option MCQ, 5-shot
run_task "mmlu_pro" 5

# GSM8K: 8-shot chain-of-thought, exact match on final answer
run_task "gsm8k" 8

# ARC-Challenge: 25-shot, normalised accuracy
run_task "arc_challenge" 25

# HellaSwag: 10-shot, normalised accuracy
run_task "hellaswag" 10

# TruthfulQA MC2: 0-shot, mc2 multiple true answers
run_task "truthfulqa_mc2" 0

# Winogrande: 5-shot, accuracy
run_task "winogrande" 5

# IFEval: 0-shot, prompt-level strict accuracy
run_task "ifeval" 0

# ─── Summary ─────────────────────────────────────────────────────────────────
echo ""
echo "======================================="
echo "ALL TASKS COMPLETE"
echo "Results saved to: ${RESULTS}/"
ls -lh "${RESULTS}/"
echo "======================================="

# ─── Parse and print summary ─────────────────────────────────────────────────
python3 - <<'PYEOF'
import json, glob, os, sys

results_dir = "evaluation/results/canonical"
files = sorted(glob.glob(f"{results_dir}/*.json"))
if not files:
    print("No result JSON files found.")
    sys.exit(0)

# One score per task
METRIC_MAP = {
    "mmlu": ("acc,none", "MMLU"),
    "mmlu_pro": ("acc,none", "MMLU-Pro"),
    "gsm8k": ("exact_match,strict-match", "GSM8K"),
    "arc_challenge": ("acc_norm,none", "ARC-Challenge"),
    "hellaswag": ("acc_norm,none", "HellaSwag"),
    "truthfulqa_mc2": ("mc2,none", "TruthfulQA"),
    "winogrande": ("acc,none", "Winogrande"),
    "ifeval": ("prompt_level_strict_acc,none", "IFEval"),
}

print("\n" + "="*60)
print(" CANONICAL BENCHMARK RESULTS — DANTE-Mosaic-3.5B")
print(" lm-evaluation-harness 0.4.5 | BF16 | A100-40GB | seed=42")
print("="*60)
print(f" {'Benchmark':<20} {'Metric':<35} {'Score':>8}")
print(" " + "-"*58)

for f in files:
    try:
        with open(f) as fh:
            data = json.load(fh)
        results = data.get("results", {})
        for task_key, (metric_key, label) in METRIC_MAP.items():
            if task_key in results:
                score = results[task_key].get(metric_key, None)
                if score is not None:
                    print(f" {label:<20} {metric_key:<35} {score*100:>7.2f}%")
    except Exception as e:
        print(f" [parse error: {f}: {e}]")

print("="*60)
print(f" Source: {results_dir}/")
print("="*60 + "\n")
PYEOF
evaluation/submit_all.sh (new file, 33 lines)
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# ============================================================================
# submit_all.sh — Submit all canonical evaluation jobs on Leonardo
#
# Usage (from the repo root on Leonardo):
#   bash evaluation/submit_all.sh
# ============================================================================

set -euo pipefail

cd "$(dirname "$0")/.."   # repo root

mkdir -p evaluation/slurm_logs evaluation/results/canonical

echo "=== Submitting DANTE-Mosaic-3.5B canonical evaluations ==="
echo ""

LM_JOB=$(sbatch --parsable evaluation/slurm_lm_eval.slurm)
echo " lm-eval job submitted: ${LM_JOB}"

CODE_JOB=$(sbatch --parsable evaluation/slurm_code_eval.slurm)
echo " code-eval job submitted: ${CODE_JOB}"

echo ""
echo "Monitor:"
echo " squeue -u \$USER"
echo " tail -f evaluation/slurm_logs/lm_eval_${LM_JOB}.out"
echo " tail -f evaluation/slurm_logs/code_eval_${CODE_JOB}.out"
echo ""
echo "After completion, parse results:"
echo " python3 evaluation/parse_canonical_results.py"
echo ""
echo "Results will be saved to: evaluation/results/canonical/"