DANTE-Mosaic-3.5B — Canonical Evaluation Suite

This folder contains all evaluation scripts for OdaxAI/DANTE-Mosaic-3.5B.


Evaluation types

Type                  Description                                                                        Comparable to leaderboard?
REAL_CANONICAL_RUN    Run via lm-evaluation-harness or bigcode-evaluation-harness on the full dataset    Yes
REAL_INTERNAL_SUBSET  Run on hand-curated subsets (N=30/40/25) via scripts/_benchmark_dante_offline.py   No

These two types must never be mixed in the same table or chart.


Canonical benchmarks (lm-evaluation-harness)

Benchmark       Task name       N samples  Few-shot  Metric                    Script
MMLU            mmlu            14,042     5         acc                       slurm_lm_eval.slurm
MMLU-Pro        mmlu_pro        4,500      5         acc                       slurm_lm_eval.slurm
GSM8K           gsm8k           1,319      8         exact_match,strict-match  slurm_lm_eval.slurm
ARC-Challenge   arc_challenge   1,172      25        acc_norm                  slurm_lm_eval.slurm
HellaSwag       hellaswag       10,042     10        acc_norm                  slurm_lm_eval.slurm
TruthfulQA MC2  truthfulqa_mc2  817        0         mc2                       slurm_lm_eval.slurm
Winogrande      winogrande      1,267      5         acc                       slurm_lm_eval.slurm
IFEval          ifeval          541        0         prompt_level_strict_acc   slurm_lm_eval.slurm

Canonical benchmarks (bigcode-evaluation-harness)

Benchmark  Task name  N samples  Few-shot  Metric  Script
HumanEval  humaneval  164        0         pass@1  slurm_code_eval.slurm
MBPP       mbpp       374        0         pass@1  slurm_code_eval.slurm

How to run on Leonardo Booster

1. Copy the repo to Leonardo

rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/

2. Submit all jobs with one command

# From the repo root on Leonardo:
bash evaluation/submit_all.sh

This submits two SLURM jobs:

  • slurm_lm_eval.slurm — all lm-eval tasks (~46 hours on 1× A100-40GB)
  • slurm_code_eval.slurm — HumanEval + MBPP (~12 hours on 1× A100-40GB)
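
For reference, submit_all.sh can be as small as two sbatch calls. A minimal sketch, assuming the layout above (the real script may add error handling or log-directory setup):

#!/usr/bin/env bash
# Minimal sketch of evaluation/submit_all.sh -- run from the repo root.
set -euo pipefail
mkdir -p evaluation/slurm_logs            # where the SLURM .out files land
sbatch evaluation/slurm_lm_eval.slurm     # ~46 h on 1x A100-40GB
sbatch evaluation/slurm_code_eval.slurm   # ~12 h on 1x A100-40GB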

3. Monitor jobs

squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
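
After a job leaves the queue, sacct reports its final state and runtime (standard SLURM tooling, nothing repo-specific):

# Final state, elapsed time, and exit code of a finished job
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,ExitCode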

4. Parse results after completion

python3 evaluation/parse_canonical_results.py

This produces:

  • results/canonical/CANONICAL_SUMMARY.json — all scores as JSON
  • results/canonical/CANONICAL_PROVENANCE.md — full provenance table (Markdown)
  • Printed summary table in terminal
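
For a quick look at the summary without opening the file, pretty-print it (assuming jq is available; python3 -m json.tool works as well):

# Pretty-print the parsed scores; the exact schema is whatever the parser emits
jq . evaluation/results/canonical/CANONICAL_SUMMARY.json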

5. Transfer results to your Mac

rsync -av \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
  evaluation/results/canonical/

Decoding and grading setup

All lm-eval runs use:

Parameter        Value
temperature      0.0 (greedy)
do_sample        false
batch_size       8
seed             42
precision        bfloat16
device           CUDA (A100-40GB)
harness version  lm-eval 0.4.5
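
For transparency, the corresponding lm-eval 0.4.x invocation has roughly this shape; the task shown (gsm8k, 8-shot) is one example, and the authoritative flags live in slurm_lm_eval.slurm:

lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8 \
  --seed 42 \
  --device cuda:0 \
  --output_path evaluation/results/canonical/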

All code eval runs use:

Parameter    Value
temperature  0.0 (greedy)
n_samples    1 (pass@1)
batch_size   8
precision    bfloat16
harness      bigcode-evaluation-harness (latest main)
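
The bigcode-evaluation-harness equivalent looks roughly like this (run from a harness checkout; --allow_code_execution is required because generated code is executed; see slurm_code_eval.slurm for the exact flags):

accelerate launch main.py \
  --model OdaxAI/DANTE-Mosaic-3.5B \
  --tasks humaneval \
  --temperature 0.0 \
  --n_samples 1 \
  --batch_size 8 \
  --precision bf16 \
  --allow_code_execution \
  --metric_output_path humaneval_metrics.json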

Environment variables

Both SLURM scripts set the following:

export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false

Results folder structure

evaluation/results/canonical/
  mmlu_<timestamp>.json            # lm-eval output (raw)
  mmlu_pro_<timestamp>.json
  gsm8k_<timestamp>.json
  arc_challenge_<timestamp>.json
  hellaswag_<timestamp>.json
  truthfulqa_mc2_<timestamp>.json
  winogrande_<timestamp>.json
  ifeval_<timestamp>.json
  humaneval_<timestamp>/
    humaneval_generations.json
    humaneval_metrics.json
  mbpp_<timestamp>/
    mbpp_generations.json
    mbpp_metrics.json
  CANONICAL_SUMMARY.json           # parsed summary (generated by parse_canonical_results.py)
  CANONICAL_PROVENANCE.md          # provenance table (generated by parse_canonical_results.py)
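
Assuming the standard lm-eval 0.4.x output layout (a top-level "results" object keyed by task), individual scores can be pulled straight from the raw files:

# Print the metrics block for one task (lm-eval 0.4.x schema assumption)
jq '.results' evaluation/results/canonical/gsm8k_<timestamp>.json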

Provenance table template

Once canonical results are available, parse_canonical_results.py will generate this automatically. The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot, output JSON path, date, and hardware.
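
For illustration only (placeholders, not real results), a row could look like:

Benchmark  Harness        Task name  N samples  Metric                    Decoding / few-shot  Output JSON             Date        Hardware
GSM8K      lm-eval 0.4.5  gsm8k      1,319      exact_match,strict-match  greedy, 8-shot       gsm8k_<timestamp>.json  YYYY-MM-DD  1× A100-40GB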


Important: do not mix result types

Internal subset scores (from scripts/_benchmark_dante_offline.py, N=30/40/25) and canonical scores (from lm-eval / bigcode harness, full splits) use different protocols and must never appear in the same table without an explicit separator and label explaining the difference.

The parser script enforces this by saving canonical results only to results/canonical/ and printing a clear header.