# DANTE-Mosaic-3.5B — Canonical Evaluation Suite
This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.
## Evaluation types
| Type | Description | Comparable to leaderboard? |
|---|---|---|
| REAL_CANONICAL_RUN | Run via lm-evaluation-harness or bigcode-evaluation-harness on the full benchmark dataset | Yes |
| REAL_INTERNAL_SUBSET | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | No |
These two types must never be mixed in the same table or chart.
## Canonical benchmarks (lm-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|---|---|---|---|---|---|
| MMLU | `mmlu` | 14,042 | 5 | acc | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | acc | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | exact_match,strict-match | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | acc_norm | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | acc_norm | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | mc2 | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | acc | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | prompt_level_strict_acc | `slurm_lm_eval.slurm` |
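For reference, a single canonical task can also be launched by hand. A minimal sketch, assuming the model loads as a standard `transformers` checkpoint (flags per lm-eval 0.4.x; `slurm_lm_eval.slurm` remains the source of truth):

```bash
# Sketch: one canonical lm-eval task run manually (the GSM8K row above).
# Paths are illustrative; the SLURM script is authoritative.
lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8 \
  --seed 42 \
  --gen_kwargs temperature=0.0,do_sample=False \
  --output_path evaluation/results/canonical/
```

Note that adding `--limit N` to such a command produces a subset run, which is no longer a REAL_CANONICAL_RUN and must not be reported as one.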
## Canonical benchmarks (bigcode-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|---|---|---|---|---|---|
| HumanEval | `humaneval` | 164 | 0 | pass@1 | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | pass@1 | `slurm_code_eval.slurm` |
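Similarly, a hand-run HumanEval evaluation might look like the sketch below (flag names from the bigcode-evaluation-harness CLI; `slurm_code_eval.slurm` is the source of truth):

```bash
# Sketch: greedy pass@1 HumanEval with bigcode-evaluation-harness.
accelerate launch main.py \
  --model OdaxAI/DANTE-Mosaic-3.5B \
  --tasks humaneval \
  --do_sample False \
  --temperature 0.0 \
  --n_samples 1 \
  --batch_size 8 \
  --precision bf16 \
  --allow_code_execution \
  --metric_output_path humaneval_metrics.json
```

`--allow_code_execution` is required because the harness runs generated code against the unit tests; execute it in an isolated environment.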
## How to run on Leonardo Booster
### 1. Copy the repo to Leonardo
```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```
### 2. Submit all jobs with one command
```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```
This submits two SLURM jobs:
- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)
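`submit_all.sh` itself is not reproduced here; a minimal sketch of what it is assumed to contain, i.e. two `sbatch` calls from the repo root:

```bash
#!/usr/bin/env bash
# Assumed shape of evaluation/submit_all.sh (a sketch, not the actual script).
set -euo pipefail
mkdir -p evaluation/slurm_logs
sbatch evaluation/slurm_lm_eval.slurm
sbatch evaluation/slurm_code_eval.slurm
```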
### 3. Monitor jobs
```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```
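Once a job has left the queue, SLURM accounting shows its final state and runtime:

```bash
# Final state, elapsed time, and peak memory of a completed job.
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,MaxRSS
```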
### 4. Parse results after completion
```bash
python3 evaluation/parse_canonical_results.py
```
This produces:
- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- A printed summary table in the terminal
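To spot-check the summary from the shell, pretty-printing the JSON is enough (this assumes only that the file is valid JSON; the key layout is whatever `parse_canonical_results.py` emits):

```bash
python3 -m json.tool evaluation/results/canonical/CANONICAL_SUMMARY.json
```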
### 5. Transfer results to your Mac
```bash
rsync -av nsavioli@login.leonardo.cineca.it:\
/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
  evaluation/results/canonical/
```
## Decoding and grading setup
All lm-eval runs use:
| Parameter | Value |
|---|---|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| harness version | lm-eval 0.4.5 |
All code eval runs use:
| Parameter | Value |
|---|---|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| harness | bigcode-evaluation-harness (latest main) |
## Environment variables
Both SLURM scripts set the following:
```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```
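On a fresh scratch area it can help to create the cache directories up front so the first job starts with them in place. A convenience step, not part of the SLURM scripts:

```bash
mkdir -p "${HF_HOME}/datasets" "${HF_HOME}/transformers" "${HF_HOME}/hub"
```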
## Results folder structure
```
evaluation/results/canonical/
    mmlu_<timestamp>.json              # raw lm-eval output
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json             # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md            # provenance table (generated by parse_canonical_results.py)
```
## Provenance table template
Once canonical results are available, `parse_canonical_results.py` will generate this table automatically. It will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot settings, output JSON path, date, and hardware.
## Important: do not mix result types
Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25) and canonical scores (from the lm-eval / bigcode harnesses, full splits) use different protocols and must never appear in the same table without an explicit separator and a label explaining the difference.

The parser script enforces this separation by saving canonical results only to `results/canonical/` and printing a clear header.