# DANTE-Mosaic-3.5B — Canonical Evaluation Suite

This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.

---

## Evaluation types

| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode-evaluation-harness on the full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | **No** |

These two types must **never be mixed** in the same table or chart.

---

## Canonical benchmarks (lm-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |

## Canonical benchmarks (bigcode-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |

---

## How to run on Leonardo Booster

### 1. Copy the repo to Leonardo

```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```

### 2. Submit all jobs with one command

```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```

This submits two SLURM jobs:

- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)

### 3. Monitor jobs

```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_.out
tail -f evaluation/slurm_logs/code_eval_.out
```

### 4. Parse results after completion

```bash
python3 evaluation/parse_canonical_results.py
```

This produces:

- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- a printed summary table in the terminal
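If you want to post-process the parsed scores programmatically, a minimal sketch is shown below. The flat `{task: {metric: value}}` layout of `CANONICAL_SUMMARY.json` is an assumption, not a documented schema; check the actual output of `parse_canonical_results.py` before relying on it.

```python
#!/usr/bin/env python3
"""Minimal sketch: read parsed canonical scores from CANONICAL_SUMMARY.json.

ASSUMPTION: the file is a flat {task: {metric: value}} mapping.
Verify against the real output of parse_canonical_results.py.
"""
import json
from pathlib import Path

summary_path = Path("evaluation/results/canonical/CANONICAL_SUMMARY.json")
summary = json.loads(summary_path.read_text())

# One line per (task, metric) pair, e.g. "gsm8k  exact_match,strict-match  0.61".
for task in sorted(summary):
    for metric, value in summary[task].items():
        print(f"{task:<20} {metric:<30} {value}")
```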
### 5. Transfer results to your Mac

```bash
rsync -av \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
  evaluation/results/canonical/
```

---

## Decoding and grading setup

All lm-eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |

All code-eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |

---

## Environment variables

Both SLURM scripts set the following:

```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```

---

## Results folder structure

```
evaluation/results/canonical/
  mmlu_.json                 # lm-eval output (raw)
  mmlu_pro_.json
  gsm8k_.json
  arc_challenge_.json
  hellaswag_.json
  truthfulqa_mc2_.json
  winogrande_.json
  ifeval_.json
  humaneval_/
    humaneval_generations.json
    humaneval_metrics.json
  mbpp_/
    mbpp_generations.json
    mbpp_metrics.json
  CANONICAL_SUMMARY.json     # parsed summary (generated by parse_canonical_results.py)
  CANONICAL_PROVENANCE.md    # provenance table (generated by parse_canonical_results.py)
```

---

## Provenance table template

Once canonical results are available, `parse_canonical_results.py` will generate this table automatically. It will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot settings, output JSON path, date, and hardware.

---

## Important: do not mix result types

Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25) and canonical scores (from the lm-eval / bigcode harnesses, full splits) use different protocols and must never appear in the same table without an explicit separator and a label explaining the difference. The parser script enforces this by saving canonical results only to `results/canonical/` and printing a clear header.
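To make the separation concrete, here is a hypothetical guard in the spirit of what the parser enforces. This is not the actual code in `parse_canonical_results.py`; the `eval_type` field and the glob pattern are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a result-type guard (NOT the real parser code).

Illustrates the rule above: every row in a canonical table must come from
a REAL_CANONICAL_RUN output. The "eval_type" key and the glob pattern are
assumptions for illustration only.
"""
import json
from pathlib import Path

CANONICAL_DIR = Path("evaluation/results/canonical")

def load_canonical_rows():
    """Collect summary rows, refusing anything not tagged as canonical."""
    rows = []
    for path in sorted(CANONICAL_DIR.glob("*.json")):
        record = json.loads(path.read_text())
        # Assumed tag: reject non-canonical records loudly instead of
        # silently mixing them into the canonical table.
        if record.get("eval_type", "REAL_CANONICAL_RUN") != "REAL_CANONICAL_RUN":
            raise ValueError(f"{path.name}: not REAL_CANONICAL_RUN, refusing to mix")
        rows.append(record)
    return rows

if __name__ == "__main__":
    print("=== REAL_CANONICAL_RUN results only (internal subsets excluded) ===")
    for row in load_canonical_rows():
        print(row)
```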