# DANTE-Mosaic-3.5B — Canonical Evaluation Suite

This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.

---

## Evaluation types

| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode-evaluation-harness on the full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | **No** |

These two types must **never be mixed** in the same table or chart.

---

## Canonical benchmarks (lm-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |
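
For reference, here is the kind of command `slurm_lm_eval.slurm` wraps for one of these tasks: a minimal sketch, assuming the lm-eval 0.4.5 CLI and the published Hub checkpoint. The SLURM script remains the source of truth for the exact flags.

```bash
# Hypothetical standalone run of one task (MMLU, 5-shot), mirroring the
# decoding setup documented below.
lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --seed 42 \
  --output_path evaluation/results/canonical/
```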

## Canonical benchmarks (bigcode-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |
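
Likewise, a sketch of the underlying bigcode-evaluation-harness call for HumanEval, assuming its standard `main.py` entry point; `slurm_code_eval.slurm` has the exact flags.

```bash
# Hypothetical standalone HumanEval run matching the pass@1 setup below
# (greedy decoding flags omitted here; see the decoding table further down).
# --allow_code_execution is required because the harness runs generated code.
accelerate launch main.py \
  --model OdaxAI/DANTE-Mosaic-3.5B \
  --tasks humaneval \
  --n_samples 1 \
  --batch_size 8 \
  --precision bf16 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path humaneval_metrics.json
```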

---

## How to run on Leonardo Booster

### 1. Copy the repo to Leonardo

```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```
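
A dry run (`-n`) previews what will be copied before committing to the transfer:

```bash
rsync -avn /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```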

### 2. Submit all jobs with one command

```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```

This submits two SLURM jobs:

- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)
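
If you ever need to resubmit just one harness, note that the wrapper reduces to two `sbatch` calls. A minimal sketch of its logic, assuming no inter-job dependencies (the real `submit_all.sh` may add options such as `--account` or `--partition`):

```bash
#!/usr/bin/env bash
# Sketch of evaluation/submit_all.sh: submit both evaluation jobs.
set -euo pipefail
cd "$(dirname "$0")"            # run from the evaluation/ directory
mkdir -p slurm_logs results/canonical
sbatch slurm_lm_eval.slurm      # lm-evaluation-harness tasks (~4–6 h)
sbatch slurm_code_eval.slurm    # HumanEval + MBPP (~1–2 h)
```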

### 3. Monitor jobs

```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```
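
Once a job has left the queue, `sacct` reports whether it completed cleanly:

```bash
# State, runtime, and exit code of a finished job (standard SLURM accounting):
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,ExitCode
```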

### 4. Parse results after completion

```bash
python3 evaluation/parse_canonical_results.py
```

This produces:

- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- A printed summary table in the terminal
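
To spot-check the scores without opening an editor, pretty-print the summary (its exact schema is whatever `parse_canonical_results.py` writes):

```bash
python3 -m json.tool evaluation/results/canonical/CANONICAL_SUMMARY.json
```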

### 5. Transfer results to your Mac

```bash
rsync -av \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
  evaluation/results/canonical/
```

---

## Decoding and grading setup

All lm-eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |
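
The decoding rows only affect the generative tasks (GSM8K, IFEval); the multiple-choice tasks are scored by log-likelihood and ignore generation settings. In lm-eval 0.4.5 CLI terms, the table above corresponds to flags along these lines (a sketch, not the literal script contents):

```bash
# Greedy decoding for a generative task (GSM8K, 8-shot); the
# multiple-choice tasks need no --gen_kwargs.
lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks gsm8k --num_fewshot 8 \
  --gen_kwargs temperature=0.0,do_sample=False \
  --seed 42 --batch_size 8 \
  --output_path evaluation/results/canonical/
```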

All code eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |

---

## Environment variables

Both SLURM scripts set the following:

```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```
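
Compute nodes on CINECA systems such as Leonardo typically have no outbound internet access, so the model and datasets must already be in this cache when a job starts. If they are not, prefetch from a login node first; a sketch using `huggingface-cli`, which ships with `huggingface_hub`:

```bash
# Run on a login node so the download lands in the shared HF_HOME cache:
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
huggingface-cli download OdaxAI/DANTE-Mosaic-3.5B
```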

---

## Results folder structure

```
evaluation/results/canonical/
    mmlu_<timestamp>.json            # lm-eval output (raw)
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json           # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md          # provenance table (generated by parse_canonical_results.py)
```

---

## Provenance table template

Once canonical results are available, `parse_canonical_results.py` will generate this automatically.
The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot settings, output JSON path, date, and hardware.

---

## Important: do not mix result types

Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25) and
canonical scores (from lm-eval / bigcode harness, full splits) use different protocols
and must never appear in the same table without an explicit separator and a label
explaining the difference.

The parser script enforces this by saving canonical results only to `results/canonical/`
and printing a clear header.