# DANTE-Mosaic-3.5B — Canonical Evaluation Suite

This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.

---

## Evaluation types

| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode-evaluation-harness on the full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | **No** |

These two types must **never be mixed** in the same table or chart.

---

## Canonical benchmarks (lm-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |
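
For reference, here is the kind of command `slurm_lm_eval.slurm` wraps for one of these tasks: a minimal sketch, assuming the lm-eval 0.4.5 CLI and the published Hub checkpoint. The SLURM script remains the source of truth for the exact flags.

```bash
# Hypothetical standalone run of one task (MMLU, 5-shot), mirroring the
# decoding setup documented below.
lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --seed 42 \
  --output_path evaluation/results/canonical/
```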

## Canonical benchmarks (bigcode-evaluation-harness)

| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |
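
Likewise, a sketch of the underlying bigcode-evaluation-harness call for HumanEval, assuming its standard `main.py` entry point; `slurm_code_eval.slurm` has the exact flags.

```bash
# Hypothetical standalone HumanEval run matching the pass@1 setup below
# (greedy decoding flags omitted here; see the decoding table further down).
# --allow_code_execution is required because the harness runs generated code.
accelerate launch main.py \
  --model OdaxAI/DANTE-Mosaic-3.5B \
  --tasks humaneval \
  --n_samples 1 \
  --batch_size 8 \
  --precision bf16 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path humaneval_metrics.json
```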

---

## How to run on Leonardo Booster

### 1. Copy the repo to Leonardo

```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```
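
A dry run (`-n`) previews what will be copied before committing to the transfer:

```bash
rsync -avn /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```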

### 2. Submit all jobs with one command

```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```

This submits two SLURM jobs:

- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)
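
If you ever need to resubmit just one harness, note that the wrapper reduces to two `sbatch` calls. A minimal sketch of its logic, assuming no inter-job dependencies (the real `submit_all.sh` may add options such as `--account` or `--partition`):

```bash
#!/usr/bin/env bash
# Sketch of evaluation/submit_all.sh: submit both evaluation jobs.
set -euo pipefail
cd "$(dirname "$0")"            # run from the evaluation/ directory
mkdir -p slurm_logs results/canonical
sbatch slurm_lm_eval.slurm      # lm-evaluation-harness tasks (~4–6 h)
sbatch slurm_code_eval.slurm    # HumanEval + MBPP (~1–2 h)
```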

### 3. Monitor jobs

```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```
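
Once a job has left the queue, `sacct` reports whether it completed cleanly:

```bash
# State, runtime, and exit code of a finished job (standard SLURM accounting):
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,ExitCode
```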

### 4. Parse results after completion

```bash
python3 evaluation/parse_canonical_results.py
```

This produces:

- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- A printed summary table in the terminal
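
To spot-check the scores without opening an editor, pretty-print the summary (its exact schema is whatever `parse_canonical_results.py` writes):

```bash
python3 -m json.tool evaluation/results/canonical/CANONICAL_SUMMARY.json
```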

### 5. Transfer results to your Mac

```bash
rsync -av \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
  evaluation/results/canonical/
```

---

## Decoding and grading setup

All lm-eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |
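
The decoding rows only affect the generative tasks (GSM8K, IFEval); the multiple-choice tasks are scored by log-likelihood and ignore generation settings. In lm-eval 0.4.5 CLI terms, the table above corresponds to flags along these lines (a sketch, not the literal script contents):

```bash
# Greedy decoding for a generative task (GSM8K, 8-shot); the
# multiple-choice tasks need no --gen_kwargs.
lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks gsm8k --num_fewshot 8 \
  --gen_kwargs temperature=0.0,do_sample=False \
  --seed 42 --batch_size 8 \
  --output_path evaluation/results/canonical/
```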

All code eval runs use:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |

---

## Environment variables

Both SLURM scripts set the following:

```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```
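
Compute nodes on CINECA systems such as Leonardo typically have no outbound internet access, so the model and datasets must already be in this cache when a job starts. If they are not, prefetch from a login node first; a sketch using `huggingface-cli`, which ships with `huggingface_hub`:

```bash
# Run on a login node so the download lands in the shared HF_HOME cache:
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
huggingface-cli download OdaxAI/DANTE-Mosaic-3.5B
```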

---

## Results folder structure

```
evaluation/results/canonical/
    mmlu_<timestamp>.json            # lm-eval output (raw)
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json           # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md          # provenance table (generated by parse_canonical_results.py)
```

---

## Provenance table template

Once canonical results are available, `parse_canonical_results.py` will generate this automatically.
The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot settings, output JSON path, date, and hardware.

---

## Important: do not mix result types

Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25) and
canonical scores (from lm-eval / bigcode harness, full splits) use different protocols
and must never appear in the same table without an explicit separator and a label
explaining the difference.

The parser script enforces this by saving canonical results only to `results/canonical/`
and printing a clear header.