# DANTE-Mosaic-3.5B — Canonical Evaluation Suite
This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.
---
## Evaluation types
| Type | Description | Comparable to leaderboard? |
|------|-------------|---------------------------|
| **REAL_CANONICAL_RUN** | Run via lm-evaluation-harness or bigcode harness on full benchmark dataset | **Yes** |
| **REAL_INTERNAL_SUBSET** | Run on 2540 hand-curated problems in `scripts/_benchmark_dante_offline.py` | **No** |
These two types must **never be mixed** in the same table or chart.
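This rule can also be made mechanical in reporting code. A minimal sketch of such a guard (the `run_type` field and the function are hypothetical, not part of the actual parser):

```python
# Hypothetical guard: refuse to aggregate result rows whose run types differ.
ALLOWED_TYPES = {"REAL_CANONICAL_RUN", "REAL_INTERNAL_SUBSET"}

def check_single_run_type(rows):
    """Each row is a dict with a 'run_type' key; raise if types are mixed."""
    types = {row["run_type"] for row in rows}
    unknown = types - ALLOWED_TYPES
    if unknown:
        raise ValueError(f"Unknown run type(s): {sorted(unknown)}")
    if len(types) > 1:
        raise ValueError(
            "REAL_CANONICAL_RUN and REAL_INTERNAL_SUBSET must never be "
            f"mixed in one table; got: {sorted(types)}"
        )
    return types.pop()
```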
---
## Canonical benchmarks (lm-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| MMLU | `mmlu` | 14,042 | 5 | `acc` | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | `acc` | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | `exact_match,strict-match` | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | `acc_norm` | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | `acc_norm` | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | `mc2` | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | `acc` | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | `prompt_level_strict_acc` | `slurm_lm_eval.slurm` |
## Canonical benchmarks (bigcode-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|-----------|-----------|-----------|----------|--------|--------|
| HumanEval | `humaneval` | 164 | 0 | `pass@1` | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | `pass@1` | `slurm_code_eval.slurm` |
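With `n_samples = 1`, pass@1 reduces to the fraction of problems whose single greedy sample passes all tests. For reference, the general unbiased pass@k estimator (introduced with HumanEval and used by code evaluation harnesses) can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for n generated samples of which c are correct:
    1 - C(n-c, k) / C(n, k). With n = k = 1 this is just c (0 or 1)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: at least one passes
    return 1.0 - comb(n - c, k) / comb(n, k)
```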
---
## How to run on Leonardo Booster
### 1. Copy the repo to Leonardo
```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```
### 2. Submit all jobs with one command
```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```
This submits two SLURM jobs:
- `slurm_lm_eval.slurm` — all lm-eval tasks (~46 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~12 hours on 1× A100-40GB)
### 3. Monitor jobs
```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```
### 4. Parse results after completion
```bash
python3 evaluation/parse_canonical_results.py
```
This produces:
- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- Printed summary table in terminal
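As an illustration of what this parsing step does, here is a minimal sketch assuming lm-eval's usual output layout (a top-level `results` map of task name to metric dict); the actual `parse_canonical_results.py` may differ:

```python
import json
from pathlib import Path

def collect_scores(results_dir):
    """Collect {task: {metric: value}} from raw lm-eval output JSONs,
    keeping only numeric metric values."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path) as f:
            data = json.load(f)
        for task, metrics in data.get("results", {}).items():
            summary[task] = {k: v for k, v in metrics.items()
                             if isinstance(v, (int, float))}
    return summary
```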
### 5. Transfer results to your Mac
```bash
rsync -av nsavioli@login.leonardo.cineca.it:\
/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
evaluation/results/canonical/
```
---
## Decoding and grading setup
All lm-eval runs use:
| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| `harness version` | lm-eval 0.4.5 |
All code eval runs use:
| Parameter | Value |
|-----------|-------|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| `harness` | bigcode-evaluation-harness (latest main) |
---
## Environment variables
Both SLURM scripts set the following:
```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```
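All four cache paths hang off the single `HF_HOME` root; a small helper makes that layout explicit (illustrative only, mirroring the exports above):

```python
import os

def hf_cache_paths(hf_home):
    """Derive the Hugging Face cache sub-paths from a single HF_HOME root."""
    return {
        "HF_HOME": hf_home,
        "HF_DATASETS_CACHE": os.path.join(hf_home, "datasets"),
        "TRANSFORMERS_CACHE": os.path.join(hf_home, "transformers"),
        "HF_HUB_CACHE": os.path.join(hf_home, "hub"),
    }
```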
---
## Results folder structure
```
evaluation/results/canonical/
mmlu_<timestamp>.json # lm-eval output (raw)
mmlu_pro_<timestamp>.json
gsm8k_<timestamp>.json
arc_challenge_<timestamp>.json
hellaswag_<timestamp>.json
truthfulqa_mc2_<timestamp>.json
winogrande_<timestamp>.json
ifeval_<timestamp>.json
humaneval_<timestamp>/
humaneval_generations.json
humaneval_metrics.json
mbpp_<timestamp>/
mbpp_generations.json
mbpp_metrics.json
CANONICAL_SUMMARY.json # parsed summary (generated by parse_canonical_results.py)
CANONICAL_PROVENANCE.md # provenance table (generated by parse_canonical_results.py)
```
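Because file names embed a timestamp, the most recent run per benchmark can be selected by name alone. A sketch assuming a lexically sortable `<task>_<timestamp>.json` scheme (the exact timestamp format is not specified here, so the regex is an assumption):

```python
import re

# Assumed pattern: <task>_<YYYYMMDD[_-]HHMMSS>.json
FILENAME_RE = re.compile(r"^(?P<task>[a-z0-9_]+?)_(?P<ts>\d{8}[_-]?\d{6})\.json$")

def latest_per_task(filenames):
    """Map each task to its most recent result file, comparing timestamp strings."""
    latest = {}
    for name in filenames:
        m = FILENAME_RE.match(name)
        if not m:
            continue  # skip files that do not follow the naming scheme
        task, ts = m.group("task"), m.group("ts")
        if task not in latest or ts > latest[task][1]:
            latest[task] = (name, ts)
    return {task: name for task, (name, _) in latest.items()}
```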
---
## Provenance table template
Once canonical results are available, `parse_canonical_results.py` will generate this automatically.
The table will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot, output JSON path, date, and hardware.
---
## Important: do not mix result types
Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25)
and canonical scores (from lm-eval / bigcode harness, full splits) use different
protocols and must never appear in the same table without an explicit separator and
label explaining the difference.
The parser script enforces this by saving canonical results only to `results/canonical/`
and printing a clear header.