# DANTE-Mosaic-3.5B — Canonical Evaluation Suite
This folder contains all evaluation scripts for `OdaxAI/DANTE-Mosaic-3.5B`.
## Evaluation types
| Type | Description | Comparable to leaderboard? |
|---|---|---|
| REAL_CANONICAL_RUN | Run via lm-evaluation-harness or bigcode-evaluation-harness on the full benchmark dataset | Yes |
| REAL_INTERNAL_SUBSET | Run on 25–40 hand-curated problems in `scripts/_benchmark_dante_offline.py` | No |
These two types must never be mixed in the same table or chart.
## Canonical benchmarks (lm-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|---|---|---|---|---|---|
| MMLU | `mmlu` | 14,042 | 5 | acc | `slurm_lm_eval.slurm` |
| MMLU-Pro | `mmlu_pro` | 4,500 | 5 | acc | `slurm_lm_eval.slurm` |
| GSM8K | `gsm8k` | 1,319 | 8 | exact_match,strict-match | `slurm_lm_eval.slurm` |
| ARC-Challenge | `arc_challenge` | 1,172 | 25 | acc_norm | `slurm_lm_eval.slurm` |
| HellaSwag | `hellaswag` | 10,042 | 10 | acc_norm | `slurm_lm_eval.slurm` |
| TruthfulQA MC2 | `truthfulqa_mc2` | 817 | 0 | mc2 | `slurm_lm_eval.slurm` |
| Winogrande | `winogrande` | 1,267 | 5 | acc | `slurm_lm_eval.slurm` |
| IFEval | `ifeval` | 541 | 0 | prompt_level_strict_acc | `slurm_lm_eval.slurm` |
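For reference, a single canonical task can also be launched by hand. A minimal sketch, assuming the model loads as a standard `transformers` checkpoint (flags per lm-eval 0.4.x; `slurm_lm_eval.slurm` remains the source of truth):

```bash
# Sketch: one canonical lm-eval task run manually (the GSM8K row above).
# Paths are illustrative; the SLURM script is authoritative.
lm_eval --model hf \
  --model_args pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8 \
  --seed 42 \
  --gen_kwargs temperature=0.0,do_sample=False \
  --output_path evaluation/results/canonical/
```

Note that adding `--limit N` to such a command produces a subset run, which is no longer a REAL_CANONICAL_RUN and must not be reported as one.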
## Canonical benchmarks (bigcode-evaluation-harness)
| Benchmark | Task name | N samples | Few-shot | Metric | Script |
|---|---|---|---|---|---|
| HumanEval | `humaneval` | 164 | 0 | pass@1 | `slurm_code_eval.slurm` |
| MBPP | `mbpp` | 374 | 0 | pass@1 | `slurm_code_eval.slurm` |
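Similarly, a hand-run HumanEval evaluation might look like the sketch below (flag names from the bigcode-evaluation-harness CLI; `slurm_code_eval.slurm` is the source of truth):

```bash
# Sketch: greedy pass@1 HumanEval with bigcode-evaluation-harness.
accelerate launch main.py \
  --model OdaxAI/DANTE-Mosaic-3.5B \
  --tasks humaneval \
  --do_sample False \
  --temperature 0.0 \
  --n_samples 1 \
  --batch_size 8 \
  --precision bf16 \
  --allow_code_execution \
  --metric_output_path humaneval_metrics.json
```

`--allow_code_execution` is required because the harness runs generated code against the unit tests; execute it in an isolated environment.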
## How to run on Leonardo Booster
### 1. Copy the repo to Leonardo
```bash
rsync -av /Users/nicolosavioli/Desktop/DANTE-T1-Seed/ \
  nsavioli@login.leonardo.cineca.it:/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/
```
### 2. Submit all jobs with one command
```bash
# From the repo root on Leonardo:
bash evaluation/submit_all.sh
```
This submits two SLURM jobs:
- `slurm_lm_eval.slurm` — all lm-eval tasks (~4–6 hours on 1× A100-40GB)
- `slurm_code_eval.slurm` — HumanEval + MBPP (~1–2 hours on 1× A100-40GB)
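`submit_all.sh` itself is not reproduced here; a minimal sketch of what it is assumed to contain, i.e. two `sbatch` calls from the repo root:

```bash
#!/usr/bin/env bash
# Assumed shape of evaluation/submit_all.sh (a sketch, not the actual script).
set -euo pipefail
mkdir -p evaluation/slurm_logs
sbatch evaluation/slurm_lm_eval.slurm
sbatch evaluation/slurm_code_eval.slurm
```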
### 3. Monitor jobs
```bash
squeue -u $USER
tail -f evaluation/slurm_logs/lm_eval_<JOB_ID>.out
tail -f evaluation/slurm_logs/code_eval_<JOB_ID>.out
```
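Once a job has left the queue, SLURM accounting shows its final state and runtime:

```bash
# Final state, elapsed time, and peak memory of a completed job.
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,MaxRSS
```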
### 4. Parse results after completion
```bash
python3 evaluation/parse_canonical_results.py
```
This produces:
- `results/canonical/CANONICAL_SUMMARY.json` — all scores as JSON
- `results/canonical/CANONICAL_PROVENANCE.md` — full provenance table (Markdown)
- A printed summary table in the terminal
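To spot-check the summary from the shell, pretty-printing the JSON is enough (this assumes only that the file is valid JSON; the key layout is whatever `parse_canonical_results.py` emits):

```bash
python3 -m json.tool evaluation/results/canonical/CANONICAL_SUMMARY.json
```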
### 5. Transfer results to your Mac
```bash
rsync -av nsavioli@login.leonardo.cineca.it:\
/leonardo_scratch/large/userexternal/nsavioli/DANTE-T1-Seed/evaluation/results/canonical/ \
  evaluation/results/canonical/
```
## Decoding and grading setup
All lm-eval runs use:
| Parameter | Value |
|---|---|
| `temperature` | 0.0 (greedy) |
| `do_sample` | false |
| `batch_size` | 8 |
| `seed` | 42 |
| `precision` | bfloat16 |
| `device` | CUDA (A100-40GB) |
| harness version | lm-eval 0.4.5 |
All code eval runs use:
| Parameter | Value |
|---|---|
| `temperature` | 0.0 (greedy) |
| `n_samples` | 1 (pass@1) |
| `batch_size` | 8 |
| `precision` | bfloat16 |
| harness | bigcode-evaluation-harness (latest main) |
## Environment variables
Both SLURM scripts set the following:
```bash
export HF_HOME="/leonardo_scratch/large/userexternal/nsavioli/hf_cache"
export HF_DATASETS_CACHE="${HF_HOME}/datasets"
export TRANSFORMERS_CACHE="${HF_HOME}/transformers"
export HF_HUB_CACHE="${HF_HOME}/hub"
export TOKENIZERS_PARALLELISM=false
```
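On a fresh scratch area it can help to create the cache directories up front so the first job starts with them in place. A convenience step, not part of the SLURM scripts:

```bash
mkdir -p "${HF_HOME}/datasets" "${HF_HOME}/transformers" "${HF_HOME}/hub"
```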
## Results folder structure
```
evaluation/results/canonical/
    mmlu_<timestamp>.json              # raw lm-eval output
    mmlu_pro_<timestamp>.json
    gsm8k_<timestamp>.json
    arc_challenge_<timestamp>.json
    hellaswag_<timestamp>.json
    truthfulqa_mc2_<timestamp>.json
    winogrande_<timestamp>.json
    ifeval_<timestamp>.json
    humaneval_<timestamp>/
        humaneval_generations.json
        humaneval_metrics.json
    mbpp_<timestamp>/
        mbpp_generations.json
        mbpp_metrics.json
    CANONICAL_SUMMARY.json             # parsed summary (generated by parse_canonical_results.py)
    CANONICAL_PROVENANCE.md            # provenance table (generated by parse_canonical_results.py)
```
## Provenance table template
Once canonical results are available, `parse_canonical_results.py` will generate this table automatically. It will include: benchmark name, harness, task name, N samples, metric, decoding/few-shot settings, output JSON path, date, and hardware.
## Important: do not mix result types
Internal subset scores (from `scripts/_benchmark_dante_offline.py`, N=30/40/25) and canonical scores (from the lm-eval / bigcode harnesses, full splits) use different protocols and must never appear in the same table without an explicit separator and a label explaining the difference.

The parser script enforces this separation by saving canonical results only to `results/canonical/` and printing a clear header.