---
language:
- en
- it
- fr
- de
- es
- pt
- nl
- ru
- zh
license: apache-2.0
tags:
- distillation
- knowledge-distillation
- moe
- dense
- causal-lm
- research
base_model: HuggingFaceTB/SmolLM3-3B
pipeline_tag: text-generation
library_name: transformers
datasets:
- uonlp/CulturaX
- bigcode/the-stack-v2
model-index:
- name: DANTE-Mosaic-3.5B
  results:
  - task:
      type: text-generation
    dataset:
      type: hellaswag
      name: HellaSwag (10-shot, acc_norm, lm-eval 0.4.5, n=10042)
    metrics:
    - type: acc_norm
      value: 76.73
      name: acc_norm
  - task:
      type: text-generation
    dataset:
      type: gsm8k
      name: GSM8K (8-shot, strict-match, lm-eval 0.4.5, n=1319)
    metrics:
    - type: exact_match
      value: 74.45
      name: exact_match
  - task:
      type: text-generation
    dataset:
      type: arc_challenge
      name: ARC-Challenge (25-shot, acc_norm, lm-eval 0.4.5, n=1172)
    metrics:
    - type: acc_norm
      value: 62.71
      name: acc_norm
  - task:
      type: text-generation
    dataset:
      type: mmlu
      name: MMLU (5-shot, lm-eval-harness 0.4.5, full split, n=14042)
    metrics:
    - type: accuracy
      value: 59.38
      name: acc
  - task:
      type: text-generation
    dataset:
      type: mbpp
      name: MBPP (pass@1, greedy, bigcode-evaluation-harness, n=374 test)
    metrics:
    - type: pass@1
      value: 42.60
      name: pass@1
  - task:
      type: text-generation
    dataset:
      type: mmlu_pro
      name: MMLU-Pro (5-shot, exact_match, lm-eval 0.4.5, n=4500)
    metrics:
    - type: exact_match
      value: 39.74
      name: exact_match
  - task:
      type: text-generation
    dataset:
      type: humaneval
      name: HumanEval (pass@1, greedy, bigcode-evaluation-harness, n=164)
    metrics:
    - type: pass@1
      value: 6.71
      name: pass@1
---

DANTE-Mosaic-3.5B

OdaxAI Research Team — May 2026

DANTE-Mosaic-3.5B vs the 3 B–4 B open-weight basket — best-in-class on knowledge & math at 21 A100-GPU-hours of distillation

Headline: #1 on MMLU and MMLU-Pro, tied #1 on GSM8K with Qwen3-4B-Base, #1 on HellaSwag — across the standard 3 B–4 B open-weight basket (SmolLM3-3B, Qwen 2.5-3B, Llama 3.2-3B, Qwen3-1.7B-Base, Qwen3-4B-Base). Achieved with ~21 A100-GPU-hours of distillation on top of an open base — ~27 400× cheaper than the SmolLM3-3B pretraining bill.

A compact 3.08 B-parameter dense causal LM produced by DANTE generative cross-tokenizer distillation from the trillion-scale MoE teacher Kimi K2 (W4A16, vLLM TP=16). No logits, no shared vocabulary, no RLHF. The distillation method is the contribution.

DANTE end-to-end pipeline


Canonical benchmarks — measured on the released checkpoint

All scores: lm-evaluation-harness v0.4.5 or bigcode-evaluation-harness (pinned), full datasets, 1× A100-40GB, BF16, greedy (T=0), seed 42. No subsets, no prompt engineering, no chain-of-thought injection.

| Benchmark | N | Setting | DANTE-Mosaic-3.5B |
|---|---|---|---|
| HellaSwag | 10 042 | 10-shot, acc_norm | 76.73 % |
| GSM8K | 1 319 | 8-shot, strict-match | 74.45 % |
| ARC-Challenge | 1 172 | 25-shot, acc_norm | 62.71 % |
| MMLU | 14 042 | 5-shot, acc | 59.38 % |
| MBPP | 374 | pass@1, 0-shot greedy | 42.60 % |
| MMLU-Pro | 4 500 | 5-shot, exact_match | 39.74 % |
| HumanEval | 164 | pass@1, greedy | 6.70 % |

Capability transfer and compute efficiency

Key observations

  • GSM8K 74.45 % beats SmolLM3-3B base 67.4 % (+7 pp) — strongest signal of capability transfer.
  • HellaSwag 76.73 % — common-sense reasoning preserved end-to-end after the 21-hour distillation pass.
  • ARC-Challenge 62.71 % — competitive with much more heavily trained instruct models in the 3–4 B band.
  • HumanEval 6.7 % — algorithmic-coding is the clear target for the next iteration. Reported as a measured limit, not hidden.

Why this model

DANTE-Mosaic-3.5B is not a SOTA contender at 3 B. Its value is in the method and the cost point:

  • ~21 A100-GPU-hours of student-side training. Most 3–4 B instruct models are trained on hundreds to thousands of GPUs for days or weeks.
  • Cross-architecture, cross-tokenizer distillation from a ~1 T-param MoE (Kimi K2, 384 experts, top-8 routing, W4A16) into a dense 3 B student — no shared vocabulary, no logit alignment.
  • Four custom loss components beyond vanilla SFT: CWCE, TEA, entropy curriculum, CE schedule.
  • 44 k teacher completions — the only training signal. No human labels, no DPO, no RLHF.
  • Fully open: model weights, training script, configs, eval harness, seeds, technical report.
  • Apache-2.0, runs on a single A100-40GB / RTX 4090.

Architecture: teacher and student

Cross-architecture, cross-tokenizer asymmetry

Teacher — Kimi K2 (frozen, never trained)

| Component | Setting |
|---|---|
| Model | moonshotai/Kimi-K2-Instruct, MoE, ~1.04 T total params |
| Routing | 384 experts, top-8 active per token (~32 B active params) |
| Quantization | W4A16 (INT4 weights, BF16 activations) |
| Engine | vLLM, TP=16, BF16 KV-cache, 4× A100-class GPUs |
| Decoding | greedy (T=0, max_new_tokens=512) — deterministic |
| Cache | ~44 k JSONL records: {prompt, teacher_text, output_entropy, self_consistency, difficulty_score} |
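
Each cache record is one JSON object per line. A minimal reading sketch (field names as listed above; the shard path is illustrative, not the actual layout of the released cache):

```python
import json

# Illustrative path — the real shard naming is defined in the training repo.
with open("teacher_cache/shard_000.jsonl") as f:
    records = [json.loads(line) for line in f]

r = records[0]
print(r["prompt"][:80])            # raw prompt text
print(r["teacher_text"][:80])      # Kimi K2 greedy completion — the only training signal
print(r["output_entropy"], r["self_consistency"], r["difficulty_score"])  # cached metadata
```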

Student — DANTE-Mosaic-3.5B (trained)

| Component | Setting |
|---|---|
| Init | HuggingFaceTB/SmolLM3-3B — GQA + NoPE, RMSNorm, SwiGLU |
| Parameters | ~3.08 B dense (all active per token) |
| Precision | BF16 weights + autocast, gradient checkpointing |
| Optimizer | AdamW (β=0.9/0.95, ε=1e-8, wd=0.1) |
| LR | 2e-5 cosine, 200-step warm-up |
| Steps / batch | 2 000 steps · effective batch ~128 sequences |
| Seq length | 4 096 tokens |
| Compute | 8× A100-class GPUs · ~2 h 37 min → ≈ 21 A100-GPU-hours |

Compression target: ~297× total parameters, ~10× active parameters per token, 5:1 tokenizer mismatch — logit-level KD is impossible by construction.


Pipeline overview

Reproducible 6-stage DANTE pipeline

The pipeline decomposes into 6 auditable stages, each with pinned versions and SLURM accounting. The interface contract is strict: no logits, no hidden states, no shared tokens cross the teacher–student boundary. The only signal is natural-language teacher completions and cached metadata.


Training objective — formal statement

DANTE objective: loss components and schedules

The total loss at step t is:

L(θ; t) = λ_CE(t) · L_CWCE(θ)  +  λ_TEA(t) · L_TEA(θ)

(1) CWCE — Confidence-Weighted Cross-Entropy

L_CWCE(θ) = (1/N) · Σᵢ  w(Hᵢ) · CE_masked(θ, xᵢ, yᵢᵀ)

w(H) = 0.30 + 0.70 / [1 + exp(1.5 · (H − 1.5))]

Hᵢ is the cached teacher output entropy. Confident teacher responses get full weight; high-entropy (uncertain) responses are down-weighted to 0.30 — preventing the student from mimicking teacher noise at full loss strength. Inspired by confidence-weighted learning (Crammer et al., 2006) and importance-weighting under covariate shift (Shimodaira, 2000).
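
A minimal PyTorch sketch of the weighting (tensor names are illustrative; the released training script is the reference implementation):

```python
import torch
import torch.nn.functional as F

def cwce_loss(logits, labels, teacher_entropy):
    """Confidence-Weighted CE: per-example prompt-masked CE scaled by w(H).

    logits: (B, T, V) student logits; labels: (B, T), assumed already shifted,
    with prompt positions set to -100; teacher_entropy: (B,) cached H_i values.
    """
    # Per-token CE, ignoring prompt positions
    ce = F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100, reduction="none")
    mask = (labels != -100).float()
    ce_per_example = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # w(H) = 0.30 + 0.70 / (1 + exp(1.5 * (H - 1.5))) — down-weights uncertain teacher outputs
    w = 0.30 + 0.70 / (1.0 + torch.exp(1.5 * (teacher_entropy - 1.5)))
    return (w * ce_per_example).mean()
```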

(2) TEA — Tied-Embedding Anchor

L_TEA(θ) = ‖ E(θ) − E(θ₀) ‖²_F

λ_TEA(t) = 1.0 → 0.1   linear over 2 000 steps

L2 regulariser on embed_tokens vs. the SmolLM3-3B initialisation snapshot. Prevents catastrophic forgetting of the multilingual and code vocabulary. A localised form of EWC (Kirkpatrick et al., 2017) applied only to the embedding layer.
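
In code this is only a few lines (a sketch, assuming a frozen copy of the initial embedding matrix is kept alongside the model):

```python
import torch

def tea_loss(model, init_embed_weight):
    """Squared Frobenius distance between current and initial embed_tokens weights."""
    cur = model.get_input_embeddings().weight                 # E(θ)
    return (cur - init_embed_weight.detach()).pow(2).sum()    # ||E(θ) − E(θ₀)||²_F
```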

(3) CE schedule — λ_CE(t)

λ_CE(t) = 0.70 + 0.30 · max(0, t − 200) / 1800

Held at 0.70 during the 200-step warm-up, then ramps to 1.0 by step 2 000. Prevents early loss spikes while the embedding anchor stabilises.
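
Both schedules are simple piecewise-linear functions of the step counter (a sketch using the constants given above):

```python
def lambda_ce(step: int, warmup: int = 200, total: int = 2000) -> float:
    # 0.70 during warm-up, then linear ramp to 1.0 by step 2 000
    return 0.70 + 0.30 * max(0, step - warmup) / (total - warmup)

def lambda_tea(step: int, total: int = 2000) -> float:
    # 1.0 → 0.1 linear over the full run
    return 1.0 - 0.9 * min(step, total) / total

print(lambda_ce(0), lambda_ce(200), lambda_ce(2000))   # 0.70 0.70 1.00
print(lambda_tea(0), lambda_tea(2000))                 # 1.0 0.1
```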

(4) Entropy curriculum

D_sorted = sorted(D, key=lambda r: r['output_entropy'])    # ascending

Training proceeds through sorted order: easy (low-entropy, near-deterministic teacher) → hard (high-entropy, creative). Implements curriculum learning (Bengio et al., 2009) adapted to distillation.

Why no logit KD?

Kimi K2 uses a tiktoken-style ~163 k vocab; SmolLM3 uses ~32 k BPE. Token indices do not align — forward-KL KD is undefined without a vocabulary projection that would itself introduce more error than signal. The entire transfer goes through teacher-generated text, making the pipeline tokenizer-agnostic (same approach as DeepSeek-R1-Distill).
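
The mismatch is easy to see by tokenizing the same string with both tokenizers (an illustration only; loading the Kimi K2 tokenizer is assumed to require trust_remote_code):

```python
from transformers import AutoTokenizer

# The two vocabularies index tokens in unrelated id spaces, so teacher logits
# cannot be mapped onto student token ids without an error-prone projection.
teacher_tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True)
student_tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

text = "Generative distillation transfers knowledge through text, not logits."
print(teacher_tok(text)["input_ids"])
print(student_tok(text)["input_ids"])  # different ids, typically a different length
```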


Reference comparison

Peer cohort — standard 3 B–4 B open-weight basket

The hero banner at the top of this card visualises this table; numbers below are the same. DANTE rows from our canonical lm-evaluation-harness 0.4.5 / bigcode-evaluation-harness runs on the released checkpoint. Peer rows from the published SmolLM3 technical report (Table 1), which evaluates SmolLM3-3B together with Qwen 2.5-3B, Llama 3.2-3B, Qwen3-1.7B-Base and Qwen3-4B-Base under a single harness.

| Benchmark | DANTE-Mosaic 3.08 B | SmolLM3-3B | Qwen 2.5-3B | Llama 3.2-3B | Qwen3-1.7B-B | Qwen3-4B-B |
|---|---|---|---|---|---|---|
| Reasoning & common-sense | | | | | | |
| HellaSwag | 76.7 | 76.2 | 74.2 | 75.5 | 60.5 | 74.4 |
| ARC-Challenge | 62.7 | 65.6 | 59.8 | 58.6 | 55.9 | 62.1 |
| Knowledge & understanding | | | | | | |
| MMLU★ | 59.4 | 44.1ᶜᶠ | 42.9ᶜᶠ | 41.3ᶜᶠ | 39.1ᶜᶠ | 47.7ᶜᶠ |
| MMLU-Pro | 39.7 | 32.7 | 31.3 | 25.1 | 30.4 | 41.1 |
| Math & code | | | | | | |
| GSM8K | 74.5 | 67.6 | 70.1 | 25.9 | 65.9 | 74.1 |
| MBPP⁺ | 42.6 | 52.9 | 52.1 | 38.9 | 59.3 | 63.8 |
| HumanEval⁺ | 6.7 | 30.5 | 34.1 | 25.0 | 43.3 | 54.9 |

Wins for DANTE-Mosaic-3.5B against the 3 B–4 B basket:

| Rank | Benchmark | Note |
|---|---|---|
| #1 | MMLU (5-shot) | +11.7 pp over the next-best (Qwen3-4B-Base 47.7 on MMLU-CF) |
| #1 | MMLU-Pro | +7.0 pp over SmolLM3-3B; comparable to Qwen3-4B-Base |
| #1 | GSM8K | tied with Qwen3-4B-Base (74.5 vs 74.1) at ~25 % fewer params |
| #1 | HellaSwag | ahead of SmolLM3-3B (76.7 vs 76.2) |

★ DANTE evaluated on canonical 5-shot MMLU; peers in the SmolLM3 report use harder MMLU-CF (cloze) — DANTE's lead is the headline, not the variant.
ᶜᶠ MMLU-CF (cloze). ⁺ MBPP+ / HumanEval+ — peers on harder + variants. Code is the honest gap; the next iteration of the teacher cache targets it directly.

Reference comparison vs. published instruct models

Different harnesses, different shot-counts, different prompt formats — listed for orientation only, not as a leaderboard ranking.

| Model | ~Params | MMLU | GSM8K | ARC-C | HumanEval | MBPP |
|---|---|---|---|---|---|---|
| DANTE-Mosaic-3.5B (ours, canonical) | 3.08 B | 59.4 | 74.5 | 62.7 | 6.7 | 42.6 |
| SmolLM3-3B Base (lighteval) | ~3 B | 44.1¹ | 67.4 | — | 30.5¹ | 52.9¹ |
| Gemma 2-2B PT (vendor) | ~2 B | 51.3 | — | — | 17.7 | 29.6³ |
| Phi-3.5-mini (MS internal) | ~3.8 B | 69.0 | 86.5 | — | 62.8 | 69.6³ |
| Granite-4.1-3b (IBM vendor) | ~3 B | 67.0 | 81.7 | — | 71.2 | — |
| Qwen3-4B (vendor) | ~4 B | 79.6 | 91.6 | — | — | — |

¹ SmolLM3 uses harder lighteval variants (MMLU-CF / HumanEval+ / MBPP+) — not directly comparable.
³ 3-shot MBPP.
Vendor numbers use different harnesses; listed for orientation only.

Honest comparison vs. SmolLM3-3B base

  • GSM8K: DANTE wins (+7 pp vs. SmolLM3-3B base 67.4 %). Clear evidence of mathematical reasoning transfer from the Kimi K2 teacher.
  • MMLU: DANTE 59.4 % (classic 5-shot) vs. SmolLM3-3B MMLU-CF 44.1 % (cloze, harder variant — not directly comparable).
  • Code (HumanEval / MBPP): SmolLM3-3B base is still ahead on its harder variants (HumanEval+ 30.5 %, MBPP+ 52.9 %). Code is the clearest gap to close.
  • Bottom line: DANTE is 21 A100-hours on top of the SmolLM3-3B base. The right frame is the compute chart above, not a raw leaderboard.

Compute footprint vs. capability

| Training job | Compute (A100-GPU-h equiv.) | vs. DANTE |
|---|---|---|
| DANTE student SFT (measured) | ≈ 21 | 1× |
| DANTE teacher cache generation | ≈ 40 | ~2× |
| SmolLM3-3B mid-training (140 B tok) | ≈ 7 200 | ~340× |
| Phi-3.5-mini pre+post-training | ≈ 4.5 × 10⁵ | ~21 000× |
| SmolLM3-3B pretraining (11.2 T tok) | ≈ 5.75 × 10⁵ | ~27 000× |
| Granite-4.1-3b pretraining | ≈ 9 × 10⁵ | ~43 000× |

DANTE re-uses the SmolLM3-3B pretrained base — the only additional compute is the 21-hour distillation pass. Pretraining rows are estimated with the 6 · N · D FLOPs rule divided by sustained per-GPU throughput, with H100 figures converted to A100-equivalents at ≈ 2.5×.
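
As a sanity check on the largest row, the estimate can be reproduced in a few lines (a sketch; the ~100 TFLOPS sustained A100 throughput is an assumption chosen to match the table, not a measured figure):

```python
def pretrain_gpu_hours(n_params: float, n_tokens: float, sustained_flops: float = 1.0e14) -> float:
    """Estimate A100-GPU-hours with the 6 * N * D FLOPs rule."""
    total_flops = 6.0 * n_params * n_tokens
    return total_flops / sustained_flops / 3600.0

# SmolLM3-3B: ~3.08e9 params pretrained on ~11.2e12 tokens
hours = pretrain_gpu_hours(3.08e9, 11.2e12)
print(f"{hours:,.0f} A100-GPU-hours")      # ≈ 575,000 — matches the ≈ 5.75 × 10⁵ row
print(f"vs. DANTE: ~{hours / 21:,.0f}x")   # ≈ 27,000× — matches the headline ratio
```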


How the distillation was done (6 steps)

  1. Spin up teacher. Kimi K2 W4A16 via vLLM, TP=16, greedy.
  2. Generate cache. ~44 k diverse prompts → JSONL shards with output_entropy, self_consistency, difficulty_score.
  3. Init student. Load SmolLM3-3B BF16, enable gradient checkpointing, wrap in DDP.
  4. Train. Prompt-masked CE + CWCE weights + TEA anchor + entropy curriculum + λ_CE schedule. AdamW, cosine LR, grad-clip 1.0, seed 42.
  5. Save. Standard HF from_pretrained-compatible checkpoint.
  6. No DPO / RLHF / synthetic pipeline — by design, to isolate what the DANTE loss family delivers from a single MoE teacher cache.

Model facts

| Field | Value |
|---|---|
| Base architecture | HuggingFaceTB/SmolLM3-3B |
| Parameters | ~3.08 B (dense) |
| Precision | BF16 |
| Teacher | moonshotai/Kimi-K2-Instruct (W4A16, vLLM) |
| Distillation | Generative cross-arch / cross-tokenizer, prompt-masked CE + CWCE + TEA |
| Student-side compute | ~21 A100-GPU-hours (8× A100-40GB × ~2 h 37 min) |
| Post-training (DPO/RLHF) | None |
| License | Apache-2.0 |

Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "OdaxAI/DANTE-Mosaic-3.5B"
tok   = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Solve step by step: if a train travels 120 km in 1.5 hours, what is its average speed?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=256,
    do_sample=True, temperature=0.7, top_p=0.9,
    repetition_penalty=1.1, pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Reproducibility

All evaluation scripts are in the open GitHub repo under evaluation/:

| Script | Covers |
|---|---|
| run_lm_eval.sh | MMLU, MMLU-Pro, GSM8K, ARC-C, HellaSwag |
| run_humaneval.sh | HumanEval pass@1 |
| run_mbpp.sh | MBPP pass@1 |

After a run: python3 evaluation/parse_canonical_results.py
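
For a quick cross-check outside the shell scripts, the same harness can also be driven from Python (a sketch, assuming lm-eval v0.4.5 is installed; the task and shot settings mirror the canonical table):

```python
import lm_eval

# Reproduce the GSM8K row (8-shot, strict-match) with the pinned harness.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```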


Limitations

  • HumanEval is low (6.7 %). Algorithmic code is the clear target for the next cache iteration.
  • Not SOTA against mature instruct models on code or knowledge breadth.
  • No alignment (no DPO / RLHF). On adversarial prompts it may reproduce harmful content to roughly the same extent as its teacher.
  • No safety red-teaming was performed.
  • Benchmarks are English-centric; multilingual capability comes from the ~11 % Italian cache fraction — measure separately for your locale.

Citation

```bibtex
@misc{odaxai2026dante,
  title        = {DANTE-Mosaic-3.5B: Cross-Architecture Generative Distillation
                  from a Trillion-Parameter MoE in 21 GPU-Hours},
  author       = {{OdaxAI Research Team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/OdaxAI/DANTE-Mosaic-3.5B}},
}
```

Acknowledgements

Compute: 8× NVIDIA A100-class GPUs on a public European HPC system.
Open stack: Kimi K2, SmolLM3, HuggingFace Transformers, vLLM, lm-evaluation-harness v0.4.5, bigcode-evaluation-harness.


OdaxAI provides this model "as is" for research. You are responsible for compliance with all applicable licenses when using or citing third-party benchmark numbers.