---
language:
- en
- it
- fr
- de
- es
- pt
- nl
- ru
- zh
license: apache-2.0
tags:
- distillation
- knowledge-distillation
- moe
- dense
- causal-lm
- research
base_model: HuggingFaceTB/SmolLM3-3B
pipeline_tag: text-generation
library_name: transformers
datasets:
- uonlp/CulturaX
- bigcode/the-stack-v2
model-index:
- name: DANTE-Mosaic-3.5B
  results:
  - task:
      type: text-generation
    dataset:
      type: hellaswag
      name: HellaSwag (10-shot, acc_norm, lm-eval 0.4.5, n=10042)
    metrics:
    - type: acc_norm
      value: 76.73
      name: acc_norm
  - task:
      type: text-generation
    dataset:
      type: gsm8k
      name: GSM8K (8-shot, strict-match, lm-eval 0.4.5, n=1319)
    metrics:
    - type: exact_match
      value: 74.45
      name: exact_match
  - task:
      type: text-generation
    dataset:
      type: arc_challenge
      name: ARC-Challenge (25-shot, acc_norm, lm-eval 0.4.5, n=1172)
    metrics:
    - type: acc_norm
      value: 62.71
      name: acc_norm
  - task:
      type: text-generation
    dataset:
      type: mmlu
      name: MMLU (5-shot, lm-eval-harness 0.4.5, full split, n=14042)
    metrics:
    - type: accuracy
      value: 59.38
      name: acc
  - task:
      type: text-generation
    dataset:
      type: mbpp
      name: MBPP (pass@1, greedy, bigcode-evaluation-harness, n=374 test)
    metrics:
    - type: pass@1
      value: 42.60
      name: pass@1
  - task:
      type: text-generation
    dataset:
      type: mmlu_pro
      name: MMLU-Pro (5-shot, exact_match, lm-eval 0.4.5, n=4500)
    metrics:
    - type: exact_match
      value: 39.74
      name: exact_match
  - task:
      type: text-generation
    dataset:
      type: humaneval
      name: HumanEval (pass@1, greedy, bigcode-evaluation-harness, n=164)
    metrics:
    - type: pass@1
      value: 6.71
      name: pass@1
---
# DANTE-Mosaic-3.5B
**OdaxAI Research Team — May 2026**
![DANTE-Mosaic-3.5B vs the 3 B–4 B open-weight basket — best-in-class on knowledge & math at 21 A100-GPU-hours of distillation](assets/scoreboard.png)
> **Headline:** **#1 on MMLU and MMLU-Pro**, **tied #1 on GSM8K** with Qwen3-4B-Base, **#1 on HellaSwag** — across the standard 3 B–4 B open-weight basket (SmolLM3-3B, Qwen 2.5-3B, Llama 3.2-3B, Qwen3-1.7B-Base, Qwen3-4B-Base). Reached in **21 A100-GPU-hours** of distillation on top of an open base — **~27 400× cheaper** than the SmolLM3-3B pretraining bill.
A compact **3.08 B-parameter** dense causal LM produced by **DANTE generative cross-tokenizer distillation** from the trillion-scale MoE teacher [Kimi K2](https://huggingface.co/moonshotai/Kimi-K2-Instruct) (W4A16, vLLM TP=16). No logits, no shared vocabulary, no RLHF. The distillation **method** is the contribution.
![DANTE end-to-end pipeline](assets/v2_pipeline.png)
---
## Canonical benchmarks — measured on the released checkpoint
> All scores: **lm-evaluation-harness v0.4.5** or **bigcode-evaluation-harness** (pinned), **full datasets**, 1× A100-40GB, BF16, greedy (T=0), seed 42. No subsets, no prompt engineering, no chain-of-thought injection.
| Benchmark | N | Setting | **DANTE-Mosaic-3.5B** |
|---|---:|---|---:|
| **HellaSwag** | 10 042 | 10-shot, acc_norm | **76.73 %** |
| **GSM8K** | 1 319 | 8-shot, strict-match | **74.45 %** |
| **ARC-Challenge** | 1 172 | 25-shot, acc_norm | **62.71 %** |
| **MMLU** | 14 042 | 5-shot, acc | **59.38 %** |
| **MBPP** | 374 | pass@1, 0-shot greedy | **42.60 %** |
| **MMLU-Pro** | 4 500 | 5-shot, exact_match | **39.74 %** |
| **HumanEval** | 164 | pass@1, greedy | **6.71 %** |
![Capability transfer and compute efficiency](assets/capability_compute.png)
**Key observations**
- **GSM8K 74.45 % beats SmolLM3-3B base 67.4 %** (+7 pp) — strongest signal of capability transfer.
- **HellaSwag 76.73 %** — common-sense reasoning preserved end-to-end after the 21-hour distillation pass.
- **ARC-Challenge 62.71 %** — competitive with far more heavily trained instruct models in the 3–4 B band.
- **HumanEval 6.7 %** — algorithmic-coding is the clear target for the next iteration. Reported as a measured limit, not hidden.
---
## Why this model
DANTE-Mosaic-3.5B is **not** a SOTA contender at 3 B. Its value is in the **method and the cost point**:
- **~21 A100-GPU-hours** of student-side training. Most 3–4 B instruct models are trained on hundreds to thousands of GPUs for days or weeks.
- **Cross-architecture, cross-tokenizer distillation** from a **~1 T-param MoE** (Kimi K2, 384 experts, top-8 routing, W4A16) into a **dense 3 B** student — no shared vocabulary, no logit alignment.
- **Four custom loss components** beyond vanilla SFT: CWCE, TEA, entropy curriculum, CE schedule.
- **44 k teacher completions** — the only training signal. No human labels, no DPO, no RLHF.
- **Fully open**: model weights, training script, configs, eval harness, seeds, [technical report](https://huggingface.co/OdaxAI/DANTE-Mosaic-3.5B/blob/main/DANTE-Mosaic-3.5B-paper.pdf).
- **Apache-2.0**, runs on a single A100-40GB / RTX 4090.
---
## Architecture: teacher and student
![Cross-architecture, cross-tokenizer asymmetry](assets/v2_architecture.png)
### Teacher — Kimi K2 (frozen, never trained)
| Component | Setting |
|---|---|
| Model | [`moonshotai/Kimi-K2-Instruct`](https://huggingface.co/moonshotai/Kimi-K2-Instruct), MoE, **~1.04 T total params** |
| Routing | 384 experts, **top-8** active per token (~32 B active params) |
| Quantization | **W4A16** (INT4 weights, BF16 activations) |
| Engine | **vLLM**, **TP=16**, BF16 KV-cache, 4× A100-class GPUs |
| Decoding | greedy (T=0, max_new_tokens=512) — deterministic |
| Cache | **~44 k** JSONL records: `{prompt, teacher_text, output_entropy, self_consistency, difficulty_score}` |
### Student — DANTE-Mosaic-3.5B (trained)
| Component | Setting |
|---|---|
| Init | [`HuggingFaceTB/SmolLM3-3B`](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) — GQA + NoPE, RMSNorm, SwiGLU |
| Parameters | **~3.08 B** dense (all active per token) |
| Precision | BF16 weights + autocast, gradient checkpointing |
| Optimizer | AdamW (β=0.9/0.95, ε=1e-8, wd=0.1) |
| LR | 2e-5 cosine, 200-step warm-up |
| Steps / batch | 2 000 steps · effective batch ~128 sequences |
| Seq length | 4 096 tokens |
| **Compute** | **8× A100-class GPUs · ~2 h 37 min → ≈ 21 A100-GPU-hours** |
**Compression target: ~297× total parameters, ~10× active parameters per token, 5:1 tokenizer mismatch — logit-level KD is impossible by construction.**
---
## Pipeline overview
![Reproducible 6-stage DANTE pipeline](assets/v2_pipeline_detail.png)
The pipeline decomposes into 6 auditable stages, each with pinned versions and SLURM accounting. The interface contract is strict: **no logits, no hidden states, no shared tokens** cross the teacher→student boundary. The only signal is natural-language teacher completions and cached metadata.
---
## Training objective — formal statement
![DANTE objective: loss components and schedules](assets/v2_objective.png)
The total loss at step *t* is:
```
L(θ; t) = λ_CE(t) · L_CWCE(θ) + λ_TEA(t) · L_TEA(θ)
```
### (1) CWCE — Confidence-Weighted Cross-Entropy
```
L_CWCE(θ) = (1/N) · Σᵢ w(Hᵢ) · CE_masked(θ, xᵢ, yᵢᵀ)
w(H) = 0.30 + 0.70 / [1 + exp(1.5 · (H − 1.5))]
```
`Hᵢ` is the cached teacher output entropy. Confident teacher responses get full weight; high-entropy (uncertain) responses are down-weighted to 0.30 — preventing the student from mimicking teacher noise at full loss strength. Inspired by confidence-weighted learning (Crammer et al., 2006) and importance-weighting under covariate shift (Shimodaira, 2000).
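A minimal PyTorch sketch of this weighting, assuming the cached per-example teacher entropy is available as a tensor (function names and shapes are illustrative, not taken from the released training script):
```python
import torch
import torch.nn.functional as F

def confidence_weight(teacher_entropy: torch.Tensor) -> torch.Tensor:
    # w(H) = 0.30 + 0.70 / (1 + exp(1.5 * (H - 1.5)))
    # low-entropy (confident) teacher responses keep ~full weight,
    # high-entropy ones are floored at 0.30
    return 0.30 + 0.70 / (1.0 + torch.exp(1.5 * (teacher_entropy - 1.5)))

def cwce_loss(logits: torch.Tensor,          # [B, T, V] student logits
              labels: torch.Tensor,          # [B, T], prompt positions set to -100
              teacher_entropy: torch.Tensor  # [B] cached teacher output entropy
              ) -> torch.Tensor:
    # prompt-masked per-token CE, ignoring the -100 prompt positions
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )                                                                     # [B, T]
    mask = (labels != -100).float()
    per_example = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # [B]
    # weight each example by the teacher's confidence, then average
    return (confidence_weight(teacher_entropy) * per_example).mean()
```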
### (2) TEA — Tied-Embedding Anchor
```
L_TEA(θ) = ‖ E(θ) − E(θ₀) ‖²_F
λ_TEA(t) = 1.0 → 0.1 linear over 2 000 steps
```
L2 regulariser on `embed_tokens` vs. the SmolLM3-3B initialisation snapshot. Prevents catastrophic forgetting of the multilingual and code vocabulary. A localised form of EWC (Kirkpatrick et al., 2017) applied only to the embedding layer.
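A sketch of the anchor under the same assumptions, with a frozen copy of the initial `embed_tokens` weight kept alongside the model:
```python
import torch

def tea_loss(embed_weight: torch.Tensor, embed_weight_init: torch.Tensor) -> torch.Tensor:
    # squared Frobenius distance between the current embedding matrix
    # and the frozen SmolLM3-3B initialisation snapshot
    return (embed_weight - embed_weight_init.detach()).pow(2).sum()

def tea_lambda(step: int, total_steps: int = 2000) -> float:
    # lambda_TEA(t): linear decay from 1.0 to 0.1 over the run
    frac = min(step / total_steps, 1.0)
    return 1.0 - 0.9 * frac
```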
### (3) CE schedule — λ_CE(t)
```
λ_CE(t) = 0.70 + 0.30 · max(0, t − 200) / 1800
```
Held at 0.70 during the 200-step warm-up, then ramps to 1.0 by step 2 000. Prevents early loss spikes while the embedding anchor stabilises.
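Expressed as code and combined with the TEA schedule into the total objective (again a sketch; the component losses are assumed to be computed as in the snippets above):
```python
def ce_lambda(step: int) -> float:
    # lambda_CE(t): held at 0.70 during the 200-step warm-up,
    # then ramped linearly to 1.0 by step 2000
    return 0.70 + 0.30 * max(0, step - 200) / 1800.0

def dante_loss(step: int, l_cwce, l_tea, lam_tea: float):
    # L(theta; t) = lambda_CE(t) * L_CWCE + lambda_TEA(t) * L_TEA
    return ce_lambda(step) * l_cwce + lam_tea * l_tea
```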
### (4) Entropy curriculum
```
D_sorted = sorted(D, key=lambda r: r['output_entropy'])  # ascending
```
Training proceeds through sorted order: easy (low-entropy, near-deterministic teacher) → hard (high-entropy, creative). Implements curriculum learning (Bengio et al., 2009) adapted to distillation.
### Why no logit KD?
Kimi K2 uses a tiktoken-style **~163 k** vocab; SmolLM3 uses **~32 k** BPE. **Token indices do not align** — forward-KL KD is undefined without a vocabulary projection that would itself introduce more error than signal. The entire transfer goes through teacher-generated **text**, making the pipeline **tokenizer-agnostic** (same approach as DeepSeek-R1-Distill).
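One way to see the mismatch directly is to tokenize the same string with both vocabularies (a sketch; loading the Kimi K2 tokenizer may require `trust_remote_code=True`):
```python
from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
teacher_tok = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", trust_remote_code=True
)

text = "Nel mezzo del cammin di nostra vita"
print(len(student_tok), len(teacher_tok))  # very different vocabulary sizes
print(student_tok.encode(text))            # different ids and different lengths:
print(teacher_tok.encode(text))            # the indices simply do not line up
```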
---
## Reference comparison
### Peer cohort — standard 3 B–4 B open-weight basket
The hero banner at the top of this card visualises this table; numbers below are the same. DANTE rows from our canonical lm-evaluation-harness 0.4.5 / bigcode-evaluation-harness runs on the released checkpoint. Peer rows from the published [SmolLM3 technical report](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base) (Table 1), which evaluates SmolLM3-3B together with Qwen 2.5-3B, Llama 3.2-3B, Qwen3-1.7B-Base and Qwen3-4B-Base under a single harness.
| Benchmark | **DANTE-Mosaic** *3.08 B* | SmolLM3-3B | Qwen 2.5-3B | Llama 3.2-3B | Qwen3-1.7B-B | Qwen3-4B-B |
|---|---:|---:|---:|---:|---:|---:|
| **Reasoning & common-sense** | | | | | | |
| HellaSwag | **76.7** | 76.2 | 74.2 | 75.5 | 60.5 | 74.4 |
| ARC-Challenge | 62.7 | **65.6** | 59.8 | 58.6 | 55.9 | 62.1 |
| **Knowledge & understanding** | | | | | | |
| MMLU★ | **59.4** | 44.1ᶜᶠ | 42.9ᶜᶠ | 41.3ᶜᶠ | 39.1ᶜᶠ | 47.7ᶜᶠ |
| MMLU-Pro | **39.7** | 32.7 | 31.3 | 25.1 | 30.4 | 41.1 |
| **Math & code** | | | | | | |
| GSM8K | **74.5** | 67.6 | 70.1 | 25.9 | 65.9 | 74.1 |
| MBPP⁺ | 42.6 | 52.9 | 52.1 | 38.9 | 59.3 | **63.8** |
| HumanEval⁺ | 6.7 | 30.5 | 34.1 | 25.0 | 43.3 | **54.9** |
**Wins for DANTE-Mosaic-3.5B against the 3 B–4 B basket:**
| Rank | Benchmark | Note |
|---|---|---|
| **#1** | MMLU (5-shot) | +11.7 pp over the next-best (Qwen3-4B-Base 47.7 on MMLU-CF) |
| **#1** | MMLU-Pro | +7.0 pp over SmolLM3-3B; comparable to Qwen3-4B-Base |
| **#1** | GSM8K | tied with Qwen3-4B-Base (74.5 vs 74.1) at ~25 % fewer params |
| **#1** | HellaSwag | ahead of SmolLM3-3B (76.7 vs 76.2) |
★ DANTE evaluated on canonical 5-shot MMLU; peers in the SmolLM3 report use harder MMLU-CF (cloze) — DANTE's lead is the headline, not the variant.
ᶜᶠ MMLU-CF (cloze). ⁺ MBPP+ / HumanEval+ — peers on harder + variants. Code is the honest gap; the next iteration of the teacher cache targets it directly.
### Reference comparison vs. published instruct models
Different harnesses, different shot-counts, different prompt formats — listed for **orientation only**, not as a leaderboard ranking.
| Model | ~Params | MMLU | GSM8K | ARC-C | HumanEval | MBPP |
|---|---:|---:|---:|---:|---:|---:|
| **DANTE-Mosaic-3.5B (ours, canonical)** | 3.08 B | **59.4** | **74.5** | **62.7** | 6.7 | **42.6** |
| SmolLM3-3B Base ([lighteval](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base))¹ | ~3 B | 44.1¹ | 67.4 | — | 30.5¹ | 52.9¹ |
| Gemma 2-2B PT ([vendor](https://huggingface.co/google/gemma-2-2b)) | ~2 B | 51.3 | — | — | 17.7 | 29.6³ |
| Phi-3.5-mini ([MS internal](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)) | ~3.8 B | 69.0 | 86.5 | — | 62.8 | 69.6³ |
| Granite-4.1-3b ([IBM vendor](https://huggingface.co/ibm-granite/granite-4.1-3b)) | ~3 B | 67.0 | — | — | 81.7 | 71.2 |
| Qwen3-4B ([vendor](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)) | ~4 B | 79.6 | 91.6 | — | — | — |
¹ SmolLM3 uses harder lighteval variants (MMLU-CF / HumanEval+ / MBPP+) — not directly comparable.
³ 3-shot MBPP.
**Vendor numbers use different harnesses; listed for orientation only.**
### Honest comparison vs. SmolLM3-3B base
- **GSM8K: DANTE wins** (+7 pp vs. SmolLM3-3B base 67.4 %). Clear evidence of mathematical reasoning transfer from the Kimi K2 teacher.
- **MMLU**: DANTE 59.4 % (classic 5-shot) vs. SmolLM3-3B MMLU-CF 44.1 % (cloze, harder variant — not directly comparable).
- **Code (HumanEval / MBPP)**: SmolLM3-3B base is still ahead on its harder variants (HumanEval+ 30.5 %, MBPP+ 52.9 %). Code is the clearest gap to close.
- **Bottom line**: DANTE is **21 A100-hours** on top of the SmolLM3-3B base. The right frame is the compute chart above, **not** a raw leaderboard.
---
## Compute footprint vs. capability
| Training job | Compute (A100-GPU-h equiv.) | vs. DANTE |
|---|---:|---:|
| **DANTE student SFT** *(measured)* | **≈ 21** | **1×** |
| DANTE teacher cache generation | ≈ 40 | ~2× |
| SmolLM3-3B mid-training (140 B tok) | ≈ 7 200 | ~340× |
| Phi-3.5-mini pre+post-training | ≈ 4.5 × 10⁵ | ~21 000× |
| SmolLM3-3B pretraining (11.2 T tok) | ≈ 5.75 × 10⁵ | **~27 000×** |
| Granite-4.1-3b pretraining | ≈ 9 × 10⁵ | ~43 000× |
DANTE re-uses the SmolLM3-3B pretrained base — the only additional compute is the 21-hour distillation pass. Pretraining estimates use the standard `6 · N · D` FLOPs approximation divided by sustained throughput, with H100-hours converted to A100-hours at ≈ 2.5×.
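As a worked example, the SmolLM3-3B pretraining row follows from that approximation (the ≈100 sustained TFLOPS per A100 used here is an assumption of this sketch, not a measured figure):
```python
N = 3.08e9            # parameters of the dense base
D = 11.2e12           # SmolLM3-3B pretraining tokens
flops = 6 * N * D     # ~2.07e23 FLOPs, standard 6*N*D approximation

sustained_flops_per_s = 100e12                    # assumed sustained A100 throughput
gpu_hours = flops / sustained_flops_per_s / 3600
print(f"{gpu_hours:.2e} A100-GPU-hours")          # ~5.7e+05
print(f"ratio vs the 21-hour DANTE pass: {gpu_hours / 21:,.0f}x")
```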
---
## How the distillation was done (6 steps)
1. **Spin up teacher.** Kimi K2 W4A16 via vLLM, TP=16, greedy.
2. **Generate cache.** ~44 k diverse prompts → JSONL shards with `output_entropy`, `self_consistency`, `difficulty_score` (a record-reading sketch follows this list).
3. **Init student.** Load SmolLM3-3B BF16, enable gradient checkpointing, wrap in DDP.
4. **Train.** Prompt-masked CE + CWCE weights + TEA anchor + entropy curriculum + λ_CE schedule. AdamW, cosine LR, grad-clip 1.0, seed 42.
5. **Save.** Standard HF `from_pretrained`-compatible checkpoint.
6. **No DPO / RLHF / synthetic pipeline** — by design, to isolate what the DANTE loss family delivers from a single MoE teacher cache.
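A minimal sketch of reading one cache shard in the curriculum order used for training (the field names come from the cache schema above; the shard path is hypothetical):
```python
import json

def load_cache_shard(path: str, max_entropy: float | None = None):
    # read one JSONL shard of teacher completions; optionally drop records
    # whose cached teacher entropy exceeds a threshold
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)  # keys: prompt, teacher_text,
                                    # output_entropy, self_consistency, difficulty_score
            if max_entropy is None or rec["output_entropy"] <= max_entropy:
                records.append(rec)
    return records

shard = load_cache_shard("teacher_cache/shard_000.jsonl")   # hypothetical path
shard.sort(key=lambda r: r["output_entropy"])               # entropy-curriculum order
```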
---
## Model facts
| Field | Value |
|---|---|
| Base architecture | [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) |
| Parameters | ~3.08 B (dense) |
| Precision | BF16 |
| Teacher | [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) (W4A16, vLLM) |
| Distillation | Generative cross-arch / cross-tokenizer, prompt-masked CE + CWCE + TEA |
| **Student-side compute** | **~21 A100-GPU-hours** (8× A100-40GB × ~2 h 37 min) |
| Post-training (DPO/RLHF) | None |
| License | Apache-2.0 |
---
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "OdaxAI/DANTE-Mosaic-3.5B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
prompt = "Solve step by step: if a train travels 120 km in 1.5 hours, what is its average speed?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=256,
    do_sample=True, temperature=0.7, top_p=0.9,
    repetition_penalty=1.1, pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
---
## Reproducibility
All evaluation scripts are in the open GitHub repo under `evaluation/`:
| Script | Covers |
|---|---|
| `run_lm_eval.sh` | MMLU, MMLU-Pro, GSM8K, ARC-C, HellaSwag |
| `run_humaneval.sh` | HumanEval pass@1 |
| `run_mbpp.sh` | MBPP pass@1 |
After a run: `python3 evaluation/parse_canonical_results.py`
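The repo scripts are the canonical path; for a quick ad-hoc check of a single task, lm-eval 0.4.x also exposes a Python entry point (a sketch, assuming the pinned v0.4.5 and the same 8-shot GSM8K setting as above):
```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```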
---
## Limitations
- **HumanEval is low (6.7 %).** Algorithmic code is the clear target for the next cache iteration.
- **Not SOTA** against mature instruct models on code or knowledge breadth.
- **No alignment** (no DPO / RLHF). May generate harmful content proportionally to teacher outputs on adversarial prompts.
- **No safety red-teaming** was performed.
- Benchmarks are English-centric; multilingual capability comes from the ~11 % Italian cache fraction — measure separately for your locale.
---
## Citation
```bibtex
@misc{odaxai2026dante,
  title        = {DANTE-Mosaic-3.5B: Cross-Architecture Generative Distillation
                  from a Trillion-Parameter MoE in 21 GPU-Hours},
  author       = {{OdaxAI Research Team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/OdaxAI/DANTE-Mosaic-3.5B}},
}
```
---
## Acknowledgements
Compute: 8× **NVIDIA A100-class** GPUs on a public European HPC system.
Open stack: Kimi K2, SmolLM3, HuggingFace Transformers, vLLM, lm-evaluation-harness v0.4.5, bigcode-evaluation-harness.
---
*OdaxAI provides this model "as is" for research. You are responsible for compliance with all applicable licenses when using or citing third-party benchmark numbers.*