---
language:
- en
- it
- fr
- de
- es
- pt
- nl
- ru
- zh
license: apache-2.0
tags:
- distillation
- knowledge-distillation
- moe
- dense
- causal-lm
- research
base_model: HuggingFaceTB/SmolLM3-3B
pipeline_tag: text-generation
library_name: transformers
datasets:
- uonlp/CulturaX
- bigcode/the-stack-v2
model-index:
- name: DANTE-Mosaic-3.5B
  results:
  - task:
      type: text-generation
    dataset:
      type: hellaswag
      name: HellaSwag (10-shot, acc_norm, lm-eval 0.4.5, n=10042)
    metrics:
    - type: acc_norm
      value: 76.73
      name: acc_norm
  - task:
      type: text-generation
    dataset:
      type: gsm8k
      name: GSM8K (8-shot, strict-match, lm-eval 0.4.5, n=1319)
    metrics:
    - type: exact_match
      value: 74.45
      name: exact_match
  - task:
      type: text-generation
    dataset:
      type: arc_challenge
      name: ARC-Challenge (25-shot, acc_norm, lm-eval 0.4.5, n=1172)
    metrics:
    - type: acc_norm
      value: 62.71
      name: acc_norm
  - task:
      type: text-generation
    dataset:
      type: mmlu
      name: MMLU (5-shot, lm-eval-harness 0.4.5, full split, n=14042)
    metrics:
    - type: accuracy
      value: 59.38
      name: acc
  - task:
      type: text-generation
    dataset:
      type: mbpp
      name: MBPP (pass@1, greedy, bigcode-evaluation-harness, n=374 test)
    metrics:
    - type: pass@1
      value: 42.60
      name: pass@1
  - task:
      type: text-generation
    dataset:
      type: mmlu_pro
      name: MMLU-Pro (5-shot, exact_match, lm-eval 0.4.5, n=4500)
    metrics:
    - type: exact_match
      value: 39.74
      name: exact_match
  - task:
      type: text-generation
    dataset:
      type: humaneval
      name: HumanEval (pass@1, greedy, bigcode-evaluation-harness, n=164)
    metrics:
    - type: pass@1
      value: 6.71
      name: pass@1
---

# DANTE-Mosaic-3.5B

**OdaxAI Research Team — May 2026**

![DANTE-Mosaic-3.5B vs the 3 B–4 B open-weight basket — best-in-class on knowledge & math at 21 A100-GPU-hours of distillation](assets/scoreboard.png)

> **Headline:** **#1 on MMLU**, **tied #1 on GSM8K** with Qwen3-4B-Base, **#1 on HellaSwag**, and **second only to Qwen3-4B-Base on MMLU-Pro** — across the standard 3 B–4 B open-weight basket (SmolLM3-3B, Qwen 2.5-3B, Llama 3.2-3B, Qwen3-1.7B-Base, Qwen3-4B-Base). Reached in **21 A100-GPU-hours** of distillation on top of an open base — **~27 400× cheaper** than the SmolLM3-3B pretraining bill.

A compact **3.08 B-parameter** dense causal LM produced by **DANTE generative cross-tokenizer distillation** from the trillion-scale MoE teacher [Kimi K2](https://huggingface.co/moonshotai/Kimi-K2-Instruct) (W4A16, vLLM TP=16). No logits, no shared vocabulary, no RLHF. The distillation **method** is the contribution.

![DANTE end-to-end pipeline](assets/v2_pipeline.png)

---

## Canonical benchmarks — measured on the released checkpoint

> All scores: **lm-evaluation-harness v0.4.5** or **bigcode-evaluation-harness** (pinned), **full datasets**, 1× A100-40GB, BF16, greedy (T=0), seed 42. No subsets, no prompt engineering, no chain-of-thought injection.

| Benchmark | N | Setting | **DANTE-Mosaic-3.5B** |
|---|---:|---|---:|
| **HellaSwag** | 10 042 | 10-shot, acc_norm | **76.73 %** |
| **GSM8K** | 1 319 | 8-shot, strict-match | **74.45 %** |
| **ARC-Challenge** | 1 172 | 25-shot, acc_norm | **62.71 %** |
| **MMLU** | 14 042 | 5-shot, acc | **59.38 %** |
| **MBPP** | 374 | pass@1, 0-shot greedy | **42.60 %** |
| **MMLU-Pro** | 4 500 | 5-shot, exact_match | **39.74 %** |
| **HumanEval** | 164 | pass@1, greedy | **6.71 %** |
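To re-run a single row of this table without the repo's shell wrappers, a minimal sketch using the lm-evaluation-harness Python API is shown below. The canonical entry point remains `evaluation/run_lm_eval.sh`; the batch size here is an illustrative assumption, not the pinned setting.

```python
# Minimal sketch: reproduce the GSM8K row (8-shot, strict-match) with lm-eval 0.4.5.
# Illustrative only — evaluation/run_lm_eval.sh holds the pinned, canonical settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OdaxAI/DANTE-Mosaic-3.5B,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,  # assumption; adjust to fit a single A100-40GB
)
print(results["results"]["gsm8k"])
```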
![Capability transfer and compute efficiency](assets/capability_compute.png)

**Key observations**

- **GSM8K 74.45 % beats SmolLM3-3B base 67.4 %** (+7 pp) — strongest signal of capability transfer.
- **HellaSwag 76.73 %** — common-sense reasoning preserved end-to-end after the 21-GPU-hour distillation pass.
- **ARC-Challenge 62.71 %** — competitive with much heavier instruct models in the 3–4 B band.
- **HumanEval 6.7 %** — algorithmic coding is the clear target for the next iteration. Reported as a measured limit, not hidden.

---

## Why this model

DANTE-Mosaic-3.5B is **not** a SOTA contender at 3 B. Its value is in the **method and the cost point**:

- **~21 A100-GPU-hours** of student-side training. Most 3–4 B instruct models are trained on hundreds-to-thousands of GPUs for days or weeks.
- **Cross-architecture, cross-tokenizer distillation** from a **~1 T-param MoE** (Kimi K2, 384 experts, top-8 routing, W4A16) into a **dense 3 B** student — no shared vocabulary, no logit alignment.
- **Four custom training-objective components** beyond vanilla SFT: CWCE, TEA, an entropy curriculum, and a CE schedule.
- **44 k teacher completions** — the only training signal. No human labels, no DPO, no RLHF.
- **Fully open**: model weights, training script, configs, eval harness, seeds, [technical report](https://huggingface.co/OdaxAI/DANTE-Mosaic-3.5B/blob/main/DANTE-Mosaic-3.5B-paper.pdf).
- **Apache-2.0**, runs on a single A100-40GB / RTX 4090.

---

## Architecture: teacher and student

![Cross-architecture, cross-tokenizer asymmetry](assets/v2_architecture.png)

### Teacher — Kimi K2 (frozen, never trained)

| Component | Setting |
|---|---|
| Model | [`moonshotai/Kimi-K2-Instruct`](https://huggingface.co/moonshotai/Kimi-K2-Instruct), MoE, **~1.04 T total params** |
| Routing | 384 experts, **top-8** active per token (~32 B active params) |
| Quantization | **W4A16** (INT4 weights, BF16 activations) |
| Engine | **vLLM**, **TP=16**, BF16 KV-cache, 4× A100-class GPUs |
| Decoding | greedy (T=0, max_new_tokens=512) — deterministic |
| Cache | **~44 k** JSONL records: `{prompt, teacher_text, output_entropy, self_consistency, difficulty_score}` |

### Student — DANTE-Mosaic-3.5B (trained)

| Component | Setting |
|---|---|
| Init | [`HuggingFaceTB/SmolLM3-3B`](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) — GQA + NoPE, RMSNorm, SwiGLU |
| Parameters | **~3.08 B** dense (all active per token) |
| Precision | BF16 weights + autocast, gradient checkpointing |
| Optimizer | AdamW (β=0.9/0.95, ε=1e-8, wd=0.1) |
| LR | 2e-5 cosine, 200-step warm-up |
| Steps / batch | 2 000 steps · effective batch ~128 sequences |
| Seq length | 4 096 tokens |
| **Compute** | **8× A100-class GPUs · ~2 h 37 min → ≈ 21 A100-GPU-hours** |

**Compression ratio: ~297× in total parameters, ~10× in active parameters per token, 5:1 tokenizer mismatch — logit-level KD is impossible by construction.**

---

## Pipeline overview

![Reproducible 6-stage DANTE pipeline](assets/v2_pipeline_detail.png)

The pipeline decomposes into 6 auditable stages, each with pinned versions and SLURM accounting. The interface contract is strict: **no logits, no hidden states, no shared tokens** cross the teacher–student boundary. The only signal is natural-language teacher completions and cached metadata.

---

## Training objective — formal statement

![DANTE objective: loss components and schedules](assets/v2_objective.png)

The total loss at step *t* is:

```
L(θ; t) = λ_CE(t) · L_CWCE(θ) + λ_TEA(t) · L_TEA(θ)
```

### (1) CWCE — Confidence-Weighted Cross-Entropy

```
L_CWCE(θ) = (1/N) · Σᵢ w(Hᵢ) · CE_masked(θ, xᵢ, yᵢᵀ)

w(H) = 0.30 + 0.70 / [1 + exp(1.5 · (H − 1.5))]
```

`Hᵢ` is the cached teacher output entropy. Confident (low-entropy) teacher responses receive near-full weight; high-entropy (uncertain) responses are down-weighted toward the 0.30 floor — preventing the student from mimicking teacher noise at full loss strength. Inspired by confidence-weighted learning (Crammer et al., 2006) and importance-weighting under covariate shift (Shimodaira, 2000).
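A minimal PyTorch sketch of this weighting, assuming the per-example cross-entropy has already been masked to the teacher-completion tokens; the names `ce_per_example` and `teacher_entropy` are illustrative, not the training script's API:

```python
import torch

def cwce_weight(teacher_entropy: torch.Tensor) -> torch.Tensor:
    # w(H) = 0.30 + 0.70 / (1 + exp(1.5 * (H - 1.5)))
    # low-entropy (confident) teacher records -> weight near 1.0,
    # high-entropy records -> weight approaches the 0.30 floor
    return 0.30 + 0.70 / (1.0 + torch.exp(1.5 * (teacher_entropy - 1.5)))

def cwce_loss(ce_per_example: torch.Tensor, teacher_entropy: torch.Tensor) -> torch.Tensor:
    # ce_per_example: prompt-masked CE per cached record, shape (N,)
    # teacher_entropy: cached `output_entropy` per record, shape (N,)
    return (cwce_weight(teacher_entropy) * ce_per_example).mean()
```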
### (2) TEA — Tied-Embedding Anchor

```
L_TEA(θ) = ‖ E(θ) − E(θ₀) ‖²_F

λ_TEA(t) = 1.0 → 0.1 linear over 2 000 steps
```

L2 regulariser on `embed_tokens` vs. the SmolLM3-3B initialisation snapshot. Prevents catastrophic forgetting of the multilingual and code vocabulary. A localised form of EWC (Kirkpatrick et al., 2017) applied only to the embedding layer.

### (3) CE schedule — λ_CE(t)

```
λ_CE(t) = 0.70 + 0.30 · max(0, t − 200) / 1800
```

Held at 0.70 during the 200-step warm-up, then ramps to 1.0 by step 2 000. Prevents early loss spikes while the embedding anchor stabilises.

### (4) Entropy curriculum

```
D_sorted = sorted(D, key=lambda r: r['output_entropy'])  # ascending
```

Training proceeds through the sorted order: easy (low-entropy, near-deterministic teacher) → hard (high-entropy, creative). Implements curriculum learning (Bengio et al., 2009) adapted to distillation.
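Components (2)–(4) reduce to a few lines. The sketch below is an illustration under the definitions above; `embed` / `embed_init` stand in for the live and snapshot `embed_tokens` weights, and none of the names are taken from the training script:

```python
import torch

TOTAL_STEPS, WARMUP_STEPS = 2_000, 200

def lambda_ce(t: int) -> float:
    # held at 0.70 through the 200-step warm-up, linear ramp to 1.0 by step 2 000
    return 0.70 + 0.30 * max(0, t - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)

def lambda_tea(t: int) -> float:
    # 1.0 -> 0.1, linear over the full 2 000-step run
    return 1.0 - 0.9 * min(t, TOTAL_STEPS) / TOTAL_STEPS

def tea_loss(embed: torch.Tensor, embed_init: torch.Tensor) -> torch.Tensor:
    # squared Frobenius distance to the frozen SmolLM3-3B embedding snapshot
    return torch.sum((embed - embed_init) ** 2)

def curriculum_order(cache: list[dict]) -> list[dict]:
    # entropy curriculum: near-deterministic records first, creative ones last
    return sorted(cache, key=lambda r: r["output_entropy"])

# per optimizer step: loss = lambda_ce(t) * L_CWCE + lambda_tea(t) * tea_loss(embed, embed_init)
```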
### Why no logit KD?

Kimi K2 uses a tiktoken-style **~163 k** vocab; SmolLM3 uses **~32 k** BPE. **Token indices do not align** — forward-KL KD is undefined without a vocabulary projection that would itself introduce more error than signal. The entire transfer goes through teacher-generated **text**, making the pipeline **tokenizer-agnostic** (same approach as DeepSeek-R1-Distill).

---

## Reference comparison

### Peer cohort — standard 3 B–4 B open-weight basket

The hero banner at the top of this card visualises this table; the numbers below are the same. DANTE rows come from our canonical lm-evaluation-harness 0.4.5 / bigcode-evaluation-harness runs on the released checkpoint. Peer rows come from the published [SmolLM3 technical report](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base) (Table 1), which evaluates SmolLM3-3B together with Qwen 2.5-3B, Llama 3.2-3B, Qwen3-1.7B-Base and Qwen3-4B-Base under a single harness.

| Benchmark | **DANTE-Mosaic** *3.08 B* | SmolLM3-3B | Qwen 2.5-3B | Llama 3.2-3B | Qwen3-1.7B-B | Qwen3-4B-B |
|---|---:|---:|---:|---:|---:|---:|
| **Reasoning & common-sense** | | | | | | |
| HellaSwag | **76.7** | 76.2 | 74.2 | 75.5 | 60.5 | 74.4 |
| ARC-Challenge | 62.7 | **65.6** | 59.8 | 58.6 | 55.9 | 62.1 |
| **Knowledge & understanding** | | | | | | |
| MMLU★ | **59.4** | 44.1ᶜᶠ | 42.9ᶜᶠ | 41.3ᶜᶠ | 39.1ᶜᶠ | 47.7ᶜᶠ |
| MMLU-Pro | 39.7 | 32.7 | 31.3 | 25.1 | 30.4 | **41.1** |
| **Math & code** | | | | | | |
| GSM8K | **74.5** | 67.6 | 70.1 | 25.9 | 65.9 | 74.1 |
| MBPP⁺ | 42.6 | 52.9 | 52.1 | 38.9 | 59.3 | **63.8** |
| HumanEval⁺ | 6.7 | 30.5 | 34.1 | 25.0 | 43.3 | **54.9** |

**Headline placements for DANTE-Mosaic-3.5B against the 3 B–4 B basket:**

| Rank | Benchmark | Note |
|---|---|---|
| **#1** | MMLU (5-shot) | +11.7 pp over the next-best (Qwen3-4B-Base 47.7 on MMLU-CF) |
| **#2** | MMLU-Pro | +7.0 pp over SmolLM3-3B; 1.4 pp behind Qwen3-4B-Base |
| **#1** | GSM8K | tied with Qwen3-4B-Base (74.5 vs 74.1) at ~25 % fewer params |
| **#1** | HellaSwag | ahead of SmolLM3-3B (76.7 vs 76.2) |

★ DANTE is evaluated on canonical 5-shot MMLU; peers in the SmolLM3 report are scored on the harder MMLU-CF (cloze) variant — read the MMLU gap with that mismatch in mind.
ᶜᶠ MMLU-CF (cloze). ⁺ MBPP+ / HumanEval+ — peers on harder + variants.

Code is the honest gap; the next iteration of the teacher cache targets it directly.

### Reference comparison vs. published instruct models

Different harnesses, different shot-counts, different prompt formats — listed for **orientation only**, not as a leaderboard ranking.

| Model | ~Params | MMLU | GSM8K | ARC-C | HumanEval | MBPP |
|---|---:|---:|---:|---:|---:|---:|
| **DANTE-Mosaic-3.5B (ours, canonical)** | 3.08 B | **59.4** | **74.5** | **62.7** | 6.7 | **42.6** |
| SmolLM3-3B Base ([lighteval](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base))¹ | ~3 B | 44.1¹ | 67.4 | — | 30.5¹ | 52.9¹ |
| Gemma 2-2B PT ([vendor](https://huggingface.co/google/gemma-2-2b)) | ~2 B | 51.3 | — | — | 17.7 | 29.6² |
| Phi-3.5-mini ([MS internal](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)) | ~3.8 B | 69.0 | 86.5 | — | 62.8 | 69.6² |
| Granite-4.1-3b ([IBM vendor](https://huggingface.co/ibm-granite/granite-4.1-3b)) | ~3 B | 67.0 | — | — | 81.7 | 71.2 |
| Qwen3-4B ([vendor](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)) | ~4 B | 79.6 | 91.6 | — | — | — |

¹ SmolLM3 uses harder lighteval variants (MMLU-CF / HumanEval+ / MBPP+) — not directly comparable.
² 3-shot MBPP.

**Vendor numbers use different harnesses; listed for orientation only.**

### Honest comparison vs. SmolLM3-3B base

- **GSM8K: DANTE wins** (+7 pp vs. SmolLM3-3B base 67.4 %). Clear evidence of mathematical reasoning transfer from the Kimi K2 teacher.
- **MMLU**: DANTE 59.4 % (classic 5-shot) vs. SmolLM3-3B MMLU-CF 44.1 % (cloze, harder variant — not directly comparable).
- **Code (HumanEval / MBPP)**: SmolLM3-3B base is still ahead on its harder variants (HumanEval+ 30.5 %, MBPP+ 52.9 %). Code is the clearest gap to close.
- **Bottom line**: DANTE adds **21 A100-GPU-hours** on top of the SmolLM3-3B base. The right frame is the compute chart above, **not** a raw leaderboard.

---

## Compute footprint vs. capability

| Training job | Compute (A100-GPU-h equiv.) | vs. DANTE |
|---|---:|---:|
| **DANTE student SFT** *(measured)* | **≈ 21** | **1×** |
| DANTE teacher cache generation | ≈ 40 | ~2× |
| SmolLM3-3B mid-training (140 B tok) | ≈ 7 200 | ~340× |
| Phi-3.5-mini pre+post-training | ≈ 4.5 × 10⁵ | ~21 000× |
| SmolLM3-3B pretraining (11.2 T tok) | ≈ 5.75 × 10⁵ | **~27 000×** |
| Granite-4.1-3b pretraining | ≈ 9 × 10⁵ | ~43 000× |

DANTE re-uses the SmolLM3-3B pretrained base — the only additional compute is the ≈ 21-GPU-hour distillation pass. Pretraining estimates use the standard `6 · N · D` FLOPs approximation divided by assumed sustained per-GPU throughput, with H100-GPU-hours converted to A100-GPU-hours at ≈ 2.5×.

---

## How the distillation was done (6 steps)

1. **Spin up teacher.** Kimi K2 W4A16 via vLLM, TP=16, greedy.
2. **Generate cache.** ~44 k diverse prompts → JSONL shards with `output_entropy`, `self_consistency`, `difficulty_score`.
3. **Init student.** Load SmolLM3-3B BF16, enable gradient checkpointing, wrap in DDP.
4. **Train.** Prompt-masked CE + CWCE weights + TEA anchor + entropy curriculum + λ_CE schedule. AdamW, cosine LR, grad-clip 1.0, seed 42. A minimal cache-to-labels sketch follows this list.
5. **Save.** Standard HF `from_pretrained`-compatible checkpoint.
6. **No DPO / RLHF / synthetic pipeline** — by design, to isolate what the DANTE loss family delivers from a single MoE teacher cache.
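As a concrete illustration of steps 2–4, the sketch below turns one cached record into a prompt-masked training example. The field names follow the cache schema above; the tokenizer calls, file path and `IGNORE_INDEX` convention are standard Hugging Face practice, not the exact training-script code:

```python
import json
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # label value excluded from the CE loss
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

def build_example(record: dict, max_len: int = 4096) -> dict:
    # record: one JSONL line from the teacher cache,
    # e.g. {"prompt": ..., "teacher_text": ..., "output_entropy": ..., ...}
    prompt_ids = tok(record["prompt"], add_special_tokens=False)["input_ids"]
    target_ids = tok(record["teacher_text"], add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + target_ids)[:max_len]
    # prompt-masked CE: loss is computed only on teacher-completion tokens
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids)[:max_len]
    return {
        "input_ids": input_ids,
        "labels": labels,
        "output_entropy": record["output_entropy"],  # feeds the CWCE weight
    }

with open("teacher_cache.jsonl") as f:  # illustrative path
    examples = [build_example(json.loads(line)) for line in f]
```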
---

## Model facts

| Field | Value |
|---|---|
| Base architecture | [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) |
| Parameters | ~3.08 B (dense) |
| Precision | BF16 |
| Teacher | [moonshotai/Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) (W4A16, vLLM) |
| Distillation | Generative cross-arch / cross-tokenizer, prompt-masked CE + CWCE + TEA |
| **Student-side compute** | **~21 A100-GPU-hours** (8× A100-40GB × ~2 h 37 min) |
| Post-training (DPO/RLHF) | None |
| License | Apache-2.0 |

---

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "OdaxAI/DANTE-Mosaic-3.5B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Solve step by step: if a train travels 120 km in 1.5 hours, what is its average speed?"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

---

## Reproducibility

All evaluation scripts are in the open GitHub repo under `evaluation/`:

| Script | Covers |
|---|---|
| `run_lm_eval.sh` | MMLU, MMLU-Pro, GSM8K, ARC-C, HellaSwag |
| `run_humaneval.sh` | HumanEval pass@1 |
| `run_mbpp.sh` | MBPP pass@1 |

After a run: `python3 evaluation/parse_canonical_results.py`

---

## Limitations

- **HumanEval is low (6.7 %).** Algorithmic code is the clear target for the next cache iteration.
- **Not SOTA** against mature instruct models on code or knowledge breadth.
- **No alignment** (no DPO / RLHF). On adversarial prompts it may reproduce whatever harmful content the teacher completions contain.
- **No safety red-teaming** was performed.
- Benchmarks are English-centric; non-English capability rests on the ~11 % Italian fraction of the teacher cache — measure separately for your locale.

---

## Citation

```bibtex
@misc{odaxai2026dante,
  title        = {DANTE-Mosaic-3.5B: Cross-Architecture Generative Distillation from a Trillion-Parameter MoE in 21 GPU-Hours},
  author       = {{OdaxAI Research Team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/OdaxAI/DANTE-Mosaic-3.5B}},
}
```

---

## Acknowledgements

Compute: 8× **NVIDIA A100-class** GPUs on a public European HPC system. Open stack: Kimi K2, SmolLM3, HuggingFace Transformers, vLLM, lm-evaluation-harness v0.4.5, bigcode-evaluation-harness.

---

*OdaxAI provides this model "as is" for research. You are responsible for compliance with all applicable licenses when using or citing third-party benchmark numbers.*