library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
language: en
base_model: Qwen/Qwen3-0.6B
datasets:
  - 0xZee/dataset-CoT-Advanced-Calculus-268
  - 0xZee/dataset-CoT-Modern-Physics-177
  - 0xZee/dataset-CoT-Theoretical-Mechanics-307
  - 0xZee/dataset-CoT-Linear-Algebra-667
  - 0xZee/dataset-CoT-Electromagnetism-580
  - 0xZee/dataset-CoT-Molecular-Biology-71
  - 0xZee/dataset-CoT-Physiology-114
  - 0xZee/dataset-CoT-Classical-Mechanics-343
  - 0xZee/dataset-CoT-Differential-Equations-636
  - 0xZee/dataset-CoT-Physics-2254
  - 0xZee/dataset-CoT-Engineering-574
  - 0xZee/dataset-CoT-mathematics
tags: causal-lm, text-generation, distillation, knowledge-distillation, reasoning, chain-of-thought, mathematics, physics, engineering, stem, convergentintel, edge

Qwen3-0.6B STEM Proof Distilled (Thinking Teacher)

A 0.6B-parameter model distilled from Qwen3-30B-A3B-Thinking on 6,122 STEM chain-of-thought samples, a 50x reduction in total parameter count. The Thinking variant of the teacher produces longer, richer reasoning traces than the Instruct variant, so the distillation transfers deliberation structure, not just final answers, into the smallest Qwen3 student.

The result: a model under 500MB quantized that produces structured STEM derivations because a 30B thinking model showed it how to reason.

"Structure beats scale." — Convergent Intelligence LLC: Research Division

What Makes This Different

Two key differences from standard small-model distillation:

1. Thinking teacher, not Instruct teacher. The Qwen3-30B-A3B-Thinking variant generates extended internal reasoning before committing to an answer. Its next-token distributions tend to be higher-entropy, spreading probability over more candidate reasoning paths at each step. Softening them further at distillation temperature T=2.0 exposes the 0.6B student to a richer set of alternative derivation strategies than an Instruct teacher would provide. The student learns the deliberation, not just the answer.

2. Proof-weighted loss. Tokens inside the derivation region (from the "Proof:" marker to the "Final Answer:" marker) receive a 2.5x loss weight that decays linearly to 1.5x over training. The model is therefore penalized more for errors in reasoning steps than for errors in answer formatting. At 0.6B, every parameter has to count, and proof weighting is meant to steer that capacity toward reasoning rather than boilerplate reproduction.
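The proof weighting described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: the function names (`proof_weight`, `proof_span`, `token_weights`) are hypothetical, the marker tokens themselves are assumed to stay at weight 1.0, and whitespace-split "tokens" stand in for real tokenizer character offsets.

```python
import re

PROOF_MARKER = "Proof:"
ANSWER_MARKER = "Final Answer:"


def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Linearly decay the proof weight from `start` to `end` over training."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def proof_span(text):
    """Character range between the "Proof:" and "Final Answer:" markers."""
    start = text.find(PROOF_MARKER)
    if start == -1:
        return (0, 0)  # no derivation region: nothing gets amplified
    start += len(PROOF_MARKER)
    end = text.find(ANSWER_MARKER, start)
    return (start, len(text) if end == -1 else end)


def token_weights(text, offsets, weight):
    """Per-token loss weights: `weight` inside the proof span, 1.0 elsewhere."""
    lo, hi = proof_span(text)
    return [weight if lo <= s < hi else 1.0 for s, _ in offsets]


# Toy example: whitespace "tokens" stand in for real tokenizer offsets.
text = "Problem:\nx\n\nProof:\na b\n\nFinal Answer:\n42"
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
weights = token_weights(text, offsets, proof_weight(step=0, total_steps=100))
```

In this toy case only the two derivation tokens ("a" and "b") receive the amplified weight; everything before "Proof:" and from "Final Answer:" onward stays at 1.0.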

Model Details

Attribute | Value
Architecture | Qwen3 (causal LM, RoPE, GQA)
Parameters | 0.6B
Base model | Qwen/Qwen3-0.6B
Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507
Compression ratio | 50x (30B → 0.6B)
Context length | 1024 tokens (training)
Precision | bf16
License | Apache 2.0
Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division

Training

Loss Function

  1. Proof-Weighted Cross-Entropy (55%) — Amplified weight on derivation tokens (2.5x → 1.5x linear decay)
  2. Knowledge Distillation KL Divergence (45%) — Student/teacher softmax divergence at T=2.0, scaled by T²

Combined: L = 0.55 * CE_weighted + 0.45 * KD_kl
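To make the combined objective concrete, here is a minimal single-token sketch in plain Python. It assumes the conventional distillation direction KL(teacher || student) with both distributions softened at temperature T, and a cross-entropy term computed at T=1; the card does not state either detail, and `token_loss` is a hypothetical name, not the training code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_loss(student_logits, teacher_logits, target, weight=1.0,
               temperature=2.0, alpha_ce=0.55, alpha_kd=0.45):
    """L = 0.55 * CE_weighted + 0.45 * KD_kl for one token position."""
    # Proof-weighted cross-entropy on the hard label (assumed at T=1).
    ce = -weight * math.log(softmax(student_logits)[target])
    # KL(teacher || student) on temperature-softened distributions (assumed direction).
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    kd = temperature ** 2 * kl  # T^2 scaling, as stated in the loss description
    return alpha_ce * ce + alpha_kd * kd


# When student and teacher agree, the KD term vanishes and only weighted CE remains.
same = token_loss([0.0, 0.0], [0.0, 0.0], target=0)
diff = token_loss([0.0, 0.0], [2.0, -2.0], target=0)
```

A batch-level loss would average this over token positions; how that average is normalized when weights differ per token is not specified in the card.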

Hyperparameters

Parameter | Value
Epochs | 1
Training samples | 5,815 (95% of 6,122)
Eval samples | 307 (5% held out)
Effective batch size | 8
Optimizer | AdamW (weight decay 0.01)
Learning rate | 1.5e-5 → 1e-6 (cosine, 30-step warmup)
Gradient clipping | 1.0
Temperature | 2.0
Proof weight | 2.5 → 1.5
Precision | bf16
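The split and schedule in the table can be reproduced with a short sketch. The assumptions here are mine, not the card's: the 5% eval split is rounded up, one optimizer step is taken per effective batch of 8, warmup is linear, and `lr_at` is a hypothetical helper rather than the scheduler actually used.

```python
import math

n_total = 6122
n_eval = math.ceil(0.05 * n_total)   # assumed ceiling rounding
n_train = n_total - n_eval
total_steps = math.ceil(n_train / 8)  # 1 epoch, effective batch size 8


def lr_at(step, total_steps, peak=1.5e-5, floor=1e-6, warmup=30):
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup - 1)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


schedule = [lr_at(s, total_steps) for s in range(total_steps)]
```

Under these assumptions the run is roughly 727 optimizer steps, so the 30-step warmup covers about 4% of training.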

Dataset

6,122 STEM CoT samples from 12 domains (Physics 2,254 / Linear Algebra 667 / Differential Equations 636 / Electromagnetism 580 / Mathematics 576 / Engineering 574 / Classical Mechanics 343 / Theoretical Mechanics 307 / Advanced Calculus 268 / Modern Physics 177 / Physiology 114 / Molecular Biology 71). All from 0xZee.

Training Format

Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}
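The template above maps directly onto a pair of small string builders. This is a sketch with hypothetical function names; the field names `question`, `cot`, and `response` follow the placeholders in the template, and the exact whitespace is inferred from the layout shown.

```python
INSTRUCTION = "Solve the following problem carefully and show a rigorous derivation."


def build_prompt(question):
    """Inference prompt: stops right after "Proof:" so the model continues from there."""
    return f"{INSTRUCTION}\n\nProblem:\n{question}\n\nProof:\n"


def build_training_example(question, cot, response):
    """Full training text: prompt, derivation, then the final answer."""
    return f"{build_prompt(question)}{cot}\n\nFinal Answer:\n{response}"


example = build_training_example(
    question="Differentiate f(x) = x^2.",
    cot="By the power rule, d/dx x^n = n x^(n-1), so f'(x) = 2x.",
    response="f'(x) = 2x",
)
```

Keeping the inference prompt as an exact prefix of the training text means the model sees the same context at generation time as it did during training.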

Usage

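# Requires `transformers` and `torch`; device_map="auto" also needs `accelerate`.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # bf16 on GPU (the training precision); fall back to fp32 on CPU.
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Mirror the training format and end at "Proof:" so the model writes the derivation.
prompt = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Find the eigenvalues of the matrix [[3, 1], [0, 3]].

Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; a short prompt plus 512 new tokens stays within the
# 1024-token training context.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text contains the prompt followed by the generated derivation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))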
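Because the decoded output follows the training format, the derivation and the final answer can be separated with plain string handling. The helper below is a hypothetical sketch, and the sample text is a hand-written illustration of the format, not actual model output.

```python
from typing import Optional, Tuple


def split_derivation(generated: str) -> Tuple[str, Optional[str]]:
    """Split decoded text into (proof, final_answer).

    `final_answer` is None when generation stopped before "Final Answer:",
    for example because max_new_tokens was reached.
    """
    head, marker, tail = generated.partition("Proof:")
    body = tail if marker else head  # tolerate text without the "Proof:" marker
    proof, answer_marker, answer = body.partition("Final Answer:")
    return proof.strip(), (answer.strip() if answer_marker else None)


# Hand-written sample in the training format (not real model output).
sample = (
    "Problem:\nFind the eigenvalues of the matrix [[3, 1], [0, 3]].\n\n"
    "Proof:\ndet(A - tI) = (3 - t)^2, which vanishes only at t = 3.\n\n"
    "Final Answer:\n3 (algebraic multiplicity 2)"
)
proof, answer = split_derivation(sample)
```

Checking for a `None` answer is a cheap way to detect truncated generations before passing results downstream.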

Intended Uses

Good for: Lightweight STEM reasoning on edge/mobile devices, educational tutoring, proof drafting, IoT and embedded inference, and use as a component in multi-model pipelines that need a small, fast reasoner.

Not for: Formal proof verification, safety-critical analysis, medical or legal advice, or tasks requiring long-context reasoning beyond 1024 tokens.

Limitations

0.6B is a hard capacity constraint. The model will struggle with multi-step proofs requiring more than ~8 reasoning steps, complex multi-variable problems, or domains underrepresented in training data (molecular biology, physiology). It will sometimes generate plausible but incorrect intermediate steps. Always verify.

Mathematical Foundations: Discrepancy Calculus (DISC)

This model is part of a distillation chain built on Discrepancy Calculus, a measure-theoretic framework in which the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} \, dt$$

quantifies local structural mismatch that standard KL divergence averages away.
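As a quick numerical reading of the operator's definition, the sketch below approximates Df(x) at a small fixed ε with a midpoint rule. It only illustrates the formula as written and makes no claim about how the quantity enters training; `discrepancy` is a hypothetical name. For a differentiable function the integrand tends to |f'(x)|, so the estimate should approach that value, while at a jump the integrand behaves like 1/(t - x) and the estimate grows without bound as ε shrinks.

```python
def discrepancy(f, x, eps=1e-4, n=1000):
    """Midpoint-rule estimate of (1/eps) * integral over [x, x+eps] of |f(t)-f(x)| / |t-x| dt."""
    h = eps / n
    fx = f(x)
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h  # midpoints never coincide with x, so no division by zero
        total += abs(f(t) - fx) / abs(t - x)
    return total * h / eps


def square(t):
    return t * t


def step(t):
    return 1.0 if t > 0 else 0.0


smooth_estimate = discrepancy(square, 1.0)  # close to |f'(1)| = 2
jump_estimate = discrepancy(step, 0.0)      # large: the integrand is 1/t near the jump
```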

Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).

Model | Description
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | This model + legal SFT
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | Quantized for edge deployment
Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher)

Citation

@misc{colca2026distilled06b,
  title={Qwen3-0.6B STEM Proof Distilled: 50x Compression from a Thinking Teacher},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking},
  note={Convergent Intelligence LLC: Research Division}
}

Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."


Convergent Intelligence Portfolio

Part of the Qwen3 0.6B Distillation Series by Convergent Intelligence LLC: Research Division

Model | Downloads | Format
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | 33 | HF
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | 203 | GGUF

Top Models from Our Lab

Model | Downloads
Qwen3-1.7B-Thinking-Distil | 501
LFM2.5-1.2B-Distilled-SFT | 342
Qwen3-1.7B-Coder-Distilled-SFT | 302
Qwen3-1.7B-Coder-Distilled-SFT-GGUF | 194
Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | 175

Total Portfolio: 41 models | 2,781 total downloads

Last updated: 2026-03-28 12:56 UTC

DistilQwen Collection

This model is part of the DistilQwen proof-weighted distillation series. Collection: 9 models | 2,788 downloads

Teacher Variant Comparison

Teacher | Student Size | Strength | Models
Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL)
Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) ← this model
Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL)

Methodology

The only BF16 collection in the portfolio. While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.

All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)

Description
Model synced from source: reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B