---
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-0.6B
datasets:
- 0xZee/dataset-CoT-Advanced-Calculus-268
- 0xZee/dataset-CoT-Modern-Physics-177
- 0xZee/dataset-CoT-Theoretical-Mechanics-307
- 0xZee/dataset-CoT-Linear-Algebra-667
- 0xZee/dataset-CoT-Electromagnetism-580
- 0xZee/dataset-CoT-Molecular-Biology-71
- 0xZee/dataset-CoT-Physiology-114
- 0xZee/dataset-CoT-Classical-Mechanics-343
- 0xZee/dataset-CoT-Differential-Equations-636
- 0xZee/dataset-CoT-Physics-2254
- 0xZee/dataset-CoT-Engineering-574
- 0xZee/dataset-CoT-mathematics
- Alignment-Lab-AI/Lawyer-Instruct
tags:
- causal-lm
- text-generation
- distillation
- knowledge-distillation
- sft
- reasoning
- chain-of-thought
- mathematics
- physics
- engineering
- legal
- stem
- convergentintel
- edge
---

# Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT

A 0.6B-parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression by total parameter count. Under 500MB quantized. Runs on a phone.

The training order is the thesis: teach the model *how to reason* first (distillation from the Thinking teacher), then teach it *what to reason about* (legal SFT). The Thinking teacher's extended deliberation traces transfer deeper reasoning structure than an Instruct teacher's outputs, which is critical when the student has only 0.6B parameters to work with.

> *"Structure beats scale, collaboration beats hierarchy, observation beats theory."*
> -- Convergent Intelligence LLC: Research Division

## Training Pipeline

### Stage 1: Knowledge Distillation (STEM Reasoning Backbone)

Qwen3-0.6B was distilled from [Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507), a Mixture-of-Experts model with 30B total parameters and ~3B active per token. It is the Thinking variant, which generates extended internal reasoning traces.

**Why the Thinking teacher matters at 0.6B:** The Thinking variant produces higher-entropy softmax distributions than the Instruct variant: it considers more reasoning paths before committing. At distillation temperature T=2.0, the 0.6B student sees a richer landscape of alternative derivation strategies. With only 0.6B parameters, every bit of transferred structure counts. The Thinking teacher gives more.
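
To make the temperature effect concrete, here is a minimal sketch in plain Python of how dividing logits by T=2.0 flattens a softmax and raises its entropy. The logits are made-up illustration values, not real teacher outputs, and `softmax` and `entropy` are helper names introduced only for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher logits for four candidate next tokens.
teacher_logits = [4.0, 2.0, 1.0, 0.5]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=2.0)

# The softened distribution spreads mass onto the lower-ranked tokens,
# which is the extra signal the student sees during distillation.
print(f"T=1.0  top prob {max(sharp):.3f}  entropy {entropy(sharp):.3f} nats")
print(f"T=2.0  top prob {max(soft):.3f}  entropy {entropy(soft):.3f} nats")
```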

**Data:** 6,122 STEM chain-of-thought samples across 12 domains:

| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |

All from [0xZee](https://huggingface.co/0xZee). Shuffled with seed 42, then split 95/5 train/eval.
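
The shuffle-and-split step can be sketched as follows. This is an illustration under assumptions: it uses Python's `random` module on sample indices, whereas the card does not say which tool performed the split, so the resulting index order will not match the real run. With a floor on the training share, 6,122 samples give 5,815 for training and 307 for evaluation, consistent with the training sample count reported in the Stage 1 hyperparameters.

```python
import random

TOTAL_SAMPLES = 6122   # stated Stage 1 sample count
SEED = 42              # stated shuffle seed
EVAL_FRACTION = 0.05   # stated 95/5 split

def shuffle_and_split(n_samples, seed, eval_fraction):
    """Return (train_indices, eval_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_train = int(n_samples * (1.0 - eval_fraction))  # floor of the train share
    return indices[:n_train], indices[n_train:]

train_idx, eval_idx = shuffle_and_split(TOTAL_SAMPLES, SEED, EVAL_FRACTION)
print(len(train_idx), len(eval_idx))
```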

**Loss function:**

1. **Proof-Weighted Cross-Entropy (55%)**: 2.5x weight on derivation tokens, decaying to 1.5x. Forces the student to allocate its limited capacity to reasoning steps, not answer formatting.
2. **Knowledge Distillation KL Divergence (45%)**: T=2.0, scaled by T². Transfers the Thinking teacher's full deliberation landscape.
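
The two terms above can be sketched for a single sequence in plain Python. Several details are assumptions made for illustration, not taken from the card: the linear shape of the proof-weight decay, normalising the cross-entropy by the sum of token weights, averaging the KL term per token, and the `is_proof_token` mask that marks the derivation span. The function names are hypothetical, and a real training run would use a tensor library rather than lists.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of a temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]

def proof_weight(progress, start=2.5, end=1.5):
    """Proof weight at training progress in [0, 1]; linear decay is an assumption."""
    return start + (end - start) * progress

def distillation_loss(student_logits, teacher_logits, targets, is_proof_token,
                      progress, temperature=2.0, ce_share=0.55, kd_share=0.45):
    """0.55 * proof-weighted CE + 0.45 * T^2 * KL(teacher || student)."""
    w_proof = proof_weight(progress)
    ce_sum, weight_sum, kl_sum = 0.0, 0.0, 0.0
    for s, t, y, in_proof in zip(student_logits, teacher_logits,
                                 targets, is_proof_token):
        # Hard-label cross-entropy, up-weighted inside the derivation span.
        w = w_proof if in_proof else 1.0
        ce_sum += -w * log_softmax(s)[y]
        weight_sum += w
        # KL between the temperature-softened teacher and student distributions.
        log_p_t = log_softmax(t, temperature)
        log_p_s = log_softmax(s, temperature)
        kl_sum += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    ce = ce_sum / weight_sum
    kd = (temperature ** 2) * kl_sum / len(targets)
    return ce_share * ce + kd_share * kd

# Toy two-token sequence over a three-token vocabulary (made-up numbers).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[3.0, 0.0, -2.0], [0.0, 1.5, 0.5]]
loss = distillation_loss(student, teacher, targets=[0, 1],
                         is_proof_token=[True, False], progress=0.0)
print(f"{loss:.4f}")
```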

**Training format:**

```
Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}
```
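
As a rough sketch, a record can be rendered into this template and the derivation span located for proof weighting as shown below. The field names mirror the placeholders in the template; the example record, `format_stage1_example`, and `proof_span` are hypothetical and only illustrate the layout.

```python
STAGE1_TEMPLATE = (
    "Solve the following problem carefully and show a rigorous derivation.\n"
    "\n"
    "Problem:\n"
    "{question}\n"
    "\n"
    "Proof:\n"
    "{CoT}\n"
    "\n"
    "Final Answer:\n"
    "{response}"
)

def format_stage1_example(record):
    """Render one dataset record into the Stage 1 training text."""
    return STAGE1_TEMPLATE.format(
        question=record["question"].strip(),
        CoT=record["CoT"].strip(),
        response=record["response"].strip(),
    )

def proof_span(text):
    """Character offsets (start, end) of the derivation between the two markers."""
    start = text.index("Proof:\n") + len("Proof:\n")
    end = text.index("\n\nFinal Answer:")
    return start, end

# Made-up record for illustration only.
example = {
    "question": "Differentiate f(x) = x^2.",
    "CoT": "Apply the power rule d/dx x^n = n x^(n-1), so f'(x) = 2x.",
    "response": "f'(x) = 2x",
}

text = format_stage1_example(example)
start, end = proof_span(text)
print(text)
print("derivation span:", repr(text[start:end]))
```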

**Stage 1 hyperparameters:**

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
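
For orientation, the learning-rate row can be read as a cosine decay from 1.5e-5 to a floor of 1e-6 over one epoch. The sketch below assumes no warmup and that the final partial batch is kept, neither of which is stated in the table; `cosine_lr` is a hypothetical helper. With 5,815 samples and an effective batch size of 8, that works out to 727 optimizer steps.

```python
import math

PEAK_LR = 1.5e-5        # starting learning rate from the table
FLOOR_LR = 1e-6         # final learning rate from the table
TRAIN_SAMPLES = 5815
EFFECTIVE_BATCH = 8

def cosine_lr(step, total_steps, peak=PEAK_LR, floor=FLOOR_LR):
    """Cosine decay from peak at step 0 to floor at the last step (no warmup)."""
    progress = step / max(1, total_steps - 1)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# One epoch, keeping the last partial batch (an assumption).
total_steps = math.ceil(TRAIN_SAMPLES / EFFECTIVE_BATCH)

for step in (0, total_steps // 2, total_steps - 1):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```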

---

### Stage 2: Supervised Fine-Tuning (Legal Domain)

The distilled model was fine-tuned on [Alignment-Lab-AI/Lawyer-Instruct](https://huggingface.co/datasets/Alignment-Lab-AI/Lawyer-Instruct) using TRL's SFTTrainer.

**Why legal on top of STEM:** Legal reasoning is structurally isomorphic to mathematical reasoning: premise identification, logical chaining, exception handling, and structured argumentation toward a conclusion. A model that has learned rigorous derivation transfers that structure to legal analysis rather than learning legal templates from scratch.

**Training format:**

```
### Instruction:
{instruction}

### Response:
{output}
```

**Stage 2 hyperparameters:**

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |

## Model Details

| Attribute | Value |
|---|---|
| **Architecture** | Qwen3 (causal LM, RoPE, GQA) |
| **Parameters** | 0.6B |
| **Base model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Teacher model** | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
| **Compression ratio** | 50x |
| **Stage 1 data** | 6,122 STEM CoT samples (12 datasets) |
| **Stage 2 data** | [Alignment-Lab-AI/Lawyer-Instruct](https://huggingface.co/datasets/Alignment-Lab-AI/Lawyer-Instruct) |
| **Context length** | 1024 tokens (training) |
| **License** | Apache 2.0 |
| **Developer** | Reaperdoesntrun / [Convergent Intelligence LLC](https://convergentintel.com): Research Division |
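
The 50x figure and the sub-500MB footprint follow from simple arithmetic, sketched below. The calculation uses the nominal 0.6B and 30B parameter counts and counts weights only; the bit widths are generic examples rather than the specific quantization types published for this model, and `weight_megabytes` is a hypothetical helper.

```python
STUDENT_PARAMS = 0.6e9   # nominal student size from the table
TEACHER_PARAMS = 30e9    # teacher's total (not active) parameter count

def weight_megabytes(n_params, bits_per_weight):
    """Size of the weights alone, in MB (10^6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6

for label, bits in [("bf16", 16), ("8-bit", 8), ("6-bit", 6), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_megabytes(STUDENT_PARAMS, bits):6.0f} MB")

print(f"compression: {TEACHER_PARAMS / STUDENT_PARAMS:.0f}x")
```

On these nominal numbers, anything at roughly 6 bits per weight or below fits under 500MB. Actual quantized files will differ somewhat, since they typically keep some tensors at higher precision and carry metadata.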

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # bf16 on GPU, fp32 on CPU
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following
prompt = """### Instruction:
What is the difference between a felony and a misdemeanor?

### Response:
"""

# STEM derivation (Stage 1 format still works)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Compute the determinant of the matrix [[1, 2], [3, 4]].

Proof:
"""

# Pass prompt_stem instead of prompt to run the STEM example.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding; the output includes the prompt followed by the completion.
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### GGUF

Quantized versions are available at [reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF).

## Prompt Formats

**STEM derivation (Stage 1):**

```
Solve the following problem carefully and show a rigorous derivation.

Problem:
[Your problem]

Proof:
```

**Instruction-following (Stage 2):**

```
### Instruction:
[Your question]

### Response:
```
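
A small sketch of helpers for building these prompts and splitting a STEM completion into derivation and final answer. The helper names are hypothetical, and the decoded string is a hand-written stand-in for `tokenizer.decode(...)`, not real model output. The splitter assumes the model echoes the prompt verbatim and emits the `Final Answer:` marker from the training format, which a 0.6B model may not always do.

```python
def build_stem_prompt(problem):
    """Stage 1 prompt, ending at the 'Proof:' marker for the model to continue."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n\n"
        f"Problem:\n{problem.strip()}\n\n"
        "Proof:\n"
    )

def build_instruction_prompt(question):
    """Stage 2 prompt in the Instruction/Response layout."""
    return f"### Instruction:\n{question.strip()}\n\n### Response:\n"

def split_stem_completion(decoded, prompt):
    """Drop the echoed prompt, then split the derivation from the final answer."""
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    proof, marker, answer = completion.partition("Final Answer:")
    # answer is None when the marker never appeared in the completion.
    return proof.strip(), (answer.strip() if marker else None)

stem_prompt = build_stem_prompt("Compute the determinant of the matrix [[1, 2], [3, 4]].")
# Hand-written stand-in for the decoded generation.
decoded = stem_prompt + "det = 1*4 - 2*3 = -2\n\nFinal Answer:\n-2"

proof, answer = split_stem_completion(decoded, stem_prompt)
print("proof:", proof)
print("answer:", answer)
```

Returning `None` for a missing marker lets calling code distinguish a truncated derivation (for example, one cut off by `max_new_tokens`) from a completed one.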

## Intended Uses

**Good for:** Ultra-lightweight reasoning on mobile/edge/IoT, legal and STEM instruction-following, educational tutoring, embedded inference, use as a component in multi-model pipelines, and anywhere you need reasoning in under 500MB.

**Not for:** Formal proof verification, actual legal counsel, safety-critical analysis, complex multi-step proofs (>8 steps), or long-context tasks beyond 1024 tokens.

## Limitations

0.6B is a hard capacity constraint. The model trades depth for deployability. It will make reasoning errors that a larger model would not. Multi-step derivations beyond ~8 steps degrade. Legal reasoning covers general concepts but lacks the nuance of larger models. Performance is weakest on underrepresented domains (molecular biology, physiology). Always verify outputs.

## Mathematical Foundations: Discrepancy Calculus (DISC)

This model is part of a distillation chain built on Discrepancy Calculus, a measure-theoretic framework in which the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
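
As a purely numerical illustration of the operator as written above (not of how it is used in the training pipeline, which this card does not describe), the sketch below estimates the averaged difference quotient with a midpoint rule for shrinking ε. For a differentiable function the estimate settles near |f'(x)|, while at a jump it grows without bound as ε shrinks, which is the sense in which the operator separates smooth behaviour from jumps. The function names and test functions are assumptions chosen for the example.

```python
def discrepancy(f, x, eps, n=10_000):
    """Midpoint-rule estimate of (1/eps) * integral_x^{x+eps} |f(t) - f(x)| / |t - x| dt."""
    h = eps / n
    fx = f(x)
    total = 0.0
    for k in range(n):
        t = x + (k + 0.5) * h  # midpoints never hit the singular point t = x
        total += abs(f(t) - fx) / (t - x)
    return total * h / eps

def smooth(t):
    """Differentiable everywhere; derivative at t = 1 is 2."""
    return t * t

def step(t):
    """Unit jump immediately to the right of t = 1."""
    return 1.0 if t > 1.0 else 0.0

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}  smooth: {discrepancy(smooth, 1.0, eps):.4f}  "
          f"step: {discrepancy(step, 1.0, eps):.1f}")
```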

Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165).

## Related Models

| Model | Description |
|---|---|
| [Qwen3-0.6B-STEM-Proof-Distilled-Thinking](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-STEM-Proof-Distilled-Thinking) | Stage 1 only: pure STEM backbone |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | This model, quantized for edge deployment |
| [Qwen3-1.7B-STEM-Proof-Distilled](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-STEM-Proof-Distilled) | Larger 1.7B variant (Instruct teacher) |
| [Qwen3-1.7B-Distilled-30B-A3B-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT) | Larger 1.7B variant + legal SFT |

## Citation

```bibtex
@misc{colca2026thinking06bsft,
  title={Two-Stage Reasoning Transfer at 0.6B: Thinking Teacher Distillation + Legal SFT},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```

---

*Convergent Intelligence LLC: Research Division*
*"Where classical analysis fails to see, we begin."*

---

## Convergent Intelligence Portfolio

*Part of the [Qwen3 0.6B Distillation Series](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*

## Related Models

| Model | Downloads | Format |
|-------|-----------|--------|
| [Qwen3-0.6B-Distilled-30B-A3B](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B) | 36 | HF |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 | GGUF |

### Top Models from Our Lab

| Model | Downloads |
|-------|-----------|
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |
| [Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF) | 175 |

**Total Portfolio: 41 models | 2,781 total downloads**

*Last updated: 2026-03-28 12:56 UTC*

<!-- DISTILQWEN-SPOTLIGHT-START -->

## DistilQwen Collection

This model is part of the **[DistilQwen](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)** proof-weighted distillation series.
Collection: **9 models** | **2,788 downloads**

### Teacher Variant Comparison

| Teacher | Student Size | Strength | Models |
|---------|-------------|----------|--------|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) **← this model** |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |

### Methodology

**The only BF16 collection in the portfolio.** While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.

All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.

Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

### Related in this series

- [Qwen3-0.6B-Distilled-30B-A3B](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B) (236 downloads)
- [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) (316 downloads)

<!-- DISTILQWEN-SPOTLIGHT-END -->
<!-- cix-keeper-ts:2026-06-12T13:16:20Z -->
<!-- card-refresh: 2026-03-30 -->