---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- sft
- trl
- knowledge-distillation
- thinking
- longwriter
- convergent-intelligence
- convergentintel
- edge
- distillation
base_model:
- reaperdoesntknow/Disctil-Qwen3-1.7B
datasets:
- longwriter-6k
- 0xZee/dataset-CoT-Differential-Equations-636
- 0xZee/dataset-CoT-Linear-Algebra-667
---

# Qwen3-1.7B-Thinking-Distil

**Extended Reasoning Distillation from Qwen3-30B-A3B-Thinking → 1.7B**

*Convergent Intelligence LLC: Research Division*

---

## What This Is

Qwen3-1.7B-Thinking-Distil is among the most downloaded models in the Convergent Intelligence portfolio. It captures extended deliberation patterns from the Qwen3-30B-A3B **Thinking** teacher (the variant that generates long-form reasoning chains before committing to an answer) and compresses them into a 1.7B student via supervised fine-tuning on the [longwriter-6k](https://huggingface.co/datasets/longwriter-6k) dataset.

The Thinking teacher produces the **richest signal** of the three teacher variants in the DistilQwen family (Instruct, Thinking, Coder). Where Instruct distillation captures clean instruction-following and Coder captures hierarchical decomposition, Thinking distillation captures the extended internal monologue: the model reasoning through uncertainty, backtracking, and re-evaluating before arriving at a conclusion. That deliberative depth is the most likely reason this variant sits at or near the top of the collection by downloads.

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV) — GQA |
| Intermediate | 6144 |
| Head Dimension | 128 |
| Context Length | 40,960 tokens (max position) |
| Vocabulary | 151,936 |
| Precision | BF16 |
| Activation | SiLU |
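
The two parameter figures in the table can be roughly reconciled with a back-of-the-envelope count. The sketch below is an illustration, not taken from the model's source. It assumes the usual Qwen3 dense layout: bias-free attention projections, a gated three-matrix MLP, two RMSNorm weights per layer, and per-head query/key norms. Under those assumptions the count lands near 1.72B when the output head shares the embedding matrix and near 2.03B when the two are counted separately, which is one plausible reading of "~2.03B (1.7B effective)". The KV-cache helper is likewise an estimate for BF16 (2 bytes per value).

```python
# Values copied from the Architecture table above.
HIDDEN = 2048
LAYERS = 28
Q_HEADS = 16
KV_HEADS = 8
HEAD_DIM = 128
INTERMEDIATE = 6144
VOCAB = 151_936


def count_parameters(tied_embeddings: bool) -> int:
    """Rough parameter count; assumes no bias terms (an assumption)."""
    embed = VOCAB * HIDDEN
    # Attention: Q and O projections span all query heads,
    # K and V projections span only the KV heads (GQA).
    attn = 2 * HIDDEN * (Q_HEADS * HEAD_DIM) + 2 * HIDDEN * (KV_HEADS * HEAD_DIM)
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * HIDDEN * INTERMEDIATE
    # Two RMSNorm weights per layer plus assumed per-head query/key norms.
    norms = 2 * HIDDEN + 2 * HEAD_DIM
    total = embed + LAYERS * (attn + mlp + norms) + HIDDEN  # + final norm
    if not tied_embeddings:
        total += VOCAB * HIDDEN  # separate output head
    return total


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence: keys and values for every layer and KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * tokens


if __name__ == "__main__":
    print(f"tied:   {count_parameters(True) / 1e9:.2f}B")
    print(f"untied: {count_parameters(False) / 1e9:.2f}B")
    print(f"KV cache at 40,960 tokens: {kv_cache_bytes(40_960) / 2**30:.3f} GiB")
```

Because only 8 of the 16 heads carry keys and values, the cache is half the size it would be with full multi-head attention, which matters on edge hardware.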

## Training

- **Teacher:** Qwen3-30B-A3B-Thinking
- **Student:** Qwen3-1.7B
- **Dataset:** longwriter-6k (long-form generation samples that preserve extended reasoning chains)
- **Method:** Supervised Fine-Tuning (SFT) via TRL

| Parameter | Value |
|-----------|-------|
| Max Sequence Length | 4,096 |
| Precision | BF16 |
| Framework | TRL (SFTTrainer) |
| Hardware | NVIDIA H100 |

The training captures the teacher's extended thinking traces through direct SFT rather than logit-level knowledge distillation (KD). This is a deliberate design choice: the longwriter-6k dataset provides naturally long reasoning samples where the signal is in the structure of the generation (how the teacher approaches, reconsiders, and resolves), not just the final token probabilities.
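
To make the SFT-versus-logit-KD distinction concrete, here is a minimal pure-Python sketch of the two objectives on toy logits. It is not the training code used for this model. The function names `sft_loss` and `kd_loss` are hypothetical, and the KD form shown (a temperature-scaled KL divergence from teacher to student) is one common formulation, assumed here for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sft_loss(student_logits, target_ids):
    """SFT: mean negative log-likelihood of the teacher-written tokens (hard targets)."""
    total = 0.0
    for logits, token_id in zip(student_logits, target_ids):
        total -= log_softmax(logits)[token_id]
    return total / len(target_ids)


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Logit-level KD: mean KL(teacher || student) over positions (soft targets)."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        s_logp = log_softmax([x / temperature for x in s])
        t_logp = log_softmax([x / temperature for x in t])
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(student_logits)


if __name__ == "__main__":
    # Two positions, vocabulary of three tokens (toy numbers).
    student = [[1.0, 0.5, -0.2], [0.1, 0.3, 2.0]]
    teacher = [[2.0, 0.1, -1.0], [0.0, 0.2, 3.0]]
    teacher_tokens = [0, 2]  # the tokens the teacher actually emitted
    print("SFT loss:", sft_loss(student, teacher_tokens))
    print("KD loss: ", kd_loss(student, teacher))
```

SFT needs only the teacher's generated text, so the teacher's full vocabulary distribution never has to be stored or aligned with the student's tokenizer. KD needs the teacher logits at every position.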

For the full topology-aware distillation pipeline (BV decomposition, jump detection, curriculum ordering), see [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen). This model is the SFT-direct variant: simpler and faster to train. Its download numbers suggest that the Thinking teacher's extended chains transfer well through pure SFT.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil"
)

messages = [
    {"role": "user", "content": "Explain why gradient descent can get stuck in saddle points but not local minima in high dimensions."}
]

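# Render the conversation with the model's chat template and open the assistant turn
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample with settings inside the ranges recommended under "Generation Tips"
output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15
)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))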
```

### Generation Tips

- **Temperature 0.6–0.8** works best for reasoning tasks — low enough for coherence, high enough to activate the extended deliberation patterns from the Thinking teacher.
- **Repetition penalty 1.1–1.2** prevents the model from getting caught in reasoning loops during long generations.
- **Max new tokens 1024–2048.** The model was trained with a 4,096-token maximum sequence length, so it can sustain long generations. Give it room.
- The model inherits the Thinking teacher's tendency to reason before answering. Let it.
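
Because the model reasons before answering, downstream code often wants the reasoning and the final answer separately. The helper below is a small sketch that assumes the decoded output wraps its reasoning in Qwen3-style `<think>` and `</think>` tags. That tag format is an assumption about this checkpoint's output, not something this card states, and `split_thinking` is a hypothetical name.

```python
def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Assumes Qwen3-style think tags. If the closing tag is absent,
    the whole text is treated as the answer and reasoning is empty.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Drop the opening tag if the model emitted one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    sample = "<think>\nCheck the Hessian eigenvalues first.\n</think>\n\nSaddle points dominate."
    reasoning, answer = split_thinking(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

If generation stops at the token limit before the closing tag appears, the helper returns an empty reasoning string and the raw text as the answer, so a caller may want to check for a dangling `<think>` in that case.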

## Distillation Position

```
Qwen3-30B-A3B-Thinking (teacher)
  ↓ SFT on longwriter-6k (4096 max seq)
Qwen3-1.7B-Thinking-Distil ← you are here
```

This model is the **direct SFT** path. The DistilQwen collection also includes models that go through additional refinement stages:

```
Qwen3-1.7B (base)
  → Qwen3-1.7B-Distilled-30B-A3B (Instruct teacher KD)
    → DiStil (uncensored SFT)
      → Disctil (DISC refinement)
        → TopologicalQwen (full TKD pipeline)
```

Different paths, different capabilities. This model prioritizes extended reasoning. TopologicalQwen prioritizes structural precision. The Coder variant prioritizes hierarchical decomposition. They're complementary.

## DistilQwen Collection

| Model | Downloads | What It Does |
|-------|-----------|-------------|
| **[Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil)** | **1,188** | **← this model. Thinking teacher SFT.** |
| [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen) | 1,134 | Full TKD pipeline. BV decomposition + DualMind format. |
| [DiStil-Qwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DiStil-Qwen3-1.7B-uncensored) | 1,030 | DISC-informed uncensored distillation. |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 966 | Coder teacher. Hierarchical problem solving. |
| [DistilQwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DistilQwen3-1.7B-uncensored) | 832 | Base uncensored variant. |

Full collection: [DistilQwen on HuggingFace](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)

## Methodology

Full methodology paper: **[Structure Over Scale: Proof-Weighted Knowledge Distillation](https://doi.org/10.57967/hf/8165)** (DOI: 10.57967/hf/8165)

Companion paper: **[Three Teachers to Dual Cognition](https://doi.org/10.57967/hf/8184)** (DOI: 10.57967/hf/8184) — covers the DualMind extension and ghost imprinting phenomenon.

## License

Apache 2.0 — same as the base Qwen3 model.


## Mathematical Foundations: Discrepancy Calculus (DISC)

This model's training pipeline is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division).

**The Core Operator:**

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$

For smooth $f$: $Df(x) = |f'(x)|$. For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
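
The smooth-case claim $Df(x) = |f'(x)|$ can be checked numerically. The sketch below approximates the averaged difference quotient with a midpoint rule at a small fixed $\varepsilon$ instead of taking the limit, so it is an approximation under that assumption. The function name `discrepancy` and the default step sizes are illustrative choices, not part of the DISC paper.

```python
import math


def discrepancy(f, x, eps=1e-3, n=1000):
    """Midpoint-rule estimate of (1/eps) * integral_x^{x+eps} |f(t) - f(x)| / |t - x| dt."""
    h = eps / n
    fx = f(x)
    total = 0.0
    for k in range(n):
        t = x + (k + 0.5) * h  # midpoint of the k-th subinterval
        total += abs(f(t) - fx) / abs(t - x)
    # (h / eps) * sum == sum / n
    return total / n


if __name__ == "__main__":
    # Smooth case: the estimate should be close to |cos(1)|.
    print("D sin at 1:", discrepancy(math.sin, 1.0), "vs |cos 1| =", abs(math.cos(1.0)))
    # Kink at 0: the one-sided quotient of |x| is exactly 1.
    print("D |x| at 0:", discrepancy(abs, 0.0))
```

For a function with a jump at $x$, the integrand behaves like $1/|t - x|$ and the estimate grows as `eps` shrinks, which is the kind of irregularity the operator is designed to flag.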

**The Mesh Fundamental Identity** — every BV function decomposes as:

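$$f(b) - f(a) = \underbrace{\int_a^b f'(x)\,dx}_{\text{smooth (AC)}} + \underbrace{\sum_{x \in J_f} \Delta f(x)}_{\text{jumps}} + \underbrace{D^c f(I)}_{\text{Cantor drift}}$$

Here $J_f$ is the jump set of $f$, $\Delta f(x)$ is the size of the jump at $x$, and $D^c f(I)$ is the Cantor part of the derivative measure evaluated on the interval $I = [a, b]$.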

Standard knowledge distillation captures only term 1. Topological Knowledge Distillation (TKD) preserves all three by treating the teacher's output distribution as a BV function and computing discrepancy energy, jump sets, and gap energy density before training begins.
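
As a toy illustration of the identity, the sketch below splits the increments of a sampled signal into a smooth part and a jump part and checks that they sum to the total change. This is not the TKD pipeline. The function name `decompose_increments` and the fixed `jump_threshold` are hypothetical, and a finite sample cannot exhibit a Cantor term, so only the first two terms appear.

```python
def decompose_increments(values, jump_threshold):
    """Split consecutive differences into a smooth sum and a list of (index, jump size).

    An increment counts as a jump when its magnitude exceeds `jump_threshold`
    (a hypothetical criterion chosen for illustration).
    """
    smooth_total = 0.0
    jumps = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if abs(delta) > jump_threshold:
            jumps.append((i, delta))
        else:
            smooth_total += delta
    return smooth_total, jumps


if __name__ == "__main__":
    signal = [0.0, 0.1, 0.2, 1.2, 1.3]  # gentle slope with one jump between index 2 and 3
    smooth, jumps = decompose_increments(signal, jump_threshold=0.5)
    jump_total = sum(size for _, size in jumps)
    print("smooth part:", smooth)
    print("jumps:", jumps)
    print("total change:", signal[-1] - signal[0], "reconstructed:", smooth + jump_total)
```

On sampled data the split depends entirely on the threshold, so this is best read as a picture of the bookkeeping rather than a detection method.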

## Citation

```bibtex
@misc{colca2026distilqwen,
  title={Structure Over Scale: Proof-Weighted Knowledge Distillation from Qwen3-30B to 1.7B},
  author={Colca, Roy},
  year={2026},
  doi={10.57967/hf/8165},
  publisher={Convergent Intelligence LLC: Research Division}
}
```

---

*Convergent Intelligence LLC: Research Division — 49 models, 22,598 downloads across the portfolio.*
*[Full portfolio](https://huggingface.co/reaperdoesntknow) | [DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c) | [DualMind Collection](https://huggingface.co/collections/reaperdoesntknow/dualmind-69c93f888c6e79ecc69cf41e)*

---

## Convergent Intelligence Portfolio

*Part of the [DistilQwen Series](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*

### Related Models

| Model | Downloads | Format |
|-------|-----------|--------|
| [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen) | 1,974 | BF16 |
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 1,903 | BF16 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 1,677 | BF16 |
| [DiStil-Qwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DiStil-Qwen3-1.7B-uncensored) | 1,602 | BF16 |
| [DistilQwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DistilQwen3-1.7B-uncensored) | 1,574 | BF16 |
| [Qwen3-1.7B-Distilled-30B-A3B](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B) | 1,138 | BF16 |

### Papers

| Paper | DOI |
|-------|-----|
| [Structure Over Scale](https://huggingface.co/reaperdoesntknow/Structure-Over-Scale) | 10.57967/hf/8165 |
| [Three Teachers to Dual Cognition](https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy) | 10.57967/hf/8184 |
| [Discrepancy Calculus](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) | 10.57967/hf/8194 |

---

*Last updated: 2026-03-31 by Convergent Intelligence LLC: Research Division*