Initialize project; model provided by the ModelHub XC community

Model: reaperdoesntknow/Qwen3-1.7B-Thinking-Distil Source: Original Platform
2026-06-13 10:53:16 +08:00
commit cd18160901
8 changed files with 451 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,216 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- qwen3
+- sft
+- trl
+- knowledge-distillation
+- thinking
+- longwriter
+- convergent-intelligence
+- convergentintel
+- edge
+- distillation
+base_model:
+- reaperdoesntknow/Disctil-Qwen3-1.7B
+datasets:
+- longwriter-6k
+- 0xZee/dataset-CoT-Differential-Equations-636
+- 0xZee/dataset-CoT-Linear-Algebra-667
+---
+
+# Qwen3-1.7B-Thinking-Distil
+
+**Extended Reasoning Distillation from Qwen3-30B-A3B-Thinking → 1.7B**
+
+*Convergent Intelligence LLC: Research Division*
+
+---
+
+## What This Is
+
+Qwen3-1.7B-Thinking-Distil is the most downloaded model in the Convergent Intelligence portfolio. It captures extended deliberation patterns from the Qwen3-30B-A3B **Thinking** teacher (the variant that generates long-form reasoning chains before committing to an answer) and compresses them into a 1.7B student via supervised fine-tuning on the [longwriter-6k](https://huggingface.co/datasets/longwriter-6k) dataset.
+
+The Thinking teacher produces the **richest signal** of the three teacher variants in the DistilQwen family (Instruct, Thinking, Coder). Where Instruct distillation captures clean instruction-following and Coder captures hierarchical decomposition, Thinking distillation captures the extended internal monologue: the model reasoning through uncertainty, backtracking, and re-evaluating before arriving at a conclusion. That deliberative depth is the likely reason this variant is the most downloaded model in the collection.
+
+## Architecture
+
+| Parameter | Value |
+|-----------|-------|
+| Architecture | Qwen3ForCausalLM |
+| Parameters | ~2.03B (1.7B effective) |
+| Hidden Size | 2048 |
+| Layers | 28 |
+| Attention Heads | 16 (Q) / 8 (KV) — GQA |
+| Intermediate | 6144 |
+| Head Dimension | 128 |
+| Context Length | 40,960 tokens (max position) |
+| Vocabulary | 151,936 |
+| Precision | BF16 |
+| Activation | SiLU |
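+
+As a rough cross-check of the table, the parameter count can be recomputed from the listed dimensions. The sketch below assumes the standard Qwen3 decoder layout (bias-free attention and MLP projections, per-head RMSNorm on queries and keys, a gated three-matrix MLP), which this card does not spell out. It also assumes that the gap between ~2.03B and 1.7B comes from counting the output projection separately from the input embedding; that reading is an interpretation, not something the card states. The function name `count_parameters` is hypothetical.
+
+```python
+# Dimensions copied from the Architecture table above.
+HIDDEN = 2048
+LAYERS = 28
+Q_HEADS = 16
+KV_HEADS = 8
+HEAD_DIM = 128
+INTERMEDIATE = 6144
+VOCAB = 151_936
+
+
+def count_parameters(tie_embeddings: bool) -> int:
+    """Count weights under the assumed Qwen3 layout (no projection biases)."""
+    q_dim = Q_HEADS * HEAD_DIM
+    kv_dim = KV_HEADS * HEAD_DIM
+    # q_proj, k_proj, v_proj, o_proj
+    attention = HIDDEN * q_dim + 2 * HIDDEN * kv_dim + q_dim * HIDDEN
+    qk_norm = 2 * HEAD_DIM            # RMSNorm weights on queries and keys
+    mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up, and down projections
+    layer_norms = 2 * HIDDEN          # pre-attention and pre-MLP RMSNorm
+    per_layer = attention + qk_norm + mlp + layer_norms
+    embedding = VOCAB * HIDDEN
+    total = LAYERS * per_layer + HIDDEN + embedding  # HIDDEN = final RMSNorm
+    if not tie_embeddings:
+        total += VOCAB * HIDDEN       # separate output projection (assumption)
+    return total
+
+
+tied = count_parameters(tie_embeddings=True)
+untied = count_parameters(tie_embeddings=False)
+print(f"tied embeddings:   {tied / 1e9:.2f}B")
+print(f"untied embeddings: {untied / 1e9:.2f}B")
+```
+
+Under these assumptions the two totals come out near 1.72B and 2.03B, in line with both figures in the Parameters row.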
+
+## Training
+
+- **Teacher:** Qwen3-30B-A3B-Thinking
+- **Student:** Qwen3-1.7B
+- **Dataset:** longwriter-6k, long-form generation samples that preserve extended reasoning chains
+- **Method:** Supervised Fine-Tuning (SFT) via TRL
+
+| Parameter | Value |
+|-----------|-------|
+| Max Sequence Length | 4,096 |
+| Precision | BF16 |
+| Framework | TRL (SFTTrainer) |
+| Hardware | NVIDIA H100 |
+
+The training captures the teacher's extended thinking traces through direct SFT rather than logit-level KD. This is a deliberate design choice — the longwriter-6k dataset provides naturally long reasoning samples where the signal is in the structure of the generation (how the teacher approaches, reconsiders, and resolves), not just the final token probabilities.
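+
+To make the distinction concrete, the sketch below contrasts the two objectives on a toy three-token vocabulary using only the standard library. The function names and logits are illustrative assumptions, not code from the actual training run: SFT scores the student against the single token the teacher emitted, while logit-level KD compares the full distributions.
+
+```python
+import math
+
+
+def softmax(logits):
+    """Numerically stable softmax over a list of floats."""
+    peak = max(logits)
+    exps = [math.exp(x - peak) for x in logits]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+def sft_loss(student_logits, target_token):
+    """Cross-entropy against the one token the teacher actually generated."""
+    return -math.log(softmax(student_logits)[target_token])
+
+
+def kd_loss(student_logits, teacher_logits, temperature=1.0):
+    """KL(teacher || student) over the whole vocabulary (logit-level KD)."""
+    p = softmax([x / temperature for x in teacher_logits])
+    q = softmax([x / temperature for x in student_logits])
+    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
+
+
+# Toy values for a single position (hypothetical, for illustration only).
+teacher_logits = [2.0, 0.5, -1.0]
+student_logits = [1.0, 1.0, 0.0]
+teacher_token = 0  # index of the token the teacher emitted
+
+print(f"SFT loss: {sft_loss(student_logits, teacher_token):.4f}")
+print(f"KD loss:  {kd_loss(student_logits, teacher_logits):.4f}")
+```
+
+The SFT objective needs only the teacher's generated text, which is why it can be run from a dataset of samples without access to the teacher's logits.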
+
+For the full topology-aware distillation pipeline (BV decomposition, jump detection, curriculum ordering), see [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen). This model is the SFT-direct variant: simpler and faster to train. Its download count suggests that the Thinking teacher's extended chains transfer well through pure SFT.
+
+## Usage
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil"
+
+# Load weights in the checkpoint's native dtype and place them automatically.
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype="auto",
+    device_map="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+messages = [
+    {"role": "user", "content": "Explain why gradient descent can get stuck in saddle points but not local minima in high dimensions."}
+]
+
+# Render the chat template as text, then tokenize and move tensors to the model's device.
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+# Sampling settings follow the ranges listed under "Generation Tips".
+output = model.generate(
+    **inputs,
+    max_new_tokens=2048,
+    do_sample=True,
+    top_p=0.9,
+    temperature=0.7,
+    repetition_penalty=1.15,
+)
+
+# Decode only the newly generated tokens so the prompt is not echoed back.
+prompt_length = inputs["input_ids"].shape[-1]
+print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))
+```
+
+### Generation Tips
+
+- **Temperature 0.6–0.8** works best for reasoning tasks — low enough for coherence, high enough to activate the extended deliberation patterns from the Thinking teacher.
+- **Repetition penalty 1.1–1.2** prevents the model from getting caught in reasoning loops during long generations.
+- **Max tokens 1024–2048**: the model was trained with a maximum sequence length of 4,096 tokens, so it can sustain long generations. Give it room.
+- The model inherits the Thinking teacher's tendency to reason before answering. Let it.
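+
+Because the model reasons before answering, it can be handy to separate the deliberation from the final answer after decoding. The helper below is a minimal sketch that assumes the output uses Qwen3-style `<think>…</think>` delimiters and that those tags survive decoding; this card does not confirm either point, and `split_reasoning` is a hypothetical name.
+
+```python
+def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
+    """Split decoded output into (reasoning, answer) at the last closing tag.
+
+    Assumes Qwen3-style <think>...</think> delimiters. If the closing tag is
+    absent, the whole text is returned as the answer with empty reasoning.
+    """
+    reasoning, sep, answer = text.rpartition(close_tag)
+    if not sep:
+        return "", text.strip()
+    # Drop the opening tag (if present) and surrounding whitespace.
+    return reasoning.replace("<think>", "", 1).strip(), answer.strip()
+
+
+sample = "<think>Saddle points have zero gradient.</think>They slow descent."
+reasoning, answer = split_reasoning(sample)
+print("reasoning:", reasoning)
+print("answer:", answer)
+```
+
+If the tags never appear in practice, the helper falls back to treating the full output as the answer, so it is safe to leave in place.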
+
+## Distillation Position
+
+```
+Qwen3-30B-A3B-Thinking (teacher)
+  ↓ SFT on longwriter-6k (4096 max seq)
+Qwen3-1.7B-Thinking-Distil ← you are here
+```
+
+This model is the **direct SFT** path. The DistilQwen collection also includes models that go through additional refinement stages:
+
+```
+Qwen3-1.7B (base)
+  → Qwen3-1.7B-Distilled-30B-A3B (Instruct teacher KD)
+    → DiStil (uncensored SFT)
+      → Disctil (DISC refinement)
+        → TopologicalQwen (full TKD pipeline)
+```
+
+Different paths, different capabilities. This model prioritizes extended reasoning. TopologicalQwen prioritizes structural precision. The Coder variant prioritizes hierarchical decomposition. They're complementary.
+
+## DistilQwen Collection
+
+| Model | Downloads | What It Does |
+|-------|-----------|-------------|
+| **[Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil)** | **1,188** | **← this model. Thinking teacher SFT.** |
+| [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen) | 1,134 | Full TKD pipeline. BV decomposition + DualMind format. |
+| [DiStil-Qwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DiStil-Qwen3-1.7B-uncensored) | 1,030 | DISC-informed uncensored distillation. |
+| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 966 | Coder teacher. Hierarchical problem solving. |
+| [DistilQwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DistilQwen3-1.7B-uncensored) | 832 | Base uncensored variant. |
+
+Full collection: [DistilQwen on HuggingFace](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)
+
+## Methodology
+
+Full methodology paper: **[Structure Over Scale: Proof-Weighted Knowledge Distillation](https://doi.org/10.57967/hf/8165)** (DOI: 10.57967/hf/8165)
+
+Companion paper: **[Three Teachers to Dual Cognition](https://doi.org/10.57967/hf/8184)** (DOI: 10.57967/hf/8184) — covers the DualMind extension and ghost imprinting phenomenon.
+
+## License
+
+Apache 2.0 — same as the base Qwen3 model.
+
+
+## Mathematical Foundations: Discrepancy Calculus (DISC)
+
+This model's training pipeline is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division).
+
+**The Core Operator:**
+
+$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$
+
+For smooth $f$: $Df(x) = |f'(x)|$. For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
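+
+The smooth case can be checked numerically. The sketch below approximates the operator at a small but finite $\varepsilon$ with a midpoint rule, so it estimates the limit rather than evaluating it; the function name `discrepancy` and the default step sizes are assumptions made for illustration.
+
+```python
+import math
+
+
+def discrepancy(f, x, eps=1e-3, steps=1000):
+    """Midpoint-rule estimate of (1/eps) * int_x^{x+eps} |f(t)-f(x)|/|t-x| dt."""
+    h = eps / steps
+    fx = f(x)
+    total = 0.0
+    for i in range(steps):
+        t = x + (i + 0.5) * h  # midpoints never coincide with x
+        total += abs(f(t) - fx) / abs(t - x)
+    return total * h / eps
+
+
+# For f(t) = t^2 at x = 1, |f'(1)| = 2; for sin at x = 0, |cos(0)| = 1.
+square_estimate = discrepancy(lambda t: t * t, 1.0)
+sine_estimate = discrepancy(math.sin, 0.0)
+print(square_estimate, sine_estimate)
+```
+
+Both estimates approach $|f'(x)|$ as `eps` shrinks, which is the smooth-case identity stated above.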
+
+**The Mesh Fundamental Identity**: every BV function $f$ on an interval $I = [a, b]$, with jump set $J_f \subset I$, decomposes as:
+
+$$f(b) - f(a) = \underbrace{\int_a^b f'(x)\,dx}_{\text{smooth (AC)}} + \underbrace{\sum_{x \in J_f} \Delta f(x)}_{\text{jumps}} + \underbrace{D^c f(I)}_{\text{Cantor drift}}$$
+
+Standard knowledge distillation captures only term 1. Topological Knowledge Distillation (TKD) preserves all three by treating the teacher's output distribution as a BV function and computing discrepancy energy, jump sets, and gap energy density before training begins.
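+
+As a worked illustration (an example chosen here, not taken from the cited paper), take the interval $[0, 1]$ and $f(x) = x + \mathbf{1}_{\{x \ge 1/2\}} + c(x)$, where $c$ is the Cantor function with $c(0) = 0$ and $c(1) = 1$. Then $f'(x) = 1$ almost everywhere, the only jump is $\Delta f(1/2) = 1$, and the Cantor part carries the full rise of $c$:
+
+$$f(1) - f(0) = \underbrace{\int_0^1 1\,dx}_{=\,1} + \underbrace{\Delta f(1/2)}_{=\,1} + \underbrace{D^c f([0,1])}_{=\,1} = 3$$
+
+An objective that sees only the first term would account for one third of the total change in this function.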
+
+## Citation
+
+```bibtex
+@misc{colca2026distilqwen,
+  title={Structure Over Scale: Proof-Weighted Knowledge Distillation from Qwen3-30B to 1.7B},
+  author={Colca, Roy},
+  year={2026},
+  doi={10.57967/hf/8165},
+  publisher={Convergent Intelligence LLC: Research Division}
+}
+```
+
+---
+
+*Convergent Intelligence LLC: Research Division — 49 models, 22,598 downloads across the portfolio.*
+*[Full portfolio](https://huggingface.co/reaperdoesntknow) | [DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c) | [DualMind Collection](https://huggingface.co/collections/reaperdoesntknow/dualmind-69c93f888c6e79ecc69cf41e)*
+
+---
+
+## Convergent Intelligence Portfolio
+
+*Part of the [DistilQwen Series](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*
+
+### Related Models
+
+| Model | Downloads | Format |
+|-------|-----------|--------|
+| [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen) | 1,974 | BF16 |
+| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 1,903 | BF16 |
+| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 1,677 | BF16 |
+| [DiStil-Qwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DiStil-Qwen3-1.7B-uncensored) | 1,602 | BF16 |
+| [DistilQwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DistilQwen3-1.7B-uncensored) | 1,574 | BF16 |
+| [Qwen3-1.7B-Distilled-30B-A3B](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B) | 1,138 | BF16 |
+
+### Papers
+
+| Paper | DOI |
+|-------|-----|
+| [Structure Over Scale](https://huggingface.co/reaperdoesntknow/Structure-Over-Scale) | 10.57967/hf/8165 |
+| [Three Teachers to Dual Cognition](https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy) | 10.57967/hf/8184 |
+| [Discrepancy Calculus](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) | 10.57967/hf/8194 |
+
+---
+
+*Last updated: 2026-03-31 by Convergent Intelligence LLC: Research Division*
+<!-- cix-keeper-ts:2026-06-12T13:16:41Z -->