Initialize project; model provided by the ModelHub XC community

Model: reaperdoesntknow/Qwen3-1.7B-Thinking-Distil Source: Original Platform
2026-06-13 10:53:16 +08:00
commit cd18160901
8 changed files with 451 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,216 @@
 ---
 license: apache-2.0
 library_name: transformers
 pipeline_tag: text-generation
 tags:
 - qwen3
 - sft
 - trl
 - knowledge-distillation
 - thinking
 - longwriter
 - convergent-intelligence
 - convergentintel
 - edge
 - distillation
 base_model:
 - reaperdoesntknow/Disctil-Qwen3-1.7B
 datasets:
 - longwriter-6k
 - 0xZee/dataset-CoT-Differential-Equations-636
 - 0xZee/dataset-CoT-Linear-Algebra-667
 ---
 # Qwen3-1.7B-Thinking-Distil
 **Extended Reasoning Distillation from Qwen3-30B-A3B-Thinking → 1.7B**
 *Convergent Intelligence LLC: Research Division*
 ---
 ## What This Is
 Among the most downloaded models in the Convergent Intelligence portfolio. Qwen3-1.7B-Thinking-Distil captures extended deliberation patterns from the Qwen3-30B-A3B **Thinking** teacher (the variant that generates long-form reasoning chains before committing to an answer) and compresses them into a 1.7B student via supervised fine-tuning on the [longwriter-6k](https://huggingface.co/datasets/longwriter-6k) dataset.
 The Thinking teacher produces the **richest signal** of the three teacher variants in the DistilQwen family (Instruct, Thinking, Coder). Where Instruct distillation captures clean instruction-following and Coder captures hierarchical decomposition, Thinking distillation captures the extended internal monologue: the model reasoning through uncertainty, backtracking, and re-evaluating before arriving at a conclusion. That deliberative depth is the likely reason this variant ranks at or near the top of the collection by downloads.
 ## Architecture
 | Parameter | Value |
 |-----------|-------|
 | Architecture | Qwen3ForCausalLM |
 | Parameters | ~2.03B total (~1.72B excluding the untied output head) |
 | Hidden Size | 2048 |
 | Layers | 28 |
 | Attention Heads | 16 (Q) / 8 (KV) — GQA |
 | Intermediate | 6144 |
 | Head Dimension | 128 |
 | Context Length | 40,960 tokens (max position) |
 | Vocabulary | 151,936 |
 | Precision | BF16 |
 | Activation | SiLU |
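The parameter figure can be checked by hand from the values in the table. The sketch below assumes the usual Qwen3 decoder layout (bias-free projections, a per-head RMSNorm on queries and keys, gated MLP) and untied input and output embeddings, as `"tie_word_embeddings": false` in `config.json` indicates. The function name `count_parameters` is illustrative and not part of any library.

```python
# Architecture values copied from the table above (and config.json).
VOCAB = 151_936
HIDDEN = 2048
LAYERS = 28
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
INTERMEDIATE = 6144


def count_parameters(tied_embeddings: bool) -> int:
    """Count weights, assuming the usual Qwen3 decoder layout (no biases)."""
    embed = VOCAB * HIDDEN
    # Attention: Q and O use all 16 heads, K and V use the 8 shared KV heads.
    attn = 2 * HIDDEN * (Q_HEADS * HEAD_DIM) + 2 * HIDDEN * (KV_HEADS * HEAD_DIM)
    qk_norm = 2 * HEAD_DIM            # per-head RMSNorm on queries and keys
    mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up and down projections
    layer_norms = 2 * HIDDEN          # pre-attention and pre-MLP RMSNorm
    per_layer = attn + qk_norm + mlp + layer_norms
    lm_head = 0 if tied_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return embed + LAYERS * per_layer + final_norm + lm_head


total = count_parameters(tied_embeddings=False)
without_head = count_parameters(tied_embeddings=True)
print(f"{total:,} parameters, about {total * 2 / 1e9:.2f} GB in BF16")
print(f"{without_head:,} parameters without a separate output head")
```

At two bytes per BF16 weight, the untied total comes to 4,063,479,808 bytes, which is within a small header's distance of the 4,063,515,640-byte `model.safetensors` in this repository.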
 ## Training
 **Teacher:** Qwen3-30B-A3B-Thinking
 **Student:** Qwen3-1.7B
 **Dataset:** longwriter-6k — long-form generation samples that preserve extended reasoning chains
 **Method:** Supervised Fine-Tuning (SFT) via TRL
 | Parameter | Value |
 |-----------|-------|
 | Max Sequence Length | 4,096 |
 | Precision | BF16 |
 | Framework | TRL (SFTTrainer) |
 | Hardware | NVIDIA H100 |
 The training captures the teacher's extended thinking traces through direct SFT rather than logit-level KD. This is a deliberate design choice — the longwriter-6k dataset provides naturally long reasoning samples where the signal is in the structure of the generation (how the teacher approaches, reconsiders, and resolves), not just the final token probabilities.
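The difference between the two objectives can be illustrated on a toy example. This is a sketch of the general techniques only, not the training code used for this model: the function names and the four-token vocabulary are made up for illustration, and TRL's `SFTTrainer` applies the cross-entropy over whole sequences rather than one position.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(student_logits, teacher_token):
    """Cross-entropy against the single token the teacher actually wrote."""
    return -math.log(softmax(student_logits)[teacher_token])


def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student): needs the teacher's full distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy 4-token vocabulary at one position of a sequence.
student = [1.0, 0.5, -0.2, 0.1]
teacher = [2.0, 0.3, -1.0, 0.0]
teacher_token = 0  # the token present in the teacher-written sample

print("SFT loss:", round(sft_loss(student, teacher_token), 4))
print("KD loss: ", round(kd_loss(student, teacher), 4))
```

SFT only needs the teacher's generated text, so it can be run from a saved dataset. Logit-level KD needs the teacher's logits at every position, which means running or caching the 30B teacher during training.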
 For the full topology-aware distillation pipeline (BV decomposition, jump detection, curriculum ordering), see [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen). This model is the SFT-direct variant: simpler and faster to train. Its download count suggests that the Thinking teacher's extended chains transfer well through pure SFT.
 ## Usage
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Load the weights in their stored precision (BF16) and place them on the available device(s).
 model = AutoModelForCausalLM.from_pretrained(
     "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil",
     torch_dtype="auto",
     device_map="auto"
 )
 tokenizer = AutoTokenizer.from_pretrained(
     "reaperdoesntknow/Qwen3-1.7B-Thinking-Distil"
 )
 messages = [
     {"role": "user", "content": "Explain why gradient descent can get stuck in saddle points but not local minima in high dimensions."}
 ]
 # Render the conversation with the bundled chat template and open an assistant turn.
 text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 inputs = tokenizer(text, return_tensors="pt").to(model.device)
 # Sample with moderate temperature; the repetition penalty discourages reasoning loops.
 output = model.generate(
     **inputs,
     max_new_tokens=2048,
     do_sample=True,
     top_p=0.9,
     temperature=0.7,
     repetition_penalty=1.15
 )
 # output[0] holds the prompt tokens followed by the newly generated tokens.
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 ### Generation Tips
 - **Temperature 0.6–0.8** works best for reasoning tasks — low enough for coherence, high enough to activate the extended deliberation patterns from the Thinking teacher.
 - **Repetition penalty 1.1–1.2** prevents the model from getting caught in reasoning loops during long generations.
 - **Max tokens 1024–2048** gives the model room to reason: it was trained with a 4,096-token maximum sequence length, so it can sustain long generations.
 - The model inherits the Thinking teacher's tendency to reason before answering. Let it.
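Because the model reasons before answering, it is often useful to separate the reasoning from the final answer after decoding. The helper below is a minimal sketch that mirrors the `</think>` splitting logic in this repository's `chat_template.jinja`. The name `split_reasoning` is hypothetical, and the sketch assumes the `<think>` and `</think>` markers are still present in the decoded text.

```python
def split_reasoning(generated: str) -> tuple[str, str]:
    """Return (reasoning, answer) from decoded text that may contain a think block."""
    if "</think>" not in generated:
        # No reasoning block was emitted: treat everything as the answer.
        return "", generated
    # Same steps as the chat template: take the text before the first </think>,
    # keep what follows the last <think>, and trim surrounding newlines.
    reasoning = generated.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    answer = generated.split("</think>")[-1].lstrip("\n")
    return reasoning, answer


sample = "<think>\nSaddle points have flat directions.\n</think>\n\nGradients vanish there."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Since the reasoning is taken from after the last `<think>` marker, the helper also works on a full decode that still contains the prompt text.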
 ## Distillation Position
 ```
 Qwen3-30B-A3B-Thinking (teacher)
  ↓ SFT on longwriter-6k (4096 max seq)
 Qwen3-1.7B-Thinking-Distil ← you are here
 ```
 This model is the **direct SFT** path. The DistilQwen collection also includes models that go through additional refinement stages:
 ```
 Qwen3-1.7B (base)
  → Qwen3-1.7B-Distilled-30B-A3B (Instruct teacher KD)
    → DiStil (uncensored SFT)
      → Disctil (DISC refinement)
        → TopologicalQwen (full TKD pipeline)
 ```
 Different paths, different capabilities. This model prioritizes extended reasoning. TopologicalQwen prioritizes structural precision. The Coder variant prioritizes hierarchical decomposition. They're complementary.
 ## DistilQwen Collection
 | Model | Downloads | What It Does |
 |-------|-----------|-------------|
 | **[Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil)** | **1,188** | **← this model. Thinking teacher SFT.** |
 | [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen) | 1,134 | Full TKD pipeline. BV decomposition + DualMind format. |
 | [DiStil-Qwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DiStil-Qwen3-1.7B-uncensored) | 1,030 | DISC-informed uncensored distillation. |
 | [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 966 | Coder teacher. Hierarchical problem solving. |
 | [DistilQwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DistilQwen3-1.7B-uncensored) | 832 | Base uncensored variant. |
 Full collection: [DistilQwen on HuggingFace](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)
 ## Methodology
 Full methodology paper: **[Structure Over Scale: Proof-Weighted Knowledge Distillation](https://doi.org/10.57967/hf/8165)** (DOI: 10.57967/hf/8165)
 Companion paper: **[Three Teachers to Dual Cognition](https://doi.org/10.57967/hf/8184)** (DOI: 10.57967/hf/8184) — covers the DualMind extension and ghost imprinting phenomenon.
 ## License
 Apache 2.0 — same as the base Qwen3 model.
 ## Mathematical Foundations: Discrepancy Calculus (DISC)
 This model's training pipeline is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division).
 **The Core Operator:**
 $$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$
 For smooth $f$: $Df(x) = |f'(x)|$. For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
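The smooth case can be checked numerically. The sketch below estimates the operator with a midpoint rule at a small, fixed $\varepsilon$ instead of taking the limit. The function names and test functions are illustrative choices, not part of the paper or of any library.

```python
def discrepancy(f, x, eps, n=10_000):
    """Midpoint-rule estimate of (1/eps) * integral over [x, x+eps] of |f(t)-f(x)| / |t-x|."""
    fx = f(x)
    h = eps / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h  # midpoints never coincide with t = x
        total += abs(f(t) - fx) / (t - x)
    return total * h / eps


def square(t):
    """Smooth test function with derivative 2t."""
    return t * t


def step(t):
    """Unit jump at t = 0."""
    return 1.0 if t > 0 else 0.0


for eps in (1e-1, 1e-2, 1e-3):
    print(eps, discrepancy(square, 1.5, eps), discrepancy(step, 0.0, eps))
```

For `square` at $x = 1.5$ the estimate approaches $|f'(1.5)| = 3$ as $\varepsilon$ shrinks, while for `step` at the jump it keeps growing, which is how the operator flags the singularity.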
 **The Mesh Fundamental Identity:** every BV function $f$ on an interval $I = [a, b]$, with jump set $J_f$, decomposes as:
 $$f(b) - f(a) = \underbrace{\int_a^b f'(x)\,dx}_{\text{smooth (AC)}} + \underbrace{\sum_{x \in J_f} \Delta f(x)}_{\text{jumps}} + \underbrace{D^c f(I)}_{\text{Cantor drift}}$$
 Standard knowledge distillation captures only term 1. Topological Knowledge Distillation (TKD) preserves all three by treating the teacher's output distribution as a BV function and computing discrepancy energy, jump sets, and gap energy density before training begins.
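A toy function makes the decomposition concrete. The sketch below uses an assumed example, $f(x) = x^2$ plus a unit jump at $x = 0.5$ on $[0, 1]$, which has a smooth part and a jump part but no Cantor part. It illustrates the identity only and is not part of the TKD pipeline.

```python
def f(x):
    """Smooth part x**2 plus a unit jump at x = 0.5; no Cantor part."""
    return x * x + (1.0 if x >= 0.5 else 0.0)


def f_prime(x):
    """Derivative of the absolutely continuous part, valid away from the jump."""
    return 2.0 * x


a, b, n = 0.0, 1.0, 1_000
h = (b - a) / n
# Term 1: integral of f' over [a, b], estimated with the midpoint rule.
smooth = sum(f_prime(a + (i + 0.5) * h) for i in range(n)) * h
# Term 2: sum of jump sizes over the jump set {0.5}.
jumps = 1.0
# Term 3: singular continuous (Cantor) part, zero for this function.
cantor = 0.0

print("f(b) - f(a)        =", f(b) - f(a))
print("smooth + jumps + c =", smooth + jumps + cantor)
```

An objective that tracks only the derivative term would account for half of the total change in this example; the other half sits entirely in the jump.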
 ## Citation
 ```bibtex
@misc{colca2026distilqwen,
  title={Structure Over Scale: Proof-Weighted Knowledge Distillation from Qwen3-30B to 1.7B},
  author={Colca, Roy},
  year={2026},
  doi={10.57967/hf/8165},
  publisher={Convergent Intelligence LLC: Research Division}
 }
 ```
 ---
 *Convergent Intelligence LLC: Research Division — 49 models, 22,598 downloads across the portfolio.*
 *[Full portfolio](https://huggingface.co/reaperdoesntknow) | [DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c) | [DualMind Collection](https://huggingface.co/collections/reaperdoesntknow/dualmind-69c93f888c6e79ecc69cf41e)*
 ---
 ## Convergent Intelligence Portfolio
 *Part of the [DistilQwen Series](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*
 ### Related Models
 | Model | Downloads | Format |
 |-------|-----------|--------|
 | [TopologicalQwen](https://huggingface.co/reaperdoesntknow/TopologicalQwen) | 1,974 | BF16 |
 | [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 1,903 | BF16 |
 | [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 1,677 | BF16 |
 | [DiStil-Qwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DiStil-Qwen3-1.7B-uncensored) | 1,602 | BF16 |
 | [DistilQwen3-1.7B-uncensored](https://huggingface.co/reaperdoesntknow/DistilQwen3-1.7B-uncensored) | 1,574 | BF16 |
 | [Qwen3-1.7B-Distilled-30B-A3B](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Distilled-30B-A3B) | 1,138 | BF16 |
 ### Papers
 | Paper | DOI |
 |-------|-----|
 | [Structure Over Scale](https://huggingface.co/reaperdoesntknow/Structure-Over-Scale) | 10.57967/hf/8165 |
 | [Three Teachers to Dual Cognition](https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy) | 10.57967/hf/8184 |
 | [Discrepancy Calculus](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) | 10.57967/hf/8194 |
 ---
 *Last updated: 2026-03-31 by Convergent Intelligence LLC: Research Division*
 <!-- cix-keeper-ts:2026-06-12T13:16:41Z -->
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,89 @@
 {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
 {%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
 {%- endif %}
 {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
 {%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
 {%- endfor %}
 {%- for message in messages %}
    {%- if message.content is string %}
        {%- set content = message.content %}
    {%- else %}
        {%- set content = '' %}
    {%- endif %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
 {%- endfor %}
 {%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
 {%- endif %}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,63 @@
 {
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": null,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 6144,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 40960,
  "max_window_layers": 28,
  "model_type": "qwen3",
  "num_attention_heads": 16,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pad_token_id": 151643,
  "rms_norm_eps": 1e-06,
  "rope_parameters": {
    "rope_theta": 1000000,
    "rope_type": "default"
  },
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "5.0.0",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 151936
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,12 @@
 {
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "temperature": 0.6,
  "top_k": 20,
  "top_p": 0.95,
  "transformers_version": "5.0.0"
 }
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:13f028a8b0dce70cbbc42b182b374235f38b395893e827dbdb11bfbb0a1cc052
 size 4063515640
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
 size 11422650
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,29 @@
 {
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "is_local": false,
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
 }