Initialize project; model provided by the ModelHub XC community

Model: reaperdoesntknow/Qemma-Q1.7B Source: Original Platform
2026-05-23 17:38:18 +08:00
commit eecfc85986
11 changed files with 51779 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,155 @@
+---
+library_name: transformers
+tags:
+- trl
+- sft
+- gemma
+- qwen
+- merge
+- disc
+- convergentintel
+license: osl-3.0
+datasets:
+- HuggingFaceH4/ultrachat_200k
+- TIGER-Lab/MathInstruct
+language:
+- en
+base_model:
+- Qwen/Qwen3-1.7B
+- google/gemma-3-1b-it
+pipeline_tag: text-generation
+---
+# Model Card for Qemma-Q1.7B
+## Gap Envelope Integral
+* My mathematical formulation, which uses space projections to "measure" the jump between points of discontinuity in non-differentiable functions.
+## Redux
+* This model underwent an additional merge between Qemma-redux and Qwen3-1.7B, and RoPE scaling was added.
+### Additionally
+* Fusion logic was updated to support per-layer fusion and post-fusion embedding alignment.
+* **Qemma** is a HuggingFace-native hybrid model that merges **Gemma-3 (1B)** and **Qwen-3 (1.7B)** at the weight level (no adapters).
+* Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma’s hidden size. The model is then SFT-tuned for stepwise reasoning.
+* This variant uses YaRN-based RoPE scaling with a 1:* ratio from max_position_embeddings = 242144.
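+
+As a rough illustration of the scaling bullet above, the sketch below assembles a YaRN-style `rope_scaling` dictionary in the shape Transformers configs commonly use. The helper name `yarn_rope_scaling` and the original context length of 32768 are assumptions made for illustration; the actual ratio for this model is not stated in this card.
+
+```python
+def yarn_rope_scaling(target_positions: int, original_positions: int) -> dict:
+    """Build a YaRN-style rope_scaling dict (hypothetical helper, not part of this repo)."""
+    if original_positions <= 0 or target_positions < original_positions:
+        raise ValueError("target_positions must be >= original_positions > 0")
+    return {
+        "rope_type": "yarn",
+        # The factor is the ratio of the extended context to the original one.
+        "factor": target_positions / original_positions,
+        "original_max_position_embeddings": original_positions,
+    }
+
+
+# 242144 is the max_position_embeddings value quoted in this card;
+# 32768 is an assumed pre-scaling context length, not a documented one.
+rope_scaling = yarn_rope_scaling(target_positions=242144, original_positions=32768)
+print(rope_scaling)
+```
+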
+## Quick start
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "reaperdoesntknow/Qemma-Q1.7B"
+tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()
+
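+# Prompt in the tag format used by this model card; the trailing
+# "<think><reasoning_step>" primes the model to begin stepwise reasoning.
+text = (
+    "<|user|>"
+    "What makes the sky blue?"
+    "<|assistant|>"
+    "<think><reasoning_step>"
+)
+
+# A single prompt needs no padding; padding it to a fixed length only adds
+# pad tokens that can degrade generation.
+inputs = tokenizer(text, return_tensors="pt")
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+# The model is already in eval mode; no_grad avoids tracking gradients.
+with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, min_new_tokens=32)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))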
+```
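+
+The prompt layout above can also be wrapped in a small helper so the tags are written once. This is a minimal sketch: `build_prompt` is a hypothetical name, and the tag strings are copied from the quick start rather than taken from `chat_template.jinja`.
+
+```python
+USER_TAG = "<|user|>"
+ASSISTANT_TAG = "<|assistant|>"
+REASONING_PREFIX = "<think><reasoning_step>"
+
+
+def build_prompt(question: str, start_reasoning: bool = True) -> str:
+    """Format one user turn in the tag layout shown in the quick start."""
+    prompt = f"{USER_TAG}{question.strip()}{ASSISTANT_TAG}"
+    if start_reasoning:
+        # Prime the model to open its answer with a reasoning step.
+        prompt += REASONING_PREFIX
+    return prompt
+
+
+print(build_prompt("What makes the sky blue?"))
+```
+
+The returned string can be passed to the tokenizer in place of `text`.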
+
+## What’s inside
+
+* **Architecture:**
+  * **Gemma-3 backbone** (26 layers, hidden 1152, MLP 6912)
+  * **Qwen-style attention** regrouped to Gemma’s 4×256 heads (head_dim=128, hidden_size=2048, intermediate_size=6144, num_attention_heads=16, KV heads=8, num_hidden_layers=28)
+* **Tokenizer:** Gemma-3 tokenizer and chat template (see `chat_template.jinja`).
+* **Training:** SFT for instruction following and stepwise reasoning.
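+
+The dimensions listed above can be checked with a little arithmetic. The sketch below only does bookkeeping on the numbers in this card; the `AttentionShape` class is a hypothetical helper, and it does not reproduce the actual fusion or projection code.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class AttentionShape:
+    """Head layout of one attention block (illustrative only)."""
+    num_heads: int
+    head_dim: int
+
+    @property
+    def width(self) -> int:
+        # Total width of the concatenated heads.
+        return self.num_heads * self.head_dim
+
+
+gemma_heads = AttentionShape(num_heads=4, head_dim=256)   # "4×256" from the card
+qwen_heads = AttentionShape(num_heads=16, head_dim=128)   # 16 attention heads, head_dim=128
+qwen_kv = AttentionShape(num_heads=8, head_dim=128)       # KV heads=8
+
+print("Gemma query width:", gemma_heads.width)
+print("Qwen query width:", qwen_heads.width)
+print("Qwen KV width:", qwen_kv.width)
+print("Layer count difference:", 28 - 26)
+```
+
+Under these figures the two attention layouts differ in width and depth, so they do not line up without the regrouping and projection the card describes.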
+
+## Intended use & limitations
+
+**Use:** research, instruction following, code/help, analysis, further SFT/RLHF.
+**Limits:** may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses.
+
+## Training procedure
+
+* ~512 warm-start steps (HuggingFaceH4/ultrachat_200k).
+* A small post-fusion training round (8 steps) to encourage embedding realignment.
+* ~256 SFT steps on TIGER-Lab/MathInstruct + HuggingFaceH4/ultrachat_200k.
+
+
+### Framework versions
+
+* TRL: 0.25.0
+* Transformers: 4.57.1
+* PyTorch: 2.8.0+cpu
+* Datasets: 4.4.1
+* Tokenizers: 0.22.1
+
+## Discrepancy Calculus Foundation
+
+This model is part of the [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow) portfolio. All models in this portfolio are developed under the Discrepancy Calculus (DISC) framework — a measure-theoretic approach to understanding and controlling the gap between what a model *should* produce and what it *actually* produces.
+
+DISC treats training singularities (loss plateaus, mode collapse, catastrophic forgetting) not as failures to be smoothed over, but as **structural signals** that reveal the geometry of the learning problem. Key concepts:
+
+- **Discrepancy Operator (D):** Measures the gap between expected and observed behavior at each training step
+- **Jump Sets:** Boundaries where model behavior changes discontinuously — these are *features*, not bugs
+- **Ghost Imprinting:** Teacher knowledge that transfers to student models through weight-space topology rather than explicit distillation signal
+
+For the full mathematical treatment, see [Discrepancy Calculus: Foundations and Core Theory](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) (DOI: 10.57967/hf/8194).
+
+**Citation chain:** [Structure Over Scale](https://huggingface.co/reaperdoesntknow/Structure-Over-Scale) (DOI: 10.57967/hf/8165) → [Three Teachers to Dual Cognition](https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy) (DOI: 10.57967/hf/8184) → [Discrepancy Calculus](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) (DOI: 10.57967/hf/8194)
+
+## Citations
+
+Cite TRL as:
+
+```bibtex
+@misc{vonwerra2022trl,
+	title        = {{TRL: Transformer Reinforcement Learning}},
+	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+	year         = 2020,
+	journal      = {GitHub repository},
+	publisher    = {GitHub},
+	howpublished = {\url{https://github.com/huggingface/trl}}
+}
+```
+
+---
+
+## Convergent Intelligence Portfolio
+
+*By [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*
+
+
+### Top Models from Our Lab
+
+| Model | Downloads |
+|-------|-----------|
+| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
+| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
+| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
+| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
+| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |
+
+**Total Portfolio: 41 models | 2,781 total downloads**
+
+
+*Last updated: 2026-03-28 12:57 UTC*
+
+<!-- CIX-CROSSLINK-START -->
+
+---
+
+## From the Convergent Intelligence Portfolio
+
+**[DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)** — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.
+
+Top model: [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) — 508 downloads
+
+Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)
+
+*Convergent Intelligence LLC: Research Division*
+
+<!-- CIX-CROSSLINK-END -->
+<!-- cix-keeper-ts:2026-05-22T13:15:55Z -->