
ModelHub XC 13355d0f5a Initialize project; model provided by the ModelHub XC community

Model: reaperdoesntknow/Qemma-sft
Source: Original Platform

2026-05-29 15:24:16 +08:00


Model Card for Qemma

Qemma is a Hugging Face-native hybrid model that merges Gemma-3 (1B) and Qwen-3 (0.6B) at the weight level, without adapters. The design combines the Gemma MLP and body with the Qwen attention and head, projected and aligned to Gemma’s hidden size. The merged model is then SFT-tuned for stepwise reasoning.

Quick start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the merged model in bfloat16, in inference mode.
model_id = "reaperdoesntknow/Qemma-sft"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Build the prompt with the chat template; return_dict=True also returns the attention mask.
messages = [{"role": "user", "content": "Explain finite-scale discrepancy Δ_r in one paragraph."}]
inputs = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)

# Sample up to 256 new tokens; the decoded text includes the prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))

What’s inside

Architecture: Gemma-3 backbone (26 layers, hidden 1152, MLP 6912) with Qwen-style attention regrouped to Gemma’s 4×256 heads.
Tokenizer: Gemma-3 tokenizer and chat template (see chat_template.jinja).
Training: SFT for instruction following and stepwise reasoning.
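The sizes listed above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in this card; the class and field names are hypothetical and are not read from the model's config.json. It shows that the concatenated attention heads are narrower than the hidden size, which suggests (an inference, not a statement from the authors) that the attention projections map between the two widths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QemmaDims:
    """Hypothetical container for the sizes quoted in this card."""
    num_layers: int = 26
    hidden_size: int = 1152
    mlp_size: int = 6912
    num_heads: int = 4
    head_dim: int = 256

    @property
    def attention_width(self) -> int:
        # Total width of the concatenated attention heads.
        return self.num_heads * self.head_dim

    @property
    def mlp_expansion(self) -> float:
        # Ratio of the MLP inner size to the hidden size.
        return self.mlp_size / self.hidden_size


dims = QemmaDims()
print(dims.attention_width)  # 1024, narrower than the 1152 hidden size
print(dims.mlp_expansion)    # 6.0
```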

Intended use & limitations

Use: research, instruction following, code/help, analysis, further SFT/RLHF.
Limits: may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses.

Training procedure

~512 warm-start steps (Alpaca-style data)
256 additional pretraining steps on O1-OPEN/OpenO1-SFT
128 SFT steps on Jackrong/gpt-oss-120b-reasoning-STEM-5K
256 SFT steps on O1-OPEN/OpenO1-SFT
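To see how the training budget is split, the four stages above can be tabulated. This is a minimal sketch: the step counts and dataset names are copied from the list, the first count is treated as exactly 512 even though the card marks it as approximate, and the variable names are illustrative only.

```python
# Stage list copied from the card; the first count is approximate ("~512").
stages = [
    ("warm-start", "Alpaca-style data", 512),
    ("additional pretraining", "O1-OPEN/OpenO1-SFT", 256),
    ("SFT", "Jackrong/gpt-oss-120b-reasoning-STEM-5K", 128),
    ("SFT", "O1-OPEN/OpenO1-SFT", 256),
]

total_steps = sum(steps for _, _, steps in stages)

# Print each stage with its running total and its share of all steps.
cumulative = 0
for phase, dataset, steps in stages:
    cumulative += steps
    share = steps / total_steps
    print(f"{phase:<24} {dataset:<42} {steps:>4}  ends at {cumulative:>4}  ({share:.0%})")

print("total:", total_steps)  # total: 1152
```

Under these assumptions the whole schedule is roughly 1,152 optimizer steps, with the warm-start stage accounting for a little under half of them.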

Framework versions

TRL: 0.25.0
Transformers: 4.57.1
PyTorch: 2.8.0+cpu
Datasets: 4.4.1
Tokenizers: 0.22.1
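To compare a local environment against the versions listed above, something like the following may help. It is a sketch using only the standard library; the package names are the usual PyPI distribution names (an assumption, since the card lists display names), and `check_versions` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this card, keyed by PyPI distribution name (assumed).
# The "+cpu" local tag on PyTorch is dropped for the comparison.
EXPECTED = {
    "trl": "0.25.0",
    "transformers": "4.57.1",
    "torch": "2.8.0",
    "datasets": "4.4.1",
    "tokenizers": "0.22.1",
}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Ignore local build tags such as "+cpu" when comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (have, want, matches)
    return report


for name, (have, want, ok) in check_versions().items():
    print(f"{name:<13} installed={have} expected={want} {'OK' if ok else 'DIFFERS'}")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are simply those recorded at training time.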

Discrepancy Calculus Foundation

This model is part of the Convergent Intelligence LLC: Research Division portfolio. All models in this portfolio are developed under the Discrepancy Calculus (DISC) framework — a measure-theoretic approach to understanding and controlling the gap between what a model should produce and what it actually produces.

DISC treats training singularities (loss plateaus, mode collapse, catastrophic forgetting) not as failures to be smoothed over, but as structural signals that reveal the geometry of the learning problem. Key concepts:

Discrepancy Operator (D): Measures the gap between expected and observed behavior at each training step
Jump Sets: Boundaries where model behavior changes discontinuously — these are features, not bugs
Ghost Imprinting: Teacher knowledge that transfers to student models through weight-space topology rather than explicit distillation signal

For the full mathematical treatment, see Discrepancy Calculus: Foundations and Core Theory (DOI: 10.57967/hf/8194).

Citation chain: Structure Over Scale (DOI: 10.57967/hf/8165) → Three Teachers to Dual Cognition (DOI: 10.57967/hf/8184) → Discrepancy Calculus (DOI: 10.57967/hf/8194)

Citations

Cite TRL as:

@misc{vonwerra2022trl,
	title        = {{TRL: Transformer Reinforcement Learning}},
	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
	year         = 2020,
	journal      = {GitHub repository},
	publisher    = {GitHub},
	howpublished = {\url{https://github.com/huggingface/trl}}
}

Convergent Intelligence Portfolio

By Convergent Intelligence LLC: Research Division

Top Models from Our Lab

Model	Downloads
Qwen3-1.7B-Thinking-Distil	501
LFM2.5-1.2B-Distilled-SFT	342
Qwen3-1.7B-Coder-Distilled-SFT	302
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF	203
Qwen3-1.7B-Coder-Distilled-SFT-GGUF	194

Total Portfolio: 41 models | 2,781 total downloads

Last updated: 2026-03-28 12:57 UTC

From the Convergent Intelligence Portfolio

DistilQwen Collection — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.

Top model: Qwen3-1.7B-Coder-Distilled-SFT — 508 downloads

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)

Convergent Intelligence LLC: Research Division
