Homunculus is a 12 billion-parameter instruction model distilled from Qwen3-235B onto the Mistral-Nemo backbone.
It was purpose-built to preserve Qwen’s two-mode interaction style—/think (deliberate chain-of-thought) and /nothink (concise answers)—while running on a single consumer GPU.
## ✨ What’s special?

| Feature | Detail |
|---|---|
| Reasoning-trace transfer | Instead of copying just the final probabilities, we align full logit trajectories, yielding more faithful reasoning. |
| Total-Variation-Distance loss | Matches the teacher’s confidence distribution more closely and smooths the loss landscape (a minimal sketch follows this table). |
| Tokenizer replacement | The original Mistral tokenizer was swapped for Qwen3's tokenizer. |
| Dual interaction modes | Use `/think` when you want transparent step-by-step reasoning (good for analysis and debugging). Use `/nothink` for terse, production-ready answers. The mode switch is most reliable when placed in the system role field. |
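The distillation objective is not published in detail here, so the following is a minimal, hypothetical sketch of what a total-variation-distance loss over a full logit trajectory could look like: teacher and student next-token distributions are compared at every position of the reasoning trace, not only on the answer tokens. The function name and masking scheme are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

def tvd_trajectory_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical TVD distillation loss (not the real Homunculus training code).

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    mask: (batch, seq_len), 1.0 where a position counts toward the loss
    """
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # Total variation distance per position: 0.5 * sum_v |p_v - q_v|
    tvd = 0.5 * (p_student - p_teacher).abs().sum(dim=-1)
    # Average over every unmasked position so the whole reasoning trace,
    # not just the final answer tokens, shapes the student.
    return (tvd * mask).sum() / mask.sum().clamp_min(1.0)
```

Compared with a KL-divergence objective, TVD is bounded and penalizes probability-mass differences linearly, which is one plausible reading of the “smooth the loss landscape” claim above.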
## Benchmark results

| Benchmark | Score |
|---|---|
| GPQA Diamond (average of 3) | 57.1% |
| MMLU | 67.5% |
## 🔧 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Homunculus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# /think mode - Chain-of-thought reasoning
messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "Why is the sky blue?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# /nothink mode - Direct answers
messages = [
    {"role": "system", "content": "You are a helpful assistant. /nothink"},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=128, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## 💡 Intended Use & Limitations

Homunculus is designed for:

- Research on reasoning-trace distillation, logit imitation, and mode-switchable assistants.
- Lightweight production deployments that need strong reasoning at <12 GB VRAM (see the quantized-loading sketch after this list).
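The 12 B parameters come to roughly 24 GB in bf16, so staying under the 12 GB figure implies a quantized load. A minimal sketch, assuming the checkpoint works with standard bitsandbytes 4-bit quantization (untested here; the config values are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "arcee-ai/Homunculus"

# Assumed setup: NF4 4-bit weights with bf16 compute, loaded via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

In 4-bit NF4 the weights occupy roughly 6-7 GB, leaving headroom for the KV cache on a 12 GB card.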
### Known limitations

- May inherit biases from the Qwen3 teacher and internet-scale pretraining data.
- Long-context (>32k tokens) use is experimental; expect latency and memory overhead.