| license | base_model | tags | datasets | language | pipeline_tag | library_name |
|---|---|---|---|---|---|---|
| apache-2.0 | Qwen/Qwen3-0.6B | | | | text-generation | transformers |
Atem-0.6B
Ancient logic. Modern intelligence.
A 0.6B reasoning model trained via multi-source knowledge distillation from frontier teacher models.
Overview
Atem-0.6B is a 0.6B-parameter reasoning model built via supervised fine-tuning on a curated corpus of approximately 120,000 examples distilled from multiple frontier teacher models. Starting from Qwen/Qwen3-0.6B, Atem was trained using LoRA to preserve base model capabilities while shifting output style toward clean, directly formatted final answers.
This is Stage 1 of a planned multi-stage training series, and the first entry in the Atem family built on Qwen3 rather than Qwen2.5. Stage 1 strips <think> reasoning traces from all training data, deliberately suppressing Qwen3's native exposed chain-of-thought in favor of direct answers. Stage 2 (Atem-Savant-0.6B) is currently in progress, layering curated chain-of-thought traces back on top of this foundation — see Known Limitations for why that stage matters.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-0.6B |
| Training method | LoRA Supervised Fine-Tuning (Stage 1) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~596M |
| Trainable (LoRA) params | 20,185,088 (3.28% of all parameters, base plus adapters) |
| Training records | 120,017 |
| Epochs | 2 |
| Effective batch size | 128 (batch 32 × grad accum 4) |
| Learning rate | 2e-4, cosine schedule, 5% warmup |
| Final train loss | 1.055 |
| Final val loss | 1.073 |
| Hardware | NVIDIA A100-SXM4 80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
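The trainable-parameter figure in the table can be reproduced from the LoRA rank and the shapes of the targeted projections. The sketch below is only a sanity check: the layer dimensions are assumptions taken from the public Qwen3-0.6B configuration rather than from this card, and the `_proj` module names are the usual Qwen naming, also assumed.

```python
# Assumed Qwen3-0.6B dimensions (from the public model config, not from this card).
HIDDEN = 1024
INTERMEDIATE = 3072
NUM_LAYERS = 28
Q_OUT = 16 * 128   # 16 attention heads x head_dim 128
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128

RANK = 32  # lora_r from the table above

# (in_features, out_features) for each targeted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS

BASE_PARAMS = 596_000_000  # "~596M" from the table; approximate
share_of_total = 100 * trainable / (BASE_PARAMS + trainable)
share_of_base = 100 * trainable / BASE_PARAMS

print(trainable)
print(round(share_of_total, 2), round(share_of_base, 2))
```

Under these assumptions the count comes out to 20,185,088, matching the table. The 3.28% share is reproduced when the adapters are measured against base plus adapter parameters; against the base alone the share is closer to 3.39%.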
Intended Use
Atem-0.6B is designed for lightweight, open-ended reasoning tasks where structured, direct answers add value at low compute cost:
- Code explanation, implementation, and debugging
- Mathematical problem solving with working shown
- Analytical reasoning and hypothesis evaluation
- Concept explanation and comparative analysis
- Logic, argument, and fallacy identification
Atem-0.6B is not designed for retrieval-heavy factual lookup, real-time information, or tasks requiring broad knowledge breadth beyond its training domains. At 0.6B parameters its capability ceiling is naturally lower than larger Atem models — expect it to be most useful where speed and footprint matter more than depth on hard, multi-step problems.
Training Data
Atem-0.6B was trained on a corpus assembled from eleven sources, combining domain-specific generated datasets and publicly available distillation datasets from frontier models. All outputs containing <think> reasoning traces were stripped to clean final responses for Stage 1 training.
| Dataset | Records | Source / Teacher |
|---|---|---|
| EphAsad/QWENMillenium-SF | ~ | Qwen2.5-14B — Analytical & Scientific |
| EphAsad/Phi4Millennium-SF | ~ | Phi-4 14B — Mathematical Reasoning |
| EphAsad/MistralMillenium-SF | ~ | Mistral-Nemo-12B — Language & Comprehension |
| Modotte/CodeX-2M-Thinking | 40,000 | Mixed — Coding |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | 23,000 | Kimi K2.5 — General Distillation (English filtered) |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| tuanha1305/DeepSeek-R1-Distill | 9,000 | DeepSeek-R1 |
| open-r1/OpenThoughts-114k-math | 10,000 | Mixed — Mathematics (correct answers only) |
| flytech/python-codes-25k | 10,000 | Python coding |
| FreedomIntelligence/medical-o1-reasoning-SFT | 10,000 | Medical reasoning (English config) |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 5,000 | Claude Opus 4.7 — Trace Inversion |
| Total | 120,017 | |
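The card does not publish its preprocessing code, so the following is only a minimal sketch of how `<think>` traces could be stripped to leave clean final responses. The function name `strip_think` and the regular expression are hypothetical.

```python
import re

# Matches a <think>...</think> span, including multi-line traces and trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning traces and return only the final response text."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\n17 is odd; try 3, 5...\n</think>\n\n17 is prime."
print(strip_think(sample))
```

A real pipeline would probably also need to handle unclosed or nested tags, which this sketch ignores.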
Training Configuration
# Key hyperparameters
lora_r = 32
lora_alpha = 64
lora_dropout = 0.05
max_seq_length = 4096
learning_rate = 2e-4
lr_scheduler = 'cosine'
warmup_ratio = 0.05
batch_size = 32
grad_accumulation = 4 # effective batch size: 128
num_epochs = 2
dtype = bfloat16
load_in_4bit = True # during training
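As a rough check on how these hyperparameters interact, the sketch below derives the optimizer step count and a learning-rate curve. It assumes that all 120,017 records are in the training split, that a final partial batch counts as one step, and that the schedule is linear warmup followed by cosine decay to zero; none of these details is confirmed by the card.

```python
import math

records = 120_017          # training records, per the card
effective_batch = 32 * 4   # batch_size x grad_accumulation = 128
epochs = 2
peak_lr = 2e-4
warmup_ratio = 0.05

# Assumption: a final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(records / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With these assumptions the total is 1,876 steps, which agrees with the step count reported in the loss curve.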
Training used Unsloth with train_on_responses_only masking, ensuring loss was computed exclusively on assistant response tokens. Because Qwen3 ships with no default system prompt (unlike Qwen2.5-Instruct), Atem's identity is baked in via a chat-template modification that injects Atem as the default persona only when no explicit system message is supplied; explicit system messages still take priority. Before training began, a validation suite verified this injection, confirmed that the response-masking boundary correctly accounts for Qwen3's automatic empty <think></think> scaffold insertion, and checked for leaked reasoning content.
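To illustrate what response-only masking means in practice, here is a toy sketch that works on word-level tokens rather than real tokenizer output. It is not Unsloth's implementation: the function `mask_non_assistant` is hypothetical, and keeping the closing `<|im_end|>` marker in the loss is a choice made for this example.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch/Transformers cross-entropy


def mask_non_assistant(tokens: list[str], token_ids: list[int]) -> list[int]:
    """Return labels that keep only assistant-turn tokens (toy, word-level tokens)."""
    labels = []
    in_assistant = False
    for i, (tok, tid) in enumerate(zip(tokens, token_ids)):
        if tok == "<|im_start|>":
            # A new turn begins; the start marker itself is never trained on.
            in_assistant = False
            labels.append(IGNORE_INDEX)
        elif i > 0 and tokens[i - 1] == "<|im_start|>":
            # The role name follows the start marker and stays masked.
            in_assistant = tok == "assistant"
            labels.append(IGNORE_INDEX)
        elif in_assistant:
            labels.append(tid)  # includes the closing marker so the model learns to stop
            if tok == "<|im_end|>":
                in_assistant = False
        else:
            labels.append(IGNORE_INDEX)
    return labels


tokens = ["<|im_start|>", "user", "Is", "7", "prime?", "<|im_end|>",
          "<|im_start|>", "assistant", "Yes.", "<|im_end|>"]
token_ids = list(range(len(tokens)))  # placeholder ids for illustration
labels = mask_non_assistant(tokens, token_ids)
print(labels)
```

Only the last two positions carry real labels here; everything in the user turn and in both turn headers is ignored by the loss.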
After training, LoRA adapters were merged into the base weights and exported as a full merged model.
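Merging folds each adapter back into its base weight so that no extra modules are needed at inference time. The sketch below shows the arithmetic on a toy matrix in plain Python, assuming standard LoRA scaling (alpha divided by rank) rather than a rank-stabilised variant; the helper names are hypothetical, and only the alpha and rank values come from the card.

```python
LORA_ALPHA = 64
LORA_R = 32
SCALING = LORA_ALPHA / LORA_R  # 2.0 with the card's settings


def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scaling):
    """Return W + scaling * (B @ A), the standard LoRA merge."""
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 base weight with rank-1 adapter factors (the real adapters use rank 32).
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[0.5, -0.5]]      # shape (rank, in_features)
B = [[1.0], [2.0]]     # shape (out_features, rank)

merged = merge_lora(W, A, B, SCALING)
print(merged)
```

After the merge, the weight matrix has the same shape as the original, which is why the exported model loads like any ordinary checkpoint.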
Loss curve:
| Step | Train Loss | Val Loss |
|---|---|---|
| 200 | 1.166 | 1.163 |
| 800 | 1.108 | 1.096 |
| 1400 | 0.983 | 1.077 |
| Final (1876) | 1.055 | 1.073 |
Validation loss plateaued around step 1600 of 1876 total steps — the final ~15% of training produced only marginal further improvement (1.074 → 1.073). Train loss showed some batch-to-batch volatility late in training (a step-1800 spike to 1.088, consistent with the dataset's domain diversity rather than divergence), but validation loss stayed smooth and never reversed, indicating no overfitting across the two epochs.
Evaluation
Benchmark Results
Evaluated against the base model (unsloth/qwen3-0.6b-unsloth-bnb-4bit) using lm-evaluation-harness.
| Task | Base (Qwen3-0.6B) | Atem-0.6B | Delta (percentage points) |
|---|---|---|---|
| ARC-Challenge (0-shot, acc_norm) | 33.0% | 35.0% | +2.0 ✓ |
| GSM8K (5-shot, strict-match) | 26.7% | 31.8% | +5.1 ✓ |
| HellaSwag (0-shot, acc_norm) | 45.3% | 45.8% | +0.5 |
Eval condition note: the base model was loaded in 4-bit (unsloth/qwen3-0.6b-unsloth-bnb-4bit); Atem-0.6B was evaluated as the full bfloat16 merged model. This is not a precision-matched comparison — the gap may be modestly inflated relative to a 4-bit-vs-4-bit or bf16-vs-bf16 run. GSM8K used 5-shot prompting per lm-eval's default config; ARC-Challenge and HellaSwag were 0-shot.
The GSM8K gain is the standout figure, but it likely reflects Stage 1's training toward clean, directly formatted final answers more than a deeper improvement in multi-step mathematical reasoning; answer formatting matters a great deal for lm-eval's exact-match-on-extracted-number scoring. The qualitative evaluation below, which looks at harder, less templated problems, supports this reading: reasoning depth on multi-step problems is not uniformly better than the base model. ARC-Challenge and HellaSwag, which probe general and commonsense knowledge rather than output formatting, moved only slightly, as expected, since SFT on this corpus isn't designed to add new general knowledge.
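To make the formatting sensitivity concrete, the sketch below imitates a strict exact-match scorer. It is a simplified stand-in, not a copy of lm-evaluation-harness: the pattern is an assumption modelled on GSM8K's `#### <number>` answer convention, and `strict_match` is a hypothetical helper.

```python
import re

# Assumed pattern: GSM8K references end with "#### <number>", and a strict
# scorer only accepts predictions that follow the same convention.
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")


def strict_match(prediction: str, reference: str) -> bool:
    """Score a hit only if a '#### <number>' answer is present and equals the reference."""
    found = STRICT_PATTERN.search(prediction)
    if found is None:
        return False
    return found.group(1).replace(",", "") == reference.replace(",", "")


reference = "72"
formatted = "48 + 24 = 72 clips in total.\n#### 72"
unformatted = "She sold 48 clips, then half as many, so the answer is seventy-two (72)."

print(strict_match(formatted, reference))
print(strict_match(unformatted, reference))
```

Both predictions reach the right number, but only the first is credited. A model that learns the expected answer layout can therefore gain points without reasoning any better.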
Qualitative Evaluation
Atem-0.6B was evaluated against base Qwen3-0.6B (default thinking-enabled) across 30 domain-representative questions with matched system prompts.
| Domain | Questions | Outcome |
|---|---|---|
| Coding | 8 | Mixed — comparable correctness; Atem notably more concise and direct |
| Mathematics | 6 | Mixed — base model's exposed reasoning self-corrects mid-generation on some multi-step problems that Atem commits to an error on |
| Analytical Reasoning | 6 | Base model edges ahead — exposed reasoning gives more room to work through multi-step arguments |
| General Knowledge | 5 | Comparable |
| Language & Logic | 5 | Comparable, slight edge to base on illustrative examples |
Atem-0.6B's outputs were consistently more concise and directly formatted — a direct result of Stage 1's design goal of suppressing exposed chain-of-thought. This did not translate into a uniform quality advantage over the base model: on problems requiring several sequential reasoning steps, the base model's visible thinking trace sometimes catches and corrects mistakes mid-generation that Atem, having no scratchpad, does not. This is the expected cost of the no-think format rather than a knowledge regression, and is the explicit target of the in-progress Stage 2 (Atem-Savant-0.6B) training, which reintroduces chain-of-thought.
One output during qualitative testing showed repetitive/degenerate text (a duplicated bullet list) on a single open-ended analytical question — noted here for transparency rather than treated as representative.
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that checks whether a number is prime."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
)
print(response)
Unsloth (faster inference)
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-0.6B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a stack and a queue, with examples."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
))
Ollama
# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-0.6B:Q4_K_M
# Higher quality
ollama run hf.co/EphAsad/Atem-0.6B:Q5_K_M
# Near-lossless
ollama run hf.co/EphAsad/Atem-0.6B:Q8_0
llama.cpp
llama-server -hf EphAsad/Atem-0.6B:Q4_K_M
System Prompt
Atem-0.6B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
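The default-versus-override behaviour lives in the model's chat template, but the same rule can be sketched in plain Python. This is only an analogue under stated assumptions: `resolve_messages` is a hypothetical helper, and `DEFAULT_PERSONA` is a shortened placeholder rather than the exact text shipped in the template.

```python
# Placeholder text; the persona actually embedded in the template may differ.
DEFAULT_PERSONA = "You are Atem, a precise and analytical reasoning assistant."


def resolve_messages(messages: list[dict]) -> list[dict]:
    """Prepend the default persona only when the caller supplied no system message."""
    if messages and messages[0]["role"] == "system":
        return list(messages)  # an explicit system message takes priority
    return [{"role": "system", "content": DEFAULT_PERSONA}] + list(messages)


user_only = [{"role": "user", "content": "Is 91 prime?"}]
custom = [{"role": "system", "content": "Answer in one word."}] + user_only

print(resolve_messages(user_only)[0]["role"])
print(resolve_messages(custom)[0]["content"])
```

In normal use none of this needs to be written by hand: passing a `messages` list with or without a leading system entry to `apply_chat_template` is enough.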
Available Files
| File | Size | Description |
|---|---|---|
| model.safetensors | ~1.2 GB | Full bfloat16 merged weights |
| Atem-0.6b.Q4_K_M.gguf | ~397 MB | 4-bit quantised (recommended) |
| Atem-0.6b.Q5_K_M.gguf | ~444 MB | 5-bit quantised |
| Atem-0.6b.Q8_0.gguf | ~700 MB | 8-bit quantised (near-lossless) |
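One way to read the file sizes is as an effective storage cost per parameter. The sketch below does that arithmetic using the approximate figures in this card; it assumes the sizes are decimal megabytes and uses the rounded ~596M parameter count, so the results are indicative only.

```python
PARAMS = 596_000_000  # "~596M" from the Model Details table; approximate

# Listed sizes, read as decimal megabytes (an assumption; the card does not say MB vs MiB).
FILES_MB = {
    "model.safetensors": 1200,
    "Atem-0.6b.Q4_K_M.gguf": 397,
    "Atem-0.6b.Q5_K_M.gguf": 444,
    "Atem-0.6b.Q8_0.gguf": 700,
}


def bits_per_param(size_mb: float, params: int = PARAMS) -> float:
    """Effective storage cost per parameter, with file metadata included."""
    return size_mb * 1_000_000 * 8 / params


effective = {name: round(bits_per_param(mb), 1) for name, mb in FILES_MB.items()}
for name, bits in effective.items():
    print(f"{name}: {bits} bits/param")
```

The bfloat16 file lands at roughly 16 bits per parameter, as expected. The quantised files come out above their nominal bit widths, which may reflect rounding in the listed sizes, file metadata, and tensors that quantisation schemes keep at higher precision.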
Known Limitations
No thinking traces (Stage 1 by design). Think tags were stripped from all training data for Stage 1, and Qwen3's native exposed reasoning is suppressed. The model does not produce extended <think> content. As shown in the qualitative evaluation above, this measurably costs accuracy on multi-step analytical and mathematical problems relative to the base model's default thinking-enabled behavior — Stage 2 (Atem-Savant-0.6B, in progress) exists specifically to recover this.
Smaller capability ceiling than larger Atem models. At 0.6B parameters, this is the smallest model in the Atem family. Treat it as a fast, low-footprint option rather than a reasoning-depth flagship.
Mathematical precision on complex problems. On multi-step calculations, the model may make arithmetic or counting errors without a scratchpad to catch them; this was observed directly in qualitative testing (e.g., miscounting combinatorial outcomes). Answers to high-stakes mathematical problems should be independently verified.
Eval precision asymmetry. The benchmark comparison above evaluated the base model in 4-bit and Atem-0.6B in bfloat16 — see the Evaluation section for details. A precision-matched re-run would give a cleaner comparison.
Roadmap
Atem-0.6B establishes the Stage 1 foundation for the Qwen3-based branch of the Atem family. Planned next steps:
- Stage 2 (in progress): Atem-Savant-0.6B — LoRA SFT on curated chain-of-thought data (~90% think-trace records, ~10% no-think) using OpenR1-Math, Kimi-K2.5, DeepSeek-V4-Pro-Reasoning, OpenCodeReasoning, and trace-inversion datasets, to recover multi-step reasoning depth on top of Stage 1's direct-answer foundation
- Extended benchmarks: MMLU, BBH, IFEval post-Stage 2
- Precision-matched re-benchmark: re-run base vs Atem comparison under identical 4-bit (or identical bf16) conditions
Citation
@misc{atem_06b_2026,
author = {Asad, Zain},
title = {Atem-0.6B: A 0.6B Direct-Reasoning Model via
Stage 1 SFT on Qwen3},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/EphAsad/Atem-0.6B}},
}
License
Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-0.6B.
Built independently by Zain Asad - EphAsad
