license, base_model, tags, datasets, language, pipeline_tag, library_name
license base_model tags datasets language pipeline_tag library_name
apache-2.0 Qwen/Qwen3-0.6B
unsloth
lora
qwen3
reasoning
distillation
conversational
EphAsad/QWENMillenium-SF
EphAsad/Phi4Millennium-SF
EphAsad/MistralMillenium-SF
Modotte/CodeX-2M-Thinking
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
WithinUsAI/MiniMax_M2.7_Distilled_5k
tuanha1305/DeepSeek-R1-Distill
open-r1/OpenThoughts-114k-math
flytech/python-codes-25k
FreedomIntelligence/medical-o1-reasoning-SFT
Jackrong/Claude-opus-4.7-TraceInversion-5000x
en
text-generation transformers

Atem Logo

Atem-0.6B

Ancient logic. Modern intelligence.

A 0.6B reasoning model trained via multi-source knowledge distillation from frontier teacher models.

Base ModelMethodParametersLicense


Overview

Atem-0.6B is a 0.6B parameter reasoning model built via supervised fine-tuning on a curated corpus of approximately 120,000 examples distilled from multiple frontier teacher models. Starting from Qwen/Qwen3-0.6B, Atem was trained using LoRA to preserve base model capabilities while shifting output style toward clean, directly-formatted final answers.

This is Stage 1 of a planned multi-stage training series, and the first entry in the Atem family built on Qwen3 rather than Qwen2.5. Stage 1 strips <think> reasoning traces from all training data, deliberately suppressing Qwen3's native exposed chain-of-thought in favor of direct answers. Stage 2 (Atem-Savant-0.6B) is currently in progress, layering curated chain-of-thought traces back on top of this foundation — see Known Limitations for why that stage matters.


Model Details

Property Value
Base model Qwen/Qwen3-0.6B
Training method LoRA Supervised Fine-Tuning (Stage 1)
LoRA config r=32, alpha=64, dropout=0.05
Target modules q, k, v, o, gate, up, down projections
Parameters ~596M
Trainable (LoRA) params 20,185,088 (3.28% of base)
Training records 120,017
Epochs 2
Effective batch size 128 (batch 32 × grad accum 4)
Learning rate 2e-4, cosine schedule, 5% warmup
Final train loss 1.055
Final val loss 1.073
Hardware NVIDIA A100-SXM4 80GB
Max sequence length 4,096 tokens
Precision bfloat16
License Apache 2.0

Intended Use

Atem-0.6B is designed for lightweight, open-ended reasoning tasks where structured, direct answers add value at low compute cost:

  • Code explanation, implementation, and debugging
  • Mathematical problem solving with working shown
  • Analytical reasoning and hypothesis evaluation
  • Concept explanation and comparative analysis
  • Logic, argument, and fallacy identification

Atem-0.6B is not designed for retrieval-heavy factual lookup, real-time information, or tasks requiring broad knowledge breadth beyond its training domains. At 0.6B parameters its capability ceiling is naturally lower than larger Atem models — expect it to be most useful where speed and footprint matter more than depth on hard, multi-step problems.


Training Data

Atem-0.6B was trained on a corpus assembled from eleven sources, combining domain-specific generated datasets and publicly available distillation datasets from frontier models. All outputs containing <think> reasoning traces were stripped to clean final responses for Stage 1 training.

Dataset Records Source / Teacher
EphAsad/QWENMillenium-SF ~ Qwen2.5-14B — Analytical & Scientific
EphAsad/Phi4Millennium-SF ~ Phi-4 14B — Mathematical Reasoning
EphAsad/MistralMillenium-SF ~ Mistral-Nemo-12B — Language & Comprehension
Modotte/CodeX-2M-Thinking 40,000 Mixed — Coding
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned 23,000 Kimi K2.5 — General Distillation (English filtered)
WithinUsAI/MiniMax_M2.7_Distilled_5k 5,000 MiniMax M2.7
tuanha1305/DeepSeek-R1-Distill 9,000 DeepSeek-R1
open-r1/OpenThoughts-114k-math 10,000 Mixed — Mathematics (correct answers only)
flytech/python-codes-25k 10,000 Python coding
FreedomIntelligence/medical-o1-reasoning-SFT 10,000 Medical reasoning (English config)
Jackrong/Claude-opus-4.7-TraceInversion-5000x 5,000 Claude Opus 4.7 — Trace Inversion
Total 120,017

Training Configuration

# Key hyperparameters
lora_r            = 32
lora_alpha        = 64
lora_dropout      = 0.05
max_seq_length    = 4096
learning_rate     = 2e-4
lr_scheduler      = 'cosine'
warmup_ratio      = 0.05
batch_size        = 32
grad_accumulation = 4        # effective batch size: 128
num_epochs        = 2
dtype             = bfloat16
load_in_4bit      = True     # during training

Training used Unsloth with train_on_responses_only masking, ensuring loss was computed exclusively on assistant response tokens. Because Qwen3 ships with no default system prompt (unlike Qwen2.5-Instruct), Atem's identity is baked in via a chat-template modification that injects Atem as the default persona only when no explicit system message is supplied — explicit system messages still take priority. A pre-training validation suite verified this injection, confirmed the response-masking boundary correctly accounts for Qwen3's automatic empty <think></think> scaffold insertion, and checked for leaked reasoning content before training began.

After training, LoRA adapters were merged into the base weights and exported as a full merged model.

Loss curve:

Step Train Loss Val Loss
200 1.166 1.163
800 1.108 1.096
1400 0.983 1.077
Final (1876) 1.055 1.073

Validation loss plateaued around step 1600 of 1876 total steps — the final ~15% of training produced only marginal further improvement (1.074 → 1.073). Train loss showed some batch-to-batch volatility late in training (a step-1800 spike to 1.088, consistent with the dataset's domain diversity rather than divergence), but validation loss stayed smooth and never reversed, indicating no overfitting across the two epochs.


Evaluation

Benchmark Results

Evaluated against the base model (unsloth/qwen3-0.6b-unsloth-bnb-4bit) using lm-evaluation-harness.

Task Base (Qwen3-0.6B) Atem-0.6B Delta
ARC-Challenge (0-shot, acc_norm) 33.0% 35.0% +2.0% ✓
GSM8K (5-shot, strict-match) 26.7% 31.8% +5.1%
HellaSwag (0-shot, acc_norm) 45.3% 45.8% +0.5%

Eval condition note: the base model was loaded in 4-bit (unsloth/qwen3-0.6b-unsloth-bnb-4bit); Atem-0.6B was evaluated as the full bfloat16 merged model. This is not a precision-matched comparison — the gap may be modestly inflated relative to a 4-bit-vs-4-bit or bf16-vs-bf16 run. GSM8K used 5-shot prompting per lm-eval's default config; ARC-Challenge and HellaSwag were 0-shot.

The GSM8K gain is the standout figure, but it likely reflects Stage 1's training toward clean, directly-formatted final answers — which matters a great deal for lm-eval's exact-match-on-extracted-number scoring — more than a deeper improvement in multi-step mathematical reasoning. The qualitative evaluation below, which looks at harder, less templated problems, supports this reading: reasoning depth on multi-step problems is not uniformly better than the base model. ARC-Challenge and HellaSwag, which probe general/commonsense knowledge rather than output formatting, moved only slightly — expected, since SFT on this corpus isn't designed to add new general knowledge.

Qualitative Evaluation

Atem-0.6B was evaluated against base Qwen3-0.6B (default thinking-enabled) across 30 domain-representative questions with matched system prompts.

Domain Questions Outcome
Coding 8 Mixed — comparable correctness; Atem notably more concise and direct
Mathematics 6 Mixed — base model's exposed reasoning self-corrects mid-generation on some multi-step problems that Atem commits to an error on
Analytical Reasoning 6 Base model edges ahead — exposed reasoning gives more room to work through multi-step arguments
General Knowledge 5 Comparable
Language & Logic 5 Comparable, slight edge to base on illustrative examples

Atem-0.6B's outputs were consistently more concise and directly formatted — a direct result of Stage 1's design goal of suppressing exposed chain-of-thought. This did not translate into a uniform quality advantage over the base model: on problems requiring several sequential reasoning steps, the base model's visible thinking trace sometimes catches and corrects mistakes mid-generation that Atem, having no scratchpad, does not. This is the expected cost of the no-think format rather than a knowledge regression, and is the explicit target of the in-progress Stage 2 (Atem-Savant-0.6B) training, which reintroduces chain-of-thought.

One output during qualitative testing showed repetitive/degenerate text (a duplicated bullet list) on a single open-ended analytical question — noted here for transparency rather than treated as representative.


Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that checks whether a number is prime."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-0.6B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a stack and a queue, with examples."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-0.6B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-0.6B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-0.6B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-0.6B:Q4_K_M

System Prompt

Atem-0.6B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

Available Files

File Size Description
model.safetensors ~1.2 GB Full bfloat16 merged weights
Atem-0.6b.Q4_K_M.gguf ~397 MB 4-bit quantised — recommended
Atem-0.6b.Q5_K_M.gguf ~444 MB 5-bit quantised
Atem-0.6b.Q8_0.gguf ~700 MB 8-bit quantised — near-lossless

Known Limitations

No thinking traces (Stage 1 by design). Think tags were stripped from all training data for Stage 1, and Qwen3's native exposed reasoning is suppressed. The model does not produce extended <think> content. As shown in the qualitative evaluation above, this measurably costs accuracy on multi-step analytical and mathematical problems relative to the base model's default thinking-enabled behavior — Stage 2 (Atem-Savant-0.6B, in progress) exists specifically to recover this.

Smaller capability ceiling than larger Atem models. At 0.6B parameters, this is the smallest model in the Atem family. Treat it as a fast, low-footprint option rather than a reasoning-depth flagship.

Mathematical precision on complex problems. On multi-step calculations, the model may make arithmetic or counting errors without a scratchpad to catch them — verified directly in qualitative testing (e.g., miscounting combinatorial outcomes). Answers to high-stakes mathematical problems should be independently verified.

Eval precision asymmetry. The benchmark comparison above evaluated the base model in 4-bit and Atem-0.6B in bfloat16 — see the Evaluation section for details. A precision-matched re-run would give a cleaner comparison.


Roadmap

Atem-0.6B establishes the Stage 1 foundation for the Qwen3-based branch of the Atem family. Planned next steps:

  • Stage 2 (in progress): Atem-Savant-0.6B — LoRA SFT on curated chain-of-thought data (~90% think-trace records, ~10% no-think) using OpenR1-Math, Kimi-K2.5, DeepSeek-V4-Pro-Reasoning, OpenCodeReasoning, and trace-inversion datasets, to recover multi-step reasoning depth on top of Stage 1's direct-answer foundation
  • Extended benchmarks: MMLU, BBH, IFEval post-Stage 2
  • Precision-matched re-benchmark: re-run base vs Atem comparison under identical 4-bit (or identical bf16) conditions

Citation

@misc{atem_06b_2026,
  author       = {Asad, Zain},
  title        = {Atem-0.6B: A 0.6B Direct-Reasoning Model via
                  Stage 1 SFT on Qwen3},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-0.6B}},
}


License

Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-0.6B.


Built independently by Zain Asad - EphAsad

Description
Model synced from source: EphAsad/Atem-0.6B
Readme 36 KiB
Languages
Jinja 100%