---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B
tags:
- unsloth
- lora
- qwen3
- reasoning
- distillation
- conversational
datasets:
- EphAsad/QWENMillenium-SF
- EphAsad/Phi4Millennium-SF
- EphAsad/MistralMillenium-SF
- Modotte/CodeX-2M-Thinking
- Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
- WithinUsAI/MiniMax_M2.7_Distilled_5k
- tuanha1305/DeepSeek-R1-Distill
- open-r1/OpenThoughts-114k-math
- flytech/python-codes-25k
- FreedomIntelligence/medical-o1-reasoning-SFT
- Jackrong/Claude-opus-4.7-TraceInversion-5000x
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

![Atem Logo](https://huggingface.co/EphAsad/Atem-0.6B/resolve/main/Logo.png)

# Atem-0.6B

*Ancient logic. Modern intelligence.*

A 0.6B reasoning model trained via multi-source knowledge distillation from frontier teacher models.

![Base Model](https://img.shields.io/badge/Base-Qwen3--0.6B-blue)![Method](https://img.shields.io/badge/Method-LoRA%20SFT-purple)![Parameters](https://img.shields.io/badge/Parameters-0.6B-orange)![License](https://img.shields.io/badge/License-Apache%202.0-green)

---

## Overview

Atem-0.6B is a 0.6B-parameter reasoning model built via supervised fine-tuning on a curated corpus of approximately 120,000 examples distilled from multiple frontier teacher models. Starting from Qwen/Qwen3-0.6B, Atem was trained using LoRA to preserve base model capabilities while shifting output style toward clean, directly formatted final answers.

This is **Stage 1** of a planned multi-stage training series, and the first entry in the Atem family built on Qwen3 rather than Qwen2.5. Stage 1 strips `<think>` reasoning traces from all training data, deliberately suppressing Qwen3's native exposed chain-of-thought in favor of direct answers. **Stage 2 (Atem-Savant-0.6B) is currently in progress**, layering curated chain-of-thought traces back on top of this foundation — see [Known Limitations](#known-limitations) for why that stage matters.

---

## Model Details

| Property                 | Value                                   |
| ------------------------ | ---------------------------------------- |
| **Base model**           | Qwen/Qwen3-0.6B                          |
| **Training method**      | LoRA Supervised Fine-Tuning (Stage 1)    |
| **LoRA config**          | r=32, alpha=64, dropout=0.05             |
| **Target modules**       | q, k, v, o, gate, up, down projections   |
| **Parameters**           | ~596M                                    |
| **Trainable (LoRA) params** | 20,185,088 (3.28% of all parameters, adapters included) |
| **Training records**     | 120,017                                  |
| **Epochs**                | 2                                        |
| **Effective batch size** | 128 (batch 32 × grad accum 4)            |
| **Learning rate**        | 2e-4, cosine schedule, 5% warmup         |
| **Final train loss**     | 1.055                                    |
| **Final val loss**       | 1.073                                    |
| **Hardware**             | NVIDIA A100-SXM4 80GB                    |
| **Max sequence length**  | 4,096 tokens                             |
| **Precision**            | bfloat16                                 |
| **License**              | Apache 2.0                               |
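
The trainable-parameter figure in the table can be reproduced by hand. The sketch below assumes the standard Qwen3-0.6B layer dimensions (hidden size 1024, 28 layers, 16 query heads and 8 key/value heads with head dimension 128, MLP width 3072). These values are not stated in this card, so check them against the base model's `config.json` before relying on them.

```python
# Assumed Qwen3-0.6B dimensions (not stated in this card; verify in config.json).
HIDDEN = 1024
NUM_LAYERS = 28
Q_OUT = 16 * 128       # 16 query heads x head_dim 128
KV_OUT = 8 * 128       # 8 key/value heads x head_dim 128
INTERMEDIATE = 3072
RANK = 32              # r from the LoRA config above

# (in_features, out_features) of each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


print(f"{lora_param_count():,}")
```

Under these assumed dimensions the count comes out to 20,185,088, matching the table.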

---

## Intended Use

Atem-0.6B is designed for lightweight, open-ended reasoning tasks where structured, direct answers add value at low compute cost:

- Code explanation, implementation, and debugging
- Mathematical problem solving with working shown
- Analytical reasoning and hypothesis evaluation
- Concept explanation and comparative analysis
- Logic, argument, and fallacy identification

Atem-0.6B is **not** designed for retrieval-heavy factual lookup, real-time information, or tasks requiring broad knowledge beyond its training domains. At 0.6B parameters its capability ceiling is naturally lower than that of larger Atem models; expect it to be most useful where speed and footprint matter more than depth on hard, multi-step problems.

---

## Training Data

Atem-0.6B was trained on a corpus assembled from eleven sources, combining domain-specific generated datasets and publicly available distillation datasets from frontier models. All outputs containing `<think>` reasoning traces were stripped to clean final responses for Stage 1 training.

| Dataset                                      | Records      | Source / Teacher                                    |
| --------------------------------------------- | ------------ | ----------------------------------------------------- |
| EphAsad/QWENMillenium-SF                      | ~            | Qwen2.5-14B — Analytical & Scientific                  |
| EphAsad/Phi4Millennium-SF                     | ~            | Phi-4 14B — Mathematical Reasoning                     |
| EphAsad/MistralMillenium-SF                   | ~            | Mistral-Nemo-12B — Language & Comprehension            |
| Modotte/CodeX-2M-Thinking                     | 40,000       | Mixed — Coding                                         |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned       | 23,000       | Kimi K2.5 — General Distillation (English filtered)    |
| WithinUsAI/MiniMax_M2.7_Distilled_5k          | 5,000        | MiniMax M2.7                                           |
| tuanha1305/DeepSeek-R1-Distill                | 9,000        | DeepSeek-R1                                            |
| open-r1/OpenThoughts-114k-math                | 10,000       | Mixed — Mathematics (correct answers only)             |
| flytech/python-codes-25k                      | 10,000       | Python coding                                          |
| FreedomIntelligence/medical-o1-reasoning-SFT  | 10,000       | Medical reasoning (English config)                     |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 5,000        | Claude Opus 4.7 — Trace Inversion                      |
| **Total**                                     | **120,017**  |                                                         |
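
The trace-stripping step described above can be pictured with a short sketch. This is not the actual preprocessing script, which is not included in this card; `strip_think` and the regular expression are illustrative names and choices.

```python
import re

# Matches one complete <think>...</think> span plus any whitespace after it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(response: str) -> str:
    """Return the response with any reasoning trace removed."""
    return THINK_BLOCK.sub("", response).strip()


example = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
print(strip_think(example))  # The answer is 4.
```

Responses without a trace pass through unchanged, so the same function can be applied to every source regardless of whether its teacher emitted `<think>` tags.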

---

## Training Configuration

```python
import torch

# Key hyperparameters
lora_r            = 32
lora_alpha        = 64
lora_dropout      = 0.05
max_seq_length    = 4096
learning_rate     = 2e-4
lr_scheduler      = 'cosine'
warmup_ratio      = 0.05
batch_size        = 32
grad_accumulation = 4        # effective batch size: 128
num_epochs        = 2
dtype             = torch.bfloat16
load_in_4bit      = True     # during training

```

Training used Unsloth with `train_on_responses_only` masking, so loss was computed only on assistant response tokens. Because Qwen3 ships with no default system prompt (unlike Qwen2.5-Instruct), Atem's identity is baked in via a chat-template modification that injects Atem as the default persona only when no explicit system message is supplied; explicit system messages still take priority. Before training began, a validation suite verified this injection, confirmed that the response-masking boundary correctly accounts for Qwen3's automatic insertion of an empty `<think></think>` scaffold, and checked for leaked reasoning content.
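
To make the response-only masking concrete, here is a minimal sketch of the idea in plain Python. It is not Unsloth's implementation: the function name, the marker token ids, and the toy sequence are all made up for illustration. In a real Qwen3 setup the start marker would also have to account for the empty think scaffold mentioned above.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores


def response_only_labels(input_ids, response_start, turn_end):
    """Build labels that keep only assistant-response tokens.

    response_start: token ids that open an assistant turn (hypothetical).
    turn_end: token id that closes a turn; it stays in the loss so the
        model learns where to stop.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    n = len(response_start)
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + n] == response_start:
            i += n  # the marker itself is not trained on
            while i < len(input_ids):
                labels[i] = input_ids[i]
                i += 1
                if input_ids[i - 1] == turn_end:
                    break
        else:
            i += 1
    return labels


# Toy sequence: user turn [1, 2, ...], assistant turn [1, 3, ...], 9 ends a turn.
ids = [1, 2, 50, 51, 9, 1, 3, 60, 61, 9]
print(response_only_labels(ids, response_start=[1, 3], turn_end=9))
```

Only the last three positions (the assistant's tokens and its end-of-turn token) keep their ids; everything else is set to `-100` and contributes nothing to the loss.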

After training, LoRA adapters were merged into the base weights and exported as a full merged model.

**Loss curve:**

| Step  | Train Loss | Val Loss  |
| ----- | ---------- | --------- |
| 200   | 1.166      | 1.163     |
| 800   | 1.108      | 1.096     |
| 1400  | 0.983      | 1.077     |
| Final (1876) | **1.055** | **1.073** |

Validation loss plateaued around step 1600 of 1876 total steps — the final ~15% of training produced only marginal further improvement (1.074 → 1.073). Train loss showed some batch-to-batch volatility late in training (a step-1800 spike to 1.088, consistent with the dataset's domain diversity rather than divergence), but validation loss stayed smooth and never reversed, indicating no overfitting across the two epochs.
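
The step counts quoted above follow from the record count and batch size. The arithmetic below assumes every one of the 120,017 records was in the training split and that the final partial batch of each epoch was kept; neither detail is stated in this card.

```python
import math

records = 120_017
effective_batch = 32 * 4   # batch size x gradient accumulation
epochs = 2

steps_per_epoch = math.ceil(records / effective_batch)
total_steps = steps_per_epoch * epochs
tail_fraction = 1 - 1600 / total_steps   # share of training after step 1600

print(total_steps)              # 1876
print(f"{tail_fraction:.1%}")   # 14.7%
```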

---

## Evaluation

### Benchmark Results

Evaluated against the base model (`unsloth/qwen3-0.6b-unsloth-bnb-4bit`) using lm-evaluation-harness.

| Task                             | Base (Qwen3-0.6B) | Atem-0.6B | Delta (percentage points) |
| -------------------------------- | ----------------- | --------- | ------------------------- |
| ARC-Challenge (0-shot, acc_norm) | 33.0%             | 35.0%     | +2.0 ✓                    |
| GSM8K (5-shot, strict-match)     | 26.7%             | **31.8%** | **+5.1** ✓                |
| HellaSwag (0-shot, acc_norm)     | 45.3%             | 45.8%     | +0.5                      |

**Eval condition note:** the base model was loaded in 4-bit (`unsloth/qwen3-0.6b-unsloth-bnb-4bit`); Atem-0.6B was evaluated as the full bfloat16 merged model. This is not a precision-matched comparison — the gap may be modestly inflated relative to a 4-bit-vs-4-bit or bf16-vs-bf16 run. GSM8K used 5-shot prompting per lm-eval's default config; ARC-Challenge and HellaSwag were 0-shot.

The GSM8K gain is the standout figure, but it likely reflects Stage 1's training toward clean, directly formatted final answers, which matters a great deal for lm-eval's exact-match-on-extracted-number scoring, more than a deeper improvement in multi-step mathematical reasoning. The qualitative evaluation below, which looks at harder, less templated problems, supports this reading: reasoning depth on multi-step problems is not uniformly better than the base model's. ARC-Challenge and HellaSwag, which probe general and commonsense knowledge rather than output formatting, moved only slightly. That is expected, since SFT on this corpus is not designed to add new general knowledge.
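
A small sketch shows why answer formatting can move a strict-match score on its own. The pattern below is modeled on the `#### <number>` convention used by GSM8K's strict-match filter in lm-evaluation-harness; treat the exact regular expression as an assumption and check the task's YAML for the authoritative version. `extract_answer` and `is_correct` are illustrative helpers, not harness functions.

```python
import re

# Assumption: modeled on lm-eval's GSM8K strict-match filter, not copied from it.
STRICT_MATCH = re.compile(r"#### (\-?[0-9\.\,]+)")


def extract_answer(completion: str):
    """Return the first number that follows '#### ', or None if absent."""
    match = STRICT_MATCH.search(completion)
    return match.group(1) if match else None


def is_correct(completion: str, gold: str) -> bool:
    return extract_answer(completion) == gold


formatted = "48 + 24 = 72\n#### 72"
unformatted = "So she sold 72 clips in total."

print(is_correct(formatted, "72"))    # True
print(is_correct(unformatted, "72"))  # False: right number, wrong format
```

Both completions contain the right number, but only the one that follows the expected format is scored as correct, so a model that learns the format can gain points without reasoning any better.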

### Qualitative Evaluation

Atem-0.6B was evaluated against base Qwen3-0.6B (default thinking-enabled) across 30 domain-representative questions with matched system prompts.

| Domain               | Questions | Outcome                                                                                  |
| --------------------- | --------- | ----------------------------------------------------------------------------------------- |
| Coding                | 8         | Mixed — comparable correctness; Atem notably more concise and direct                      |
| Mathematics           | 6         | Mixed — base model's exposed reasoning self-corrects mid-generation on some multi-step problems that Atem commits to an error on |
| Analytical Reasoning  | 6         | Base model edges ahead — exposed reasoning gives more room to work through multi-step arguments |
| General Knowledge     | 5         | Comparable                                                                                 |
| Language & Logic      | 5         | Comparable, slight edge to base on illustrative examples                                  |

Atem-0.6B's outputs were consistently more concise and directly formatted — a direct result of Stage 1's design goal of suppressing exposed chain-of-thought. This did **not** translate into a uniform quality advantage over the base model: on problems requiring several sequential reasoning steps, the base model's visible thinking trace sometimes catches and corrects mistakes mid-generation that Atem, having no scratchpad, does not. This is the expected cost of the no-think format rather than a knowledge regression, and is the explicit target of the in-progress Stage 2 (Atem-Savant-0.6B) training, which reintroduces chain-of-thought.

One output during qualitative testing showed repetitive/degenerate text (a duplicated bullet list) on a single open-ended analytical question — noted here for transparency rather than treated as representative.

---

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that checks whether a number is prime."
    }
]

# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
)
print(response)

```

### Unsloth (faster inference)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-0.6B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a stack and a queue, with examples."
    }
]

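# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
))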

```

### Ollama

```bash
# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-0.6B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-0.6B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-0.6B:Q8_0

```

### llama.cpp

```bash
llama-server -hf EphAsad/Atem-0.6B:Q4_K_M

```

### System Prompt

Atem-0.6B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:

```
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

```
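
The default-versus-override behavior can be mirrored in a few lines of plain Python. The real logic lives in the model's Jinja chat template, which is not reproduced here; `with_default_persona` and the shortened `DEFAULT_PERSONA` string are hypothetical stand-ins used only to show the precedence rule.

```python
# Shortened placeholder; the full persona text is defined in the chat template.
DEFAULT_PERSONA = "You are Atem, a precise and analytical reasoning assistant."


def with_default_persona(messages):
    """Prepend the default persona unless a system message leads the chat."""
    if messages and messages[0]["role"] == "system":
        return list(messages)  # an explicit system message takes priority
    return [{"role": "system", "content": DEFAULT_PERSONA}] + list(messages)


plain = [{"role": "user", "content": "Is 91 prime?"}]
custom = [{"role": "system", "content": "Answer in one word."}] + plain

print(with_default_persona(plain)[0]["content"])
print(with_default_persona(custom)[0]["content"])
```

With no system message the persona is injected; with one, the caller's text is used unchanged. To override the persona in practice, pass your own `{"role": "system", ...}` entry as the first item in `messages`.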

### Available Files

| File                       | Size       | Description                       |
| --------------------------- | ---------- | ----------------------------------- |
| `model.safetensors`         | ~1.2 GB    | Full bfloat16 merged weights        |
| `Atem-0.6b.Q4_K_M.gguf`     | ~397 MB    | 4-bit quantised — recommended       |
| `Atem-0.6b.Q5_K_M.gguf`     | ~444 MB    | 5-bit quantised                     |
| `Atem-0.6b.Q8_0.gguf`       | ~700 MB    | 8-bit quantised — near-lossless     |

---

## Known Limitations

**No thinking traces (Stage 1 by design).** Think tags were stripped from all training data for Stage 1, and Qwen3's native exposed reasoning is suppressed. The model does not produce extended `<think>` content. As shown in the qualitative evaluation above, this measurably costs accuracy on multi-step analytical and mathematical problems relative to the base model's default thinking-enabled behavior — Stage 2 (Atem-Savant-0.6B, in progress) exists specifically to recover this.

**Smaller capability ceiling than larger Atem models.** At 0.6B parameters, this is the smallest model in the Atem family. Treat it as a fast, low-footprint option rather than a reasoning-depth flagship.

**Mathematical precision on complex problems.** On multi-step calculations, the model may make arithmetic or counting errors because it has no scratchpad to catch them. This was observed directly in qualitative testing (e.g., miscounting combinatorial outcomes). Answers to high-stakes mathematical problems should be independently verified.

**Eval precision asymmetry.** The benchmark comparison above evaluated the base model in 4-bit and Atem-0.6B in bfloat16 — see the Evaluation section for details. A precision-matched re-run would give a cleaner comparison.

---

## Roadmap

Atem-0.6B establishes the Stage 1 foundation for the Qwen3-based branch of the Atem family. Planned next steps:

- **Stage 2 (in progress):** Atem-Savant-0.6B — LoRA SFT on curated chain-of-thought data (~90% think-trace records, ~10% no-think) using OpenR1-Math, Kimi-K2.5, DeepSeek-V4-Pro-Reasoning, OpenCodeReasoning, and trace-inversion datasets, to recover multi-step reasoning depth on top of Stage 1's direct-answer foundation
- **Extended benchmarks:** MMLU, BBH, IFEval post-Stage 2
- **Precision-matched re-benchmark:** re-run base vs Atem comparison under identical 4-bit (or identical bf16) conditions

---

## Citation

```bibtex
@misc{atem_06b_2026,
  author       = {Asad, Zain},
  title        = {Atem-0.6B: A 0.6B Direct-Reasoning Model via
                  Stage 1 SFT on Qwen3},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-0.6B}},
}

```

---

## License

Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base model Qwen/Qwen3-0.6B.

---

Built independently by Zain Asad - [EphAsad](https://huggingface.co/EphAsad)