---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B
tags:
- unsloth
- lora
- qwen3
- reasoning
- distillation
- conversational
datasets:
- EphAsad/QWENMillenium-SF
- EphAsad/Phi4Millennium-SF
- EphAsad/MistralMillenium-SF
- Modotte/CodeX-2M-Thinking
- Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
- WithinUsAI/MiniMax_M2.7_Distilled_5k
- tuanha1305/DeepSeek-R1-Distill
- open-r1/OpenThoughts-114k-math
- flytech/python-codes-25k
- FreedomIntelligence/medical-o1-reasoning-SFT
- Jackrong/Claude-opus-4.7-TraceInversion-5000x
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

![Atem Logo](https://huggingface.co/EphAsad/Atem-0.6B/resolve/main/Logo.png)

# Atem-0.6B

*Ancient logic. Modern intelligence.*

A 0.6B reasoning model trained via multi-source knowledge distillation from frontier teacher models.

![Base Model](https://img.shields.io/badge/Base-Qwen3--0.6B-blue)![Method](https://img.shields.io/badge/Method-LoRA%20SFT-purple)![Parameters](https://img.shields.io/badge/Parameters-0.6B-orange)![License](https://img.shields.io/badge/License-Apache%202.0-green)

---

## Overview

Atem-0.6B is a 0.6B parameter reasoning model built via supervised fine-tuning on a curated corpus of approximately 120,000 examples distilled from multiple frontier teacher models. Starting from Qwen/Qwen3-0.6B, Atem was trained using LoRA to preserve base model capabilities while shifting output style toward clean, directly formatted final answers.

This is **Stage 1** of a planned multi-stage training series, and the first entry in the Atem family built on Qwen3 rather than Qwen2.5. Stage 1 strips `<think>` reasoning traces from all training data, deliberately suppressing Qwen3's native exposed chain-of-thought in favor of direct answers. **Stage 2 (Atem-Savant-0.6B) is currently in progress**, layering curated chain-of-thought traces back on top of this foundation. See [Known Limitations](#known-limitations) for why that stage matters.

---

## Model Details

| Property                    | Value                                           |
| --------------------------- | ----------------------------------------------- |
| **Base model**              | Qwen/Qwen3-0.6B                                 |
| **Training method**         | LoRA Supervised Fine-Tuning (Stage 1)           |
| **LoRA config**             | r=32, alpha=64, dropout=0.05                    |
| **Target modules**          | q, k, v, o, gate, up, down projections          |
| **Parameters**              | ~596M                                           |
| **Trainable (LoRA) params** | 20,185,088 (3.28% of base + adapter parameters) |
| **Training records**        | 120,017                                         |
| **Epochs**                  | 2                                               |
| **Effective batch size**    | 128 (batch 32 × grad accum 4)                   |
| **Learning rate**           | 2e-4, cosine schedule, 5% warmup                |
| **Final train loss**        | 1.055                                           |
| **Final val loss**          | 1.073                                           |
| **Hardware**                | NVIDIA A100-SXM4 80GB                           |
| **Max sequence length**     | 4,096 tokens                                    |
| **Precision**               | bfloat16                                        |
| **License**                 | Apache 2.0                                      |

---

## Intended Use

Atem-0.6B is designed for lightweight, open-ended reasoning tasks where structured, direct answers add value at low compute cost:

- Code explanation, implementation, and debugging
- Mathematical problem solving with working shown
- Analytical reasoning and hypothesis evaluation
- Concept explanation and comparative analysis
- Logic, argument, and fallacy identification

Atem-0.6B is **not** designed for retrieval-heavy factual lookup, real-time information, or tasks requiring broad knowledge beyond its training domains. At 0.6B parameters its capability ceiling is lower than that of larger Atem models; expect it to be most useful where speed and footprint matter more than depth on hard, multi-step problems.

---

## Training Data

Atem-0.6B was trained on a corpus assembled from eleven sources, combining domain-specific generated datasets and publicly available distillation datasets from frontier models. All outputs containing `<think>` reasoning traces were stripped to clean final responses for Stage 1 training.
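
The stripping step can be sketched as follows. This is a minimal illustration, not the author's actual preprocessing code: it assumes reasoning traces are delimited by `<think>` and `</think>` tags (Qwen3's convention), and the helper name `strip_think` is hypothetical.

```python
import re

# Matches one <think>...</think> block (including newlines) plus trailing whitespace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove every <think>...</think> block and return the remaining answer text."""
    return _THINK_BLOCK.sub("", text).strip()


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
print(strip_think(sample))  # The answer is 4.
```

An unclosed `<think>` tag passes through unchanged in this sketch, so a real pipeline would need to decide separately whether to repair or drop such records.
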

| Dataset                                       | Records     | Source / Teacher                                   |
| --------------------------------------------- | ----------- | -------------------------------------------------- |
| EphAsad/QWENMillenium-SF                      | ~           | Qwen2.5-14B: Analytical & Scientific               |
| EphAsad/Phi4Millennium-SF                     | ~           | Phi-4 14B: Mathematical Reasoning                  |
| EphAsad/MistralMillenium-SF                   | ~           | Mistral-Nemo-12B: Language & Comprehension         |
| Modotte/CodeX-2M-Thinking                     | 40,000      | Mixed: Coding                                      |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned       | 23,000      | Kimi K2.5: General Distillation (English filtered) |
| WithinUsAI/MiniMax_M2.7_Distilled_5k          | 5,000       | MiniMax M2.7                                       |
| tuanha1305/DeepSeek-R1-Distill                | 9,000       | DeepSeek-R1                                        |
| open-r1/OpenThoughts-114k-math                | 10,000      | Mixed: Mathematics (correct answers only)          |
| flytech/python-codes-25k                      | 10,000      | Python coding                                      |
| FreedomIntelligence/medical-o1-reasoning-SFT  | 10,000      | Medical reasoning (English config)                 |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 5,000       | Claude Opus 4.7: Trace Inversion                   |
| **Total**                                     | **120,017** |                                                    |

---

## Training Configuration

```python
# Key hyperparameters
lora_r = 32
lora_alpha = 64
lora_dropout = 0.05
max_seq_length = 4096
learning_rate = 2e-4
lr_scheduler = 'cosine'
warmup_ratio = 0.05
batch_size = 32
grad_accumulation = 4  # effective batch size: 128
num_epochs = 2
dtype = "bfloat16"
load_in_4bit = True  # during training
```

Training used Unsloth with `train_on_responses_only` masking, ensuring loss was computed exclusively on assistant response tokens. Because Qwen3 ships with no default system prompt (unlike Qwen2.5-Instruct), Atem's identity is baked in via a chat-template modification that injects Atem as the default persona only when no explicit system message is supplied. Explicit system messages still take priority.
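
The hyperparameters in this section can be cross-checked against the figures in Model Details with a little arithmetic. The sketch below assumes the layer dimensions published in Qwen3-0.6B's config (hidden size 1024, 28 layers, 16 query heads and 8 key/value heads of dimension 128, MLP width 3072), which are not stated in this card. It also assumes that all 120,017 records were used for training and that the final partial batch of each epoch counts as a step. The function names are hypothetical.

```python
import math

# Assumed Qwen3-0.6B dimensions (from the base model's published config,
# not from this card).
HIDDEN = 1024
LAYERS = 28
Q_OUT = 16 * 128   # 16 query heads x head_dim 128
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128
MLP = 3072

# (in_features, out_features) of each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(r: int) -> int:
    """Each adapted matrix adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * LAYERS


def total_steps(records: int, batch: int, grad_accum: int, epochs: int) -> int:
    """Optimizer steps, counting a final partial batch in each epoch."""
    return math.ceil(records / (batch * grad_accum)) * epochs


print(lora_trainable_params(32))        # 20185088
print(total_steps(120_017, 32, 4, 2))   # 1876
```

Under these assumptions both results agree with the card: 20,185,088 trainable LoRA parameters and 1,876 total optimizer steps.
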

A validation suite run before training verified this injection, confirmed that the response-masking boundary correctly accounts for Qwen3's automatic empty `<think>` scaffold insertion, and checked for leaked reasoning content. After training, LoRA adapters were merged into the base weights and exported as a full merged model.

**Loss curve:**

| Step         | Train Loss | Val Loss  |
| ------------ | ---------- | --------- |
| 200          | 1.166      | 1.163     |
| 800          | 1.108      | 1.096     |
| 1400         | 0.983      | 1.077     |
| Final (1876) | **1.055**  | **1.073** |

Validation loss plateaued around step 1600 of 1876 total steps: the final ~15% of training produced only marginal further improvement (1.074 → 1.073). Train loss showed some batch-to-batch volatility late in training (a step-1800 spike to 1.088, consistent with the dataset's domain diversity rather than divergence), but validation loss stayed smooth and never reversed, showing no sign of overfitting across the two epochs.

---

## Evaluation

### Benchmark Results

Evaluated against the base model (`unsloth/qwen3-0.6b-unsloth-bnb-4bit`) using lm-evaluation-harness. Deltas are absolute differences in percentage points (pp).

| Task                             | Base (Qwen3-0.6B) | Atem-0.6B | Delta         |
| -------------------------------- | ----------------- | --------- | ------------- |
| ARC-Challenge (0-shot, acc_norm) | 33.0%             | 35.0%     | +2.0 pp ✓     |
| GSM8K (5-shot, strict-match)     | 26.7%             | **31.8%** | **+5.1 pp** ✓ |
| HellaSwag (0-shot, acc_norm)     | 45.3%             | 45.8%     | +0.5 pp       |

**Eval condition note:** the base model was loaded in 4-bit (`unsloth/qwen3-0.6b-unsloth-bnb-4bit`); Atem-0.6B was evaluated as the full bfloat16 merged model. This is not a precision-matched comparison, so the gap may be modestly inflated relative to a 4-bit-vs-4-bit or bf16-vs-bf16 run. GSM8K used 5-shot prompting per lm-eval's default config; ARC-Challenge and HellaSwag were 0-shot.
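
A precision-matched re-run could be set up along these lines. This is a hedged sketch, not the script used for the table above: it assumes lm-evaluation-harness 0.4 or later (which exposes `lm_eval.simple_evaluate`), it loads both models in bfloat16 through the harness's `hf` backend with the bf16 checkpoint `Qwen/Qwen3-0.6B` standing in for the 4-bit base, and the helper names `matched_model_args` and `run_matched_eval` are hypothetical.

```python
# Few-shot settings taken from the eval condition note.
TASK_SHOTS = {"arc_challenge": 0, "hellaswag": 0, "gsm8k": 5}
MODELS = ("Qwen/Qwen3-0.6B", "EphAsad/Atem-0.6B")


def matched_model_args(repo_id: str, dtype: str = "bfloat16") -> str:
    """Build the model_args string so both models load at the same precision."""
    return f"pretrained={repo_id},dtype={dtype}"


def run_matched_eval() -> dict:
    """Evaluate both models on every task; downloads weights and needs a GPU."""
    import lm_eval  # imported lazily so the helper above works without the harness

    scores = {}
    for repo_id in MODELS:
        for task, shots in TASK_SHOTS.items():
            out = lm_eval.simple_evaluate(
                model="hf",
                model_args=matched_model_args(repo_id),
                tasks=[task],
                num_fewshot=shots,
            )
            scores[(repo_id, task)] = out["results"][task]
    return scores


print(matched_model_args("EphAsad/Atem-0.6B"))
```

Calling `run_matched_eval()` is left to the reader because it downloads both checkpoints; the point of the sketch is only that the `dtype` argument is identical for the two models.
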

The GSM8K gain is the standout figure, but it likely reflects Stage 1's training toward clean, directly formatted final answers (which matters a great deal for lm-eval's exact-match-on-extracted-number scoring) more than a deeper improvement in multi-step mathematical reasoning. The qualitative evaluation below, which looks at harder, less templated problems, supports this reading: reasoning depth on multi-step problems is not uniformly better than the base model. ARC-Challenge and HellaSwag, which probe general/commonsense knowledge rather than output formatting, moved only slightly. That is expected, since SFT on this corpus isn't designed to add new general knowledge.

### Qualitative Evaluation

Atem-0.6B was evaluated against base Qwen3-0.6B (default thinking-enabled) across 30 domain-representative questions with matched system prompts.

| Domain               | Questions | Outcome |
| -------------------- | --------- | ------- |
| Coding               | 8         | Mixed: comparable correctness; Atem notably more concise and direct |
| Mathematics          | 6         | Mixed: base model's exposed reasoning self-corrects mid-generation on some multi-step problems that Atem commits to an error on |
| Analytical Reasoning | 6         | Base model edges ahead: exposed reasoning gives more room to work through multi-step arguments |
| General Knowledge    | 5         | Comparable |
| Language & Logic     | 5         | Comparable, slight edge to base on illustrative examples |

Atem-0.6B's outputs were consistently more concise and directly formatted, a direct result of Stage 1's design goal of suppressing exposed chain-of-thought. This did **not** translate into a uniform quality advantage over the base model: on problems requiring several sequential reasoning steps, the base model's visible thinking trace sometimes catches and corrects mistakes mid-generation that Atem, having no scratchpad, does not.
This is the expected cost of the no-think format rather than a knowledge regression, and is the explicit target of the in-progress Stage 2 (Atem-Savant-0.6B) training, which reintroduces chain-of-thought.

One output during qualitative testing showed repetitive/degenerate text (a duplicated bullet list) on a single open-ended analytical question. It is noted here for transparency rather than treated as representative.

---

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that checks whether a number is prime."
    }
]

# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)
```

### Unsloth (faster inference)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-0.6B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a stack and a queue, with examples."
    }
]

# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))
```

### Ollama

```bash
# Recommended: best speed/quality balance
ollama run hf.co/EphAsad/Atem-0.6B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-0.6B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-0.6B:Q8_0
```

### llama.cpp

```bash
llama-server -hf EphAsad/Atem-0.6B:Q4_K_M
```

### System Prompt

Atem-0.6B's identity is baked into the chat template and activates automatically when no system message is provided. For a manual override, use:

```
You are Atem, a precise and analytical reasoning assistant. You approach every problem methodically — identifying core concepts, reasoning step by step, and arriving at well-supported conclusions. You show your thinking clearly and are thorough, direct, and intellectually honest.
```

### Available Files

| File                    | Size    | Description                     |
| ----------------------- | ------- | ------------------------------- |
| `model.safetensors`     | ~1.2 GB | Full bfloat16 merged weights    |
| `Atem-0.6b.Q4_K_M.gguf` | ~397 MB | 4-bit quantised (recommended)   |
| `Atem-0.6b.Q5_K_M.gguf` | ~444 MB | 5-bit quantised                 |
| `Atem-0.6b.Q8_0.gguf`   | ~700 MB | 8-bit quantised (near-lossless) |

---

## Known Limitations

**No thinking traces (Stage 1 by design).** Think tags were stripped from all training data for Stage 1, and Qwen3's native exposed reasoning is suppressed. The model does not produce extended `<think>` content. As shown in the qualitative evaluation above, this measurably costs accuracy on multi-step analytical and mathematical problems relative to the base model's default thinking-enabled behavior. Stage 2 (Atem-Savant-0.6B, in progress) exists specifically to recover this.

**Smaller capability ceiling than larger Atem models.** At 0.6B parameters, this is the smallest model in the Atem family. Treat it as a fast, low-footprint option rather than a reasoning-depth flagship.

**Mathematical precision on complex problems.** On multi-step calculations, the model may make arithmetic or counting errors without a scratchpad to catch them. This was observed directly in qualitative testing (e.g., miscounting combinatorial outcomes). Answers to high-stakes mathematical problems should be independently verified.

**Eval precision asymmetry.** The benchmark comparison above evaluated the base model in 4-bit and Atem-0.6B in bfloat16; see the Evaluation section for details. A precision-matched re-run would give a cleaner comparison.

---

## Roadmap

Atem-0.6B establishes the Stage 1 foundation for the Qwen3-based branch of the Atem family. Planned next steps:

- **Stage 2 (in progress):** Atem-Savant-0.6B, a LoRA SFT run on curated chain-of-thought data (~90% think-trace records, ~10% no-think) using OpenR1-Math, Kimi-K2.5, DeepSeek-V4-Pro-Reasoning, OpenCodeReasoning, and trace-inversion datasets, to recover multi-step reasoning depth on top of Stage 1's direct-answer foundation
- **Extended benchmarks:** MMLU, BBH, IFEval post-Stage 2
- **Precision-matched re-benchmark:** re-run base vs Atem comparison under identical 4-bit (or identical bf16) conditions

---

## Citation

```bibtex
@misc{atem_06b_2026,
  author       = {Asad, Zain},
  title        = {Atem-0.6B: A 0.6B Direct-Reasoning Model via Stage 1 SFT on Qwen3},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-0.6B}},
}
```

---

## License

Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base model Qwen/Qwen3-0.6B.

---

Built independently by Zain Asad - [EphAsad](https://huggingface.co/EphAsad)