---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- reasoning
- fine-tuned
- qwen2.5
- math
- science
- code
- chain-of-thought
- unsloth
datasets:
- open-thoughts/OpenThoughts3-1.2M
- bespokelabs/Bespoke-Stratos-17k
pipeline_tag: text-generation
---

# Aristaeus

**Aristaeus** is a fine-tuned version of [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), trained to improve structured, step-by-step reasoning across mathematics, science, logic, and code. It is a Stage 1 reasoning model — the goal of this release is deliberate, verifiable chain-of-thought, not raw benchmark maximisation.

The name comes from Aristaeus, the ancient Greek deity of practical knowledge — beekeeping, olive cultivation, cheesemaking. Applied intelligence in service of real things.

---

## Training

| Detail | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Fine-tune type | Full fine-tune (bf16) |
| Hardware | NVIDIA A100-SXM4-40GB |
| Training time | ~81 minutes |
| Epochs | 2 |
| Sequence length | 4096 tokens |
| Effective batch size | 16 (batch 2 × grad accum 8) |
| Learning rate | 2e-5 (cosine schedule) |
| Warmup ratio | 0.05 |
| Framework | Unsloth + TRL SFTTrainer |
| Final train loss | 1.083 |
| Final eval loss | 1.023 |

### Datasets

**[open-thoughts/OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)** — 30,000 examples sampled via streaming. Reasoning traces generated by QwQ-32B (Apache 2.0). Covers mathematics, science, and coding problems with long chain-of-thought traces.

**[bespokelabs/Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k)** — Full 16,710 examples. Curated from AIME/MATH olympiad problems, competitive programming (APPS, TACO), and science/puzzle data. Reasoning traces generated from DeepSeek-R1 via local inference.

Combined training set: ~47,000 examples after normalisation and filtering.
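As a quick sanity check, the effective batch size and approximate optimiser step counts follow directly from the figures in the training table above (the ~47,000-example count is approximate, so the step counts are too):

```python
import math

# Values taken from the training table
per_device_batch = 2
grad_accum = 8
epochs = 2
dataset_size = 47_000  # combined examples after normalisation/filtering (approximate)

effective_batch = per_device_batch * grad_accum        # 2 × 8 = 16
steps_per_epoch = math.ceil(dataset_size / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # 16
print(steps_per_epoch)   # 2938
print(total_steps)       # 5876
```

Note that one epoch is roughly 2,900 optimiser steps at this batch size, which is useful context when reading the loss-curve discussion below.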
Both datasets were selected for clean licensing (no API-generated outputs from closed models).

---

## Evaluation

Aristaeus was compared against the base Qwen2.5-1.5B-Instruct across six reasoning tasks covering different problem types. Results below are from manual evaluation — no automated benchmark harness was used for this release.

| Task | Aristaeus | Base |
|---|---|---|
| Unit conversion (train speed km/h → m/s) | ✅ Correct | ❌ Wrong (unit tracking failure) |
| Multi-step word problem (apples) | ✅ Correct | ✅ Correct |
| Deductive logic (mammals/warm-blooded) | ⚠️ Correct answer, minor overreach | ✅ Correct, richer detail |
| Recursive code trace (Fibonacci f(7)) | ❌ Lost thread, no answer | ✅ Correct (13) |
| Exponential growth (bacterial doubling) | ✅ Correct (6400) | ✅ Correct (6400) |
| Spatial constraint reasoning (water jug) | ✅ Correct, includes verification | ❌ Incoherent final steps |

**3 wins / 1 loss / 2 draws** against base on this task set.

### Honest limitations

**Recursive call stack tracing** is the clearest failure mode. On `f(7)` Fibonacci, Aristaeus lost track of the recursion depth, began questioning its own assumptions, and produced no final answer. The base model handled it correctly. This is consistent with a known capacity ceiling at 1.5B parameters for problems that require holding many simultaneous state variables. A 7B model would likely not exhibit this failure.

**Logical overconfidence** was observed on the deductive reasoning prompt. The model correctly concluded that dolphins are warm-blooded, but also asserted that snakes are cold-blooded purely from the premise "snakes are not mammals" — which does not logically follow without additional premises. The model has learned to produce confident, structured conclusions, which occasionally leads it to state more than the premises support. This is a known SFT artefact when training data rewards assertive, well-formatted responses.
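For reference, the expected answers for the two arithmetic-checkable tasks above can be verified in a few lines. This is a minimal sketch, assuming the usual Fibonacci base case f(1) = f(2) = 1, which is consistent with the correct answer of 13 quoted in the table:

```python
def f(n: int) -> int:
    """Naive recursive Fibonacci, as in the code-trace task (f(1) = f(2) = 1)."""
    return 1 if n <= 2 else f(n - 1) + f(n - 2)

# Recursive code trace task: f(7)
print(f(7))  # 13

# Exponential growth task: 100 cells doubling every 20 minutes, over 2 hours
doublings = (2 * 60) // 20   # 6 doublings in 120 minutes
print(100 * 2 ** doublings)  # 6400
```

The naive recursion is exactly the call tree the models were asked to trace by hand, which is what makes f(7) a useful probe of state tracking at this scale.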
The eval loss curve plateaued convincingly from step ~2800 onward, suggesting the model saturated the current dataset. Additional epochs would not improve this release.

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EphAsad/Aristaeus")
tokenizer = AutoTokenizer.from_pretrained("EphAsad/Aristaeus")

messages = [
    {"role": "system", "content": "You are a helpful reasoning assistant."},
    {"role": "user", "content": "A bacterial culture starts with 100 cells and doubles every 20 minutes. How many cells after 2 hours?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_p=0.9, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

---

## Roadmap

Aristaeus is a Stage 1 release. Two further stages are planned:

**Stage 2 — Agentic tool use.** Fine-tuning on `lambda/hermes-agent-reasoning-traces` (Apache 2.0, agentic trajectories with `` blocks and real tool execution results) at 16k context. The intention is to teach the model *when* and *how* to use tools, layered on top of the reasoning foundation established here.

---

## Author

Built by **Zain Asad** (Eph) — Senior Microbiology Analyst and Applied AI Engineer.

Core portfolio: [BactAID](https://doi.org/10.5281/zenodo.18089381) · [DomainEmbedder](https://huggingface.co/EphAsad/DomainEmbedder) · FireSOP · FireAccess LIMS · Eidos · Ananke

---

## Licence

Apache 2.0 — consistent with the base model and training datasets used.