---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- reasoning
- fine-tuned
- qwen2.5
- math
- science
- code
- chain-of-thought
- unsloth
datasets:
- open-thoughts/OpenThoughts3-1.2M
- bespokelabs/Bespoke-Stratos-17k
pipeline_tag: text-generation
---
# Aristaeus
**Aristaeus** is a fine-tuned version of [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), trained to improve structured, step-by-step reasoning across mathematics, science, logic, and code. It is a Stage 1 reasoning model — the goal of this release is deliberate, verifiable chain-of-thought, not raw benchmark maximisation.
The name comes from Aristaeus, the ancient Greek deity of practical knowledge — beekeeping, olive cultivation, cheesemaking. Applied intelligence in service of real things.
---
## Training
| Detail | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Fine-tune type | Full fine-tune (bf16) |
| Hardware | NVIDIA A100-SXM4-40GB |
| Training time | ~81 minutes |
| Epochs | 2 |
| Sequence length | 4096 tokens |
| Effective batch size | 16 (batch 2 × grad accum 8) |
| Learning rate | 2e-5 (cosine schedule) |
| Warmup ratio | 0.05 |
| Framework | Unsloth + TRL SFTTrainer |
| Final train loss | 1.083 |
| Final eval loss | 1.023 |
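The training script is not published with this card; the sketch below shows how the hyperparameters above would map onto an Unsloth + TRL `SFTTrainer` run. It is illustrative only: the dataset line is a stand-in, `output_dir` is invented, and keyword names such as `full_finetuning` vary across Unsloth/TRL versions.

```python
# Illustrative sketch only, not the released training script.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load the base model at the 4096-token sequence length from the table above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=4096,
    dtype=None,            # auto-selects bf16 on an A100
    full_finetuning=True,  # full fine-tune rather than LoRA (newer Unsloth)
)

# Stand-in: the real run used the ~47k combined examples described in the
# Datasets section below, mapped to a chat format TRL can consume.
train_dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # named `processing_class` in newer TRL versions
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="aristaeus-sft",     # invented path
        per_device_train_batch_size=2,  # batch 2 ...
        gradient_accumulation_steps=8,  # ... x grad accum 8 = effective 16
        num_train_epochs=2,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        bf16=True,
    ),
)
trainer.train()
```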
### Datasets
**[open-thoughts/OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)** — 30,000 examples sampled via streaming. Reasoning traces generated by QwQ-32B (Apache 2.0). Covers mathematics, science, and coding problems with long chain-of-thought traces.
**[bespokelabs/Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k)** — Full 16,710 examples. Curated from AIME/MATH olympiad problems, competitive programming (APPS, TACO), and science/puzzle data. Reasoning traces generated from DeepSeek-R1 via local inference.
Combined training set: ~47,000 examples after normalisation and filtering. Both datasets were selected for clean licensing (no API-generated outputs from closed models).
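For reference, the sampling step follows the standard Hugging Face `datasets` streaming pattern sketched below; the release's actual normalisation and filtering are not reproduced here.

```python
# Illustrative sketch of the dataset assembly; the normalisation and filtering
# applied for the actual release are omitted.
from datasets import Dataset, concatenate_datasets, load_dataset

# Stream OpenThoughts3 and take 30,000 examples without downloading all 1.2M rows.
ot3_stream = load_dataset("open-thoughts/OpenThoughts3-1.2M", split="train", streaming=True)
ot3_sample = Dataset.from_list(list(ot3_stream.take(30_000)))

# Bespoke-Stratos-17k is small enough to load in full (16,710 rows).
stratos = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

# ~47k examples total. Assumes both datasets have first been mapped to a shared
# column schema, since concatenate_datasets requires matching features.
combined = concatenate_datasets([ot3_sample, stratos])
```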
---
## Evaluation
Aristaeus was compared against the base Qwen2.5-1.5B-Instruct across six reasoning tasks covering different problem types. Results below are from manual evaluation — no automated benchmark harness was used for this release.
| Task | Aristaeus | Base |
|---|---|---|
| Unit conversion (train speed km → m/s) | ✅ Correct | ❌ Wrong (unit tracking failure) |
| Multi-step word problem (apples) | ✅ Correct | ✅ Correct |
| Deductive logic (mammals/warm-blooded) | ⚠️ Correct answer, minor overreach | ✅ Correct, richer detail |
| Recursive code trace (Fibonacci f(7)) | ❌ Lost thread, no answer | ✅ Correct (13) |
| Exponential growth (bacterial doubling) | ✅ Correct (6400) | ✅ Correct (6400) |
| Spatial constraint reasoning (water jug) | ✅ Correct, includes verification | ❌ Incoherent final steps |
**3 wins / 1 loss / 2 draws** against base on this task set.
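(As a sanity check on the doubling task: 2 hours is six 20-minute intervals, so 100 × 2⁶ = 6400 cells. The same problem is reused in the usage example below.)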
### Honest limitations
**Recursive call-stack tracing** is the clearest failure mode. On the Fibonacci `f(7)` trace, Aristaeus lost track of the recursion depth, began questioning its own assumptions, and produced no final answer; the base model handled it correctly. This is consistent with a known capacity ceiling at 1.5B parameters for problems that require holding many simultaneous state variables. A 7B model would likely not exhibit this failure.
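For context, the task asks the model to hand-trace a naive recursive Fibonacci of roughly the shape below. The exact prompt wording is not reproduced in this card; the one-based indexing is inferred from the expected answer of 13.

```python
# Reference implementation of the traced function. Indexing is assumed:
# f(1) = f(2) = 1, which is consistent with the expected answer f(7) = 13.
def f(n: int) -> int:
    if n <= 2:
        return 1
    return f(n - 1) + f(n - 2)

print(f(7))  # 1, 1, 2, 3, 5, 8, 13 -> prints 13
```

Tracing this by hand expands into 25 recursive calls in total, which is exactly the kind of simultaneous state the paragraph above describes.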
**Logical overconfidence** was observed on the deductive reasoning prompt. The model correctly concluded dolphins are warm-blooded, but also asserted snakes are cold-blooded purely from the premise "snakes are not mammals" — which does not logically follow without additional premises. The model has learned to produce confident, structured conclusions, which occasionally leads it to state more than the premises support. This is a known SFT artefact when training data rewards assertive, well-formatted responses.
The eval loss curve plateaued convincingly from step ~2800 onward, suggesting the model saturated the current dataset. Additional epochs would not improve this release.
---
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EphAsad/Aristaeus", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("EphAsad/Aristaeus")

messages = [
    {"role": "system", "content": "You are a helpful reasoning assistant."},
    {"role": "user", "content": "A bacterial culture starts with 100 cells and doubles every 20 minutes. How many cells after 2 hours?"},
]

# Build the prompt with the Qwen chat template, then generate.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_p=0.9, do_sample=True)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
---
## Roadmap
Aristaeus is a Stage 1 release. Two further stages are planned:
**Stage 2 — Agentic tool use.** Fine-tuning on `lambda/hermes-agent-reasoning-traces` (Apache 2.0, agentic trajectories with `<think>` blocks and real tool execution results) at 16k context. The intention is to teach the model *when* and *how* to use tools, layered on top of the reasoning foundation established here.
---
## Author
Built by **Zain Asad** (Eph) — Senior Microbiology Analyst and Applied AI Engineer.
Core portfolio: [BactAID](https://doi.org/10.5281/zenodo.18089381) · [DomainEmbedder](https://huggingface.co/EphAsad/DomainEmbedder) · FireSOP · FireAccess LIMS · Eidos · Ananke
---
## Licence
Apache 2.0 — consistent with the base model and training datasets used.