---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- reasoning
- fine-tuned
- qwen2.5
- math
- science
- code
- chain-of-thought
- unsloth
datasets:
- open-thoughts/OpenThoughts3-1.2M
- bespokelabs/Bespoke-Stratos-17k
pipeline_tag: text-generation
---

# Aristaeus

Aristaeus is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct, trained to improve structured, step-by-step reasoning across mathematics, science, logic, and code. It is a Stage 1 reasoning model — the goal of this release is deliberate, verifiable chain-of-thought, not raw benchmark maximisation.

The name comes from Aristaeus, the ancient Greek deity of practical knowledge — beekeeping, olive cultivation, cheesemaking. Applied intelligence in service of real things.


## Training

| Detail | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Fine-tune type | Full fine-tune (bf16) |
| Hardware | NVIDIA A100-SXM4-40GB |
| Training time | ~81 minutes |
| Epochs | 2 |
| Sequence length | 4096 tokens |
| Effective batch size | 16 (batch 2 × grad accum 8) |
| Learning rate | 2e-5 (cosine schedule) |
| Warmup ratio | 0.05 |
| Framework | Unsloth + TRL SFTTrainer |
| Final train loss | 1.083 |
| Final eval loss | 1.023 |
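
A minimal sketch of what this configuration looks like in code, assuming recent Unsloth and TRL APIs. The exact training script for this release is not published, and `train_dataset` is a placeholder for the prepared dataset described in the next section:

```python
# Hedged sketch of the training configuration in the table above.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=4096,
    dtype=None,            # auto-selects bf16 on Ampere GPUs such as the A100
    full_finetuning=True,  # full fine-tune rather than LoRA
)

train_dataset = ...  # placeholder: combined OpenThoughts3 + Bespoke-Stratos set (see Datasets)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # `processing_class` in newer TRL versions
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="aristaeus-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch size 16
        num_train_epochs=2,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        bf16=True,
    ),
)
trainer.train()
```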

## Datasets

- **open-thoughts/OpenThoughts3-1.2M** — 30,000 examples sampled via streaming (a sketch of this appears below). Reasoning traces generated by QwQ-32B (Apache 2.0). Covers mathematics, science, and coding problems with long chain-of-thought traces.
- **bespokelabs/Bespoke-Stratos-17k** — all 16,710 examples. Curated from AIME/MATH olympiad problems, competitive programming (APPS, TACO), and science/puzzle data. Reasoning traces generated from DeepSeek-R1 via local inference.

Combined training set: ~47,000 examples after normalisation and filtering. Both datasets were selected for clean licensing (no API-generated outputs from closed models).
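
A minimal sketch of the streaming-sampling approach, assuming the Hugging Face `datasets` library; the exact sampling code and selection criteria for this release are not published:

```python
# Hedged sketch: draw a 30,000-example subset from the 1.2M-example dataset
# without downloading it in full, using streaming mode. Selection here is
# simply the first 30k streamed rows; the release's actual sampling may differ.
from itertools import islice
from datasets import Dataset, load_dataset

stream = load_dataset("open-thoughts/OpenThoughts3-1.2M", split="train", streaming=True)
subset = Dataset.from_list(list(islice(stream, 30_000)))
```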


## Evaluation

Aristaeus was compared against the base Qwen2.5-1.5B-Instruct across six reasoning tasks covering different problem types. Results below are from manual evaluation — no automated benchmark harness was used for this release.

| Task | Aristaeus | Base |
|---|---|---|
| Unit conversion (train speed, km/h → m/s) | Correct | Wrong (unit tracking failure) |
| Multi-step word problem (apples) | Correct | Correct |
| Deductive logic (mammals/warm-blooded) | ⚠️ Correct answer, minor overreach | Correct, richer detail |
| Recursive code trace (Fibonacci f(7)) | Lost thread, no answer | Correct (13) |
| Exponential growth (bacterial doubling) | Correct (6400) | Correct (6400) |
| Spatial constraint reasoning (water jug) | Correct, includes verification | Incoherent final steps |

Overall: 3 wins / 1 loss / 2 draws against the base model on this task set.

## Honest limitations

Recursive call stack tracing is the clearest failure mode. On f(7) Fibonacci, Aristaeus lost track of the recursion depth, began questioning its own assumptions, and produced no final answer. The base model handled it correctly. This is consistent with a known capacity ceiling at 1.5B parameters for problems that require holding many simultaneous state variables. A 7B model would likely not exhibit this failure.
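
For reference, the recursion in question is presumably the standard 1-indexed definition, which matches the base model's answer of 13; the exact prompt wording is an assumption:

```python
def f(n):
    # Standard 1-indexed Fibonacci: f(1) = f(2) = 1
    if n <= 2:
        return 1
    return f(n - 1) + f(n - 2)

print(f(7))  # 13 — evaluating this naively expands into 25 nested calls,
             # which is the state the 1.5B model failed to track
```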

Logical overconfidence was observed on the deductive reasoning prompt. The model correctly concluded dolphins are warm-blooded, but also asserted snakes are cold-blooded purely from the premise "snakes are not mammals" — which does not logically follow without additional premises. The model has learned to produce confident, structured conclusions, which occasionally leads it to state more than the premises support. This is a known SFT artefact when training data rewards assertive, well-formatted responses.

The eval loss curve plateaued convincingly from step ~2800 onward, suggesting the model saturated the current dataset. Additional epochs would not improve this release.


## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EphAsad/Aristaeus")
tokenizer = AutoTokenizer.from_pretrained("EphAsad/Aristaeus")

messages = [
    {"role": "system", "content": "You are a helpful reasoning assistant."},
    {"role": "user",   "content": "A bacterial culture starts with 100 cells and doubles every 20 minutes. How many cells after 2 hours?"},
]

# Render the conversation with the model's chat template, then tokenize.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Moderate temperature keeps the chain-of-thought focused while allowing some variation.
output = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_p=0.9, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
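
For this prompt the expected answer is 6,400 cells: two hours contain six 20-minute doubling periods, so 100 × 2⁶ = 6,400, matching the evaluation result above.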

## Roadmap

Aristaeus is a Stage 1 release. Further stages are planned:

**Stage 2 — Agentic tool use.** Fine-tuning on lambda/hermes-agent-reasoning-traces (Apache 2.0, agentic trajectories with `<think>` blocks and real tool execution results) at 16k context. The intention is to teach the model when and how to use tools, layered on top of the reasoning foundation established here.
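
For illustration, a trajectory in that style might look like the following. This is a hypothetical sketch of the format; the message structure and tool names are assumptions, not the dataset's actual schema:

```python
# Hypothetical agentic trajectory: a <think> block precedes a tool call,
# and a real tool execution result is fed back before the final answer.
trajectory = [
    {"role": "user", "content": "What is 37 * 41?"},
    {"role": "assistant", "content": (
        "<think>Large multiplication; better to call the calculator tool "
        "than to compute it mentally.</think>\n"
        'calculator(expression="37 * 41")'
    )},
    {"role": "tool", "content": "1517"},
    {"role": "assistant", "content": "37 × 41 = 1517."},
]
```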


## Author

Built by Zain Asad (Eph) — Senior Microbiology Analyst and Applied AI Engineer.

Core portfolio: BactAID · DomainEmbedder · FireSOP · FireAccess LIMS · Eidos · Ananke


## Licence

Apache 2.0 — consistent with the base model and training datasets used.
