---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
language:
- en
pipeline_tag: text-generation
---
# Lily 1.5B — v0.1
Lily is a fine-tuned 1.5B-parameter language model built on Qwen 2.5 1.5B Instruct. It is trained to reason explicitly before answering: every response includes a visible thinking step inside `<think>` tags, followed by the final answer inside `<answer>` tags.
The model is optimized for precision and structured output. It stays direct, avoids filler phrases, and scales response depth to the complexity of the question.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Context length | 4096 tokens |
| Fine-tuning | Supervised fine-tuning (SFT) on chain-of-thought formatted data |
| Output format | `<think>...</think>` reasoning + `<answer>...</answer>` final response |
| License | Apache 2.0 |
## Output Format
Every response from Lily follows this structure:
```
<think>
[Step-by-step reasoning, working through the problem before committing to an answer]
</think>
<answer>
[Final response — structured, precise, and direct]
</answer>
```
The `<think>` block is Lily's scratchpad: it plans, evaluates, and drafts before producing the answer. This makes the model's reasoning transparent and auditable.
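If you need the two parts separately (for grading or logging, say), they can be split with a regular expression. The helper below is not part of any official tooling for this model, just a minimal sketch of one way to do it:

```python
import re

def split_response(text: str) -> tuple[str, str]:
    """Split a Lily response into (reasoning, answer) strings."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    # Fall back gracefully if a block is missing -- v0.1 format
    # consistency is not guaranteed (see Limitations below).
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final
```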
## Quick Start
### Transformers (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "abhinav0231/Lily-1.5b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are Lily, a precise and thoughtful AI assistant.\n\n"
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags.\n\n"
    "When answering:\n"
    "- Be thorough: cover all relevant aspects, not just the surface question\n"
    "- Be specific: use exact values, names, and examples rather than vague generalities\n"
    "- Structure long responses with markdown headers, code blocks, and lists where appropriate\n"
    "- Lead with the most important information first\n"
    "- Match the depth of your answer to the complexity of the question\n\n"
    "Tone: direct and confident. Never use filler phrases like \"Certainly!\", "
    "\"Great question!\", or \"Of course!\". Be helpful without being sycophantic."
)

def ask(question, max_new_tokens=512):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    return response

print(ask("What is the difference between a list and a tuple in Python?"))
```
### GGUF (llama.cpp / Ollama / LM Studio)
Quantized GGUF versions are available at `abhinav0231/Lily-1.5b-v0.1-GGUF`.
| Quant | Size | Use case |
|---|---|---|
| `Q4_K_M` | ~1.0 GB | Best balance of speed and quality for CPU inference |
| `Q5_K_M` | ~1.2 GB | Better quality, still fast on CPU |
| `Q8_0` | ~1.6 GB | Near-lossless, recommended if VRAM/RAM allows |
| `F16` | ~3.1 GB | Full precision, GPU only |
#### llama.cpp
```bash
# Download a quant
huggingface-cli download abhinav0231/Lily-1.5b-v0.1-GGUF \
  Lily-1.5b-v0.1-Q4_K_M.gguf \
  --local-dir ./

# Run the server
./llama.cpp/build/bin/llama-server \
  -m Lily-1.5b-v0.1-Q4_K_M.gguf \
  --ctx-size 4096 \
  --port 8080
```
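Once running, llama-server exposes an OpenAI-compatible endpoint. A minimal client sketch (assumes the `openai` package is installed; the `model` field is ignored by llama-server, and `SYSTEM_PROMPT` is the string from the Transformers example above):

```python
from openai import OpenAI

# Point the client at the local llama-server; the api_key is a dummy value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="lily",  # ignored by llama-server, required by the client
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is a race condition?"},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```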
#### Ollama
```bash
# Create a Modelfile (Ollama requires triple quotes for multiline SYSTEM strings)
cat > Modelfile << 'EOF'
FROM ./Lily-1.5b-v0.1-Q4_K_M.gguf
SYSTEM """You are Lily, a precise and thoughtful AI assistant.
Always reason step by step inside <think></think> tags, then write your final answer inside <answer></answer> tags.
When answering:
- Be thorough: cover all relevant aspects, not just the surface question
- Be specific: use exact values, names, and examples rather than vague generalities
- Structure long responses with markdown headers, code blocks, and lists where appropriate
- Lead with the most important information first
- Match the depth of your answer to the complexity of the question
Tone: direct and confident. Never use filler phrases like "Certainly!", "Great question!", or "Of course!". Be helpful without being sycophantic."""
EOF

# Build and run
ollama create lily -f Modelfile
ollama run lily "Explain how transformers work"
```
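For programmatic use, the same model can be reached through Ollama's local REST API. A sketch, assuming Ollama is running on its default port 11434 and the `lily` model was created with the Modelfile above:

```python
import requests

# Non-streaming chat request to the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "lily",
        "messages": [
            {"role": "user", "content": "Explain how transformers work"},
        ],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["message"]["content"])
```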
## System Prompt
The system prompt below is embedded in the model's chat template and applied automatically when using `apply_chat_template`. You do not need to set it manually if using the Transformers pipeline; it is already the default.
The critical sentence that triggers the `<think>`/`<answer>` format, kept verbatim from training, is:

> Always reason step by step inside `<think></think>` tags, then write your final answer inside `<answer></answer>` tags.
The rest of the system prompt shapes tone and response quality and can be overridden by passing a custom system message.
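For example, a custom system message can reshape the persona while keeping the format-critical sentence intact. A sketch, reusing the `tokenizer` and `model` objects from Quick Start (the reviewer persona here is purely illustrative):

```python
# Override the persona but keep the sentence that triggers the format.
custom_system = (
    "You are a terse code reviewer. "
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags."
)
messages = [
    {"role": "system", "content": custom_system},
    {"role": "user", "content": "Review this function for off-by-one errors."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Generation then proceeds exactly as in the ask() helper above.
```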
## Intended Use
Lily is a general-purpose assistant fine-tune. It performs well on:
- Reasoning and logic problems
- Code explanation and generation
- Structured question answering
- Step-by-step problem solving
The explicit `<think>` step makes it especially useful in applications where reasoning transparency matters: grading, debugging, tutoring, or any workflow where you need to see why the model gave a particular answer, not just the answer itself.
## Limitations
- **1.5B parameters:** Not suited for tasks requiring broad world knowledge or long multi-document context
- **v0.1:** Early release; output quality and format consistency will improve in future versions
- **English primary:** Training data is predominantly English; multilingual performance is limited
- **No tool use / function calling:** This version does not support structured tool call outputs
## Training
Lily was produced by supervised fine-tuning of `Qwen/Qwen2.5-1.5B-Instruct` on a dataset of chain-of-thought formatted examples, each using the `<think>`/`<answer>` output structure. Training was performed on a single T4 GPU via Google Colab.
## Citation
If you use Lily in research or a project, please cite:
```bibtex
@misc{lily-1.5b-v0.1,
  author    = {abhinav0231},
  title     = {Lily 1.5B v0.1: A chain-of-thought fine-tune of Qwen 2.5 1.5B},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/abhinav0231/Lily-1.5b-v0.1}
}
```
## License
Apache 2.0 — see LICENSE.