---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- qwen2.5
- chain-of-thought
- reasoning
- fine-tuned
- gguf
language:
- en
pipeline_tag: text-generation
---

# Lily 1.5B — v0.1

Lily is a fine-tuned 1.5B-parameter language model built on [Qwen 2.5 1.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct). It is trained to reason explicitly before answering: every response includes a visible thinking step inside `<think>` tags, followed by the final answer inside `<answer>` tags.

The model is optimized for precision and structured output. It stays direct, avoids filler phrases, and scales response depth to the complexity of the question.

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Parameters** | 1.5B |
| **Context length** | 4096 tokens |
| **Fine-tuning** | Supervised fine-tuning (SFT) on chain-of-thought formatted data |
| **Output format** | `<think>...</think>` reasoning + `<answer>...</answer>` final response |
| **License** | Apache 2.0 |

---

## Output Format

Every response from Lily follows this structure:

```
<think>
[Step-by-step reasoning, working through the problem before committing to an answer]
</think>
<answer>
[Final response: structured, precise, and direct]
</answer>
```

The `<think>` block is Lily's scratchpad: it plans, evaluates, and drafts before producing the answer. This makes the model's reasoning transparent and auditable.

---

## Quick Start

### Transformers (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "abhinav0231/Lily-1.5b-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are Lily, a precise and thoughtful AI assistant.\n\n"
    "Always reason step by step inside <think> tags, "
    "then write your final answer inside <answer> tags.\n\n"
    "When answering:\n"
    "- Be thorough: cover all relevant aspects, not just the surface question\n"
    "- Be specific: use exact values, names, and examples rather than vague generalities\n"
    "- Structure long responses with markdown headers, code blocks, and lists where appropriate\n"
    "- Lead with the most important information first\n"
    "- Match the depth of your answer to the complexity of the question\n\n"
    "Tone: direct and confident. Never use filler phrases like \"Certainly!\", "
    "\"Great question!\", or \"Of course!\". Be helpful without being sycophantic."
)

def ask(question, max_new_tokens=512):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    return response

print(ask("What is the difference between a list and a tuple in Python?"))
```
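Because the reasoning and the answer live inside fixed tags, a caller can separate them programmatically. Below is a minimal sketch written for this card (`split_response` is a hypothetical helper, not part of the model or its tooling) that extracts both parts from a completion produced by the `ask` function above, falling back to the raw text when a tag is missing, since v0.1 does not guarantee perfect format consistency:

```python
import re

def split_response(text: str) -> dict:
    """Split a Lily completion into its <think> and <answer> parts."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        # Reasoning scratchpad, useful for auditing
        "think": think.group(1).strip() if think else "",
        # Final answer; fall back to the whole completion if the tag is absent
        "answer": answer.group(1).strip() if answer else text.strip(),
    }

parts = split_response(ask("Is 128 a power of two?"))
print(parts["answer"])  # show the user only the final answer
print(parts["think"])   # inspect the reasoning separately
```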
---

## GGUF (llama.cpp / Ollama / LM Studio)

Quantized GGUF versions are available at [abhinav0231/Lily-1.5b-v0.1-GGUF](https://huggingface.co/abhinav0231/Lily-1.5b-v0.1-GGUF).

| Quant | Size | Use case |
|---|---|---|
| `Q4_K_M` | ~1.0 GB | Best balance of speed and quality for CPU inference |
| `Q5_K_M` | ~1.2 GB | Better quality, still fast on CPU |
| `Q8_0` | ~1.6 GB | Near-lossless, recommended if VRAM/RAM allows |
| `F16` | ~3.1 GB | Full precision, GPU only |

### llama.cpp

```bash
# Download a quant
huggingface-cli download abhinav0231/Lily-1.5b-v0.1-GGUF \
  Lily-1.5b-v0.1-Q4_K_M.gguf \
  --local-dir ./

# Run the server
./llama.cpp/build/bin/llama-server \
  -m Lily-1.5b-v0.1-Q4_K_M.gguf \
  --ctx-size 4096 \
  --port 8080
```

### Ollama

```bash
# Create a Modelfile (multiline SYSTEM values use triple quotes)
cat > Modelfile << 'EOF'
FROM ./Lily-1.5b-v0.1-Q4_K_M.gguf
SYSTEM """You are Lily, a precise and thoughtful AI assistant.

Always reason step by step inside <think> tags, then write your final answer inside <answer> tags.

When answering:
- Be thorough: cover all relevant aspects, not just the surface question
- Be specific: use exact values, names, and examples rather than vague generalities
- Structure long responses with markdown headers, code blocks, and lists where appropriate
- Lead with the most important information first
- Match the depth of your answer to the complexity of the question

Tone: direct and confident. Never use filler phrases like "Certainly!", "Great question!", or "Of course!". Be helpful without being sycophantic."""
EOF

# Build and run
ollama create lily -f Modelfile
ollama run lily "Explain how transformers work"
```

---

## System Prompt

The system prompt shown in the Quick Start example is embedded in the model's chat template and applied automatically when using `apply_chat_template`. You do **not** need to set it manually when using the Transformers pipeline; it is already the default.

The critical sentence that triggers the `<think>`/`<answer>` format, kept verbatim from training, is:

> *Always reason step by step inside `<think>` tags, then write your final answer inside `<answer>` tags.*

The rest of the system prompt shapes tone and response quality and can be overridden by passing a custom `system` message.

---

## Intended Use

Lily is a general-purpose assistant fine-tune. It performs well on:

- Reasoning and logic problems
- Code explanation and generation
- Structured question answering
- Step-by-step problem solving

The explicit `<think>` step makes it especially useful in applications where reasoning transparency matters: grading, debugging, tutoring, or any workflow where you need to see *why* the model gave a particular answer, not just the answer itself.

---

## Limitations

- **1.5B parameters**: Not suited for tasks requiring broad world knowledge or long multi-document context
- **v0.1**: Early release; output quality and format consistency will improve in future versions
- **English primary**: Training data is predominantly English; multilingual performance is limited
- **No tool use / function calling**: This version does not support structured tool-call outputs

---

## Training

Fine-tuned from `Qwen/Qwen2.5-1.5B-Instruct` with supervised fine-tuning on a dataset of chain-of-thought formatted examples, each using the `<think>`/`<answer>` output structure. Training was performed on a single T4 GPU via Google Colab.

---

## Citation

If you use Lily in research or a project, please cite:

```
@misc{lily-1.5b-v0.1,
  author = {abhinav0231},
  title = {Lily 1.5B v0.1: A chain-of-thought fine-tune of Qwen 2.5 1.5B},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/abhinav0231/Lily-1.5b-v0.1}
}
```

---

## License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0).