---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
  - qwen2.5
  - chain-of-thought
  - reasoning
  - fine-tuned
  - gguf
language:
  - en
pipeline_tag: text-generation
---

# Lily 1.5B — v0.1

Lily is a fine-tuned 1.5B parameter language model built on Qwen 2.5 1.5B Instruct. It is trained to reason explicitly before answering — every response includes a visible thinking step inside `<think>` tags, followed by the final answer inside `<answer>` tags.

The model is optimized for precision and structured output. It stays direct, avoids filler phrases, and scales response depth to the complexity of the question.


## Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Context length | 4096 tokens |
| Fine-tuning | Supervised fine-tuning (SFT) on chain-of-thought formatted data |
| Output format | `<think>...</think>` reasoning + `<answer>...</answer>` final response |
| License | Apache 2.0 |

## Output Format

Every response from Lily follows this structure:

```text
<think>
[Step-by-step reasoning, working through the problem before committing to an answer]
</think>
<answer>
[Final response — structured, precise, and direct]
</answer>
```

The `<think>` block is Lily's scratchpad — it plans, evaluates, and drafts before producing the answer. This makes the model's reasoning transparent and auditable.
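
Because the format is fixed, the two blocks are easy to separate programmatically. A minimal parsing sketch (the `split_response` helper is illustrative, not part of any shipped API):

```python
import re

def split_response(text: str) -> tuple[str, str]:
    """Split a Lily response into (reasoning, final answer)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    # Fall back to the raw text if the answer tag is missing (possible in v0.1).
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final

reasoning, final = split_response("<think>2 + 2 = 4</think>\n<answer>4</answer>")
assert final == "4"
```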


## Quick Start

### Transformers (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "abhinav0231/Lily-1.5b-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are Lily, a precise and thoughtful AI assistant.\n\n"
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags.\n\n"
    "When answering:\n"
    "- Be thorough: cover all relevant aspects, not just the surface question\n"
    "- Be specific: use exact values, names, and examples rather than vague generalities\n"
    "- Structure long responses with markdown headers, code blocks, and lists where appropriate\n"
    "- Lead with the most important information first\n"
    "- Match the depth of your answer to the complexity of the question\n\n"
    "Tone: direct and confident. Never use filler phrases like \"Certainly!\", "
    "\"Great question!\", or \"Of course!\". Be helpful without being sycophantic."
)

def ask(question, max_new_tokens=512):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    return response

print(ask("What is the difference between a list and a tuple in Python?"))
```

### GGUF (llama.cpp / Ollama / LM Studio)

Quantized GGUF versions are available at abhinav0231/Lily-1.5b-v0.1-GGUF.

| Quant | Size | Use case |
| --- | --- | --- |
| Q4_K_M | ~1.0 GB | Best balance of speed and quality for CPU inference |
| Q5_K_M | ~1.2 GB | Better quality, still fast on CPU |
| Q8_0 | ~1.6 GB | Near-lossless, recommended if VRAM/RAM allows |
| F16 | ~3.1 GB | Full precision, GPU only |
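
If you prefer Python to the CLI shown below, the same files can be fetched with `huggingface_hub` (a minimal sketch; the filename matches the Q4_K_M quant used in the examples):

```python
from huggingface_hub import hf_hub_download

# Download the Q4_K_M quant into the current directory.
path = hf_hub_download(
    repo_id="abhinav0231/Lily-1.5b-v0.1-GGUF",
    filename="Lily-1.5b-v0.1-Q4_K_M.gguf",
    local_dir=".",
)
print(path)
```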

#### llama.cpp

```bash
# Download a quant
huggingface-cli download abhinav0231/Lily-1.5b-v0.1-GGUF \
    Lily-1.5b-v0.1-Q4_K_M.gguf \
    --local-dir ./

# Run the server
./llama.cpp/build/bin/llama-server \
    -m Lily-1.5b-v0.1-Q4_K_M.gguf \
    --ctx-size 4096 \
    --port 8080
```
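
`llama-server` exposes an OpenAI-compatible chat endpoint, so once it is up you can query it from any HTTP client. A minimal sketch with `requests`, assuming the port from the command above and the default `/v1/chat/completions` route:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            # Reuse the system prompt from the Quick Start so the
            # <think>/<answer> format is triggered.
            {
                "role": "system",
                "content": (
                    "You are Lily, a precise and thoughtful AI assistant. "
                    "Always reason step by step inside <think></think> tags, "
                    "then write your final answer inside <answer></answer> tags."
                ),
            },
            {"role": "user", "content": "What is the difference between a list and a tuple in Python?"},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```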

#### Ollama

```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Lily-1.5b-v0.1-Q4_K_M.gguf
SYSTEM """You are Lily, a precise and thoughtful AI assistant.

Always reason step by step inside <think></think> tags, then write your final answer inside <answer></answer> tags.

When answering:
- Be thorough: cover all relevant aspects, not just the surface question
- Be specific: use exact values, names, and examples rather than vague generalities
- Structure long responses with markdown headers, code blocks, and lists where appropriate
- Lead with the most important information first
- Match the depth of your answer to the complexity of the question

Tone: direct and confident. Never use filler phrases like "Certainly!", "Great question!", or "Of course!". Be helpful without being sycophantic."""
EOF
```

```bash
# Build and run
ollama create lily -f Modelfile
ollama run lily "Explain how transformers work"
```
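
Because the Modelfile bakes in the system prompt, the model can also be queried from Python through Ollama's local REST API with no extra setup (a sketch assuming Ollama's default port 11434 and the `lily` name created above):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "lily",
        "messages": [{"role": "user", "content": "Explain how transformers work"}],
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```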

## System Prompt

The system prompt shown in the Quick Start example is embedded in the model's chat template and applied automatically by `apply_chat_template`; you do not need to set it manually when using the Transformers pipeline, as it is already the default.

The critical sentence that triggers the `<think>`/`<answer>` format — kept verbatim from training — is:

```text
Always reason step by step inside <think></think> tags, then write your final answer inside <answer></answer> tags.
```

The rest of the system prompt shapes tone and response quality and can be overridden by passing a custom system message.
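
For example, a custom system message that keeps the trigger sentence verbatim but swaps the persona could look like this (the `TRIGGER` and `CUSTOM_SYSTEM` names are illustrative; pass the result as the `system` message in place of `SYSTEM_PROMPT` from the Quick Start):

```python
TRIGGER = (
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags."
)

# Keep the trigger sentence verbatim; everything after it is free to change.
CUSTOM_SYSTEM = TRIGGER + "\n\nYou answer as a terse senior code reviewer."

messages = [
    {"role": "system", "content": CUSTOM_SYSTEM},
    {"role": "user", "content": "Review: def f(x): return 1 / x"},
]
```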


## Intended Use

Lily is a general-purpose assistant fine-tune. It performs well on:

- Reasoning and logic problems
- Code explanation and generation
- Structured question answering
- Step-by-step problem solving

The explicit `<think>` step makes it especially useful in applications where reasoning transparency matters — grading, debugging, tutoring, or any workflow where you need to see why the model gave a particular answer, not just the answer itself.


## Limitations

- **1.5B parameters:** Not suited for tasks requiring broad world knowledge or long multi-document context
- **v0.1:** Early release — output quality and format consistency will improve in future versions
- **English primary:** Training data is predominantly English; multilingual performance is limited
- **No tool use / function calling:** This version does not support structured tool call outputs

## Training

Fine-tuned from Qwen/Qwen2.5-1.5B-Instruct using supervised fine-tuning on a dataset of chain-of-thought formatted examples. Each training example uses the <think>/<answer> output structure. Training was performed on a single T4 GPU via Google Colab.
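
The dataset itself is not published with this card, but given the description above, a single SFT example would plausibly be formatted like this (contents invented purely for illustration):

```python
# Hypothetical training example; the real dataset is not released.
example = {
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},  # same prompt as Quick Start
        {"role": "user", "content": "Is 91 prime?"},
        {
            "role": "assistant",
            "content": (
                "<think>91 = 7 * 13, so it has divisors other than 1 "
                "and itself.</think>\n"
                "<answer>No. 91 is composite: 91 = 7 * 13.</answer>"
            ),
        },
    ]
}
```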


## Citation

If you use Lily in research or a project, please cite:

```bibtex
@misc{lily-1.5b-v0.1,
  author    = {abhinav0231},
  title     = {Lily 1.5B v0.1: A chain-of-thought fine-tune of Qwen 2.5 1.5B},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/abhinav0231/Lily-1.5b-v0.1}
}
```

## License

Apache 2.0 — see LICENSE.