---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
language:
- en
pipeline_tag: text-generation
---
# Lily 1.5B — v0.1
Lily is a fine-tuned 1.5B-parameter language model built on Qwen 2.5 1.5B Instruct. It is trained to reason explicitly before answering: every response includes a visible thinking step inside `<think>` tags, followed by the final answer inside `<answer>` tags.
The model is optimized for precision and structured output. It stays direct, avoids filler phrases, and scales response depth to the complexity of the question.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Context length | 4096 tokens |
| Fine-tuning | Supervised fine-tuning (SFT) on chain-of-thought formatted data |
| Output format | `<think>...</think>` reasoning + `<answer>...</answer>` final response |
| License | Apache 2.0 |
## Output Format
Every response from Lily follows this structure:
```
<think>
[Step-by-step reasoning, working through the problem before committing to an answer]
</think>
<answer>
[Final response — structured, precise, and direct]
</answer>
```
The `<think>` block is Lily's scratchpad: it plans, evaluates, and drafts before producing the answer. This makes the model's reasoning transparent and auditable.
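If you need the two parts separately (for grading or logging, say), they can be split with a regular expression. The helper below is not part of any official tooling for this model, just a minimal sketch of one way to do it:

```python
import re

def split_response(text: str) -> tuple[str, str]:
    """Split a Lily response into (reasoning, answer) strings."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    # Fall back gracefully if a block is missing -- v0.1 format
    # consistency is not guaranteed (see Limitations below).
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final
```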
## Quick Start
### Transformers (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "abhinav0231/Lily-1.5b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are Lily, a precise and thoughtful AI assistant.\n\n"
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags.\n\n"
    "When answering:\n"
    "- Be thorough: cover all relevant aspects, not just the surface question\n"
    "- Be specific: use exact values, names, and examples rather than vague generalities\n"
    "- Structure long responses with markdown headers, code blocks, and lists where appropriate\n"
    "- Lead with the most important information first\n"
    "- Match the depth of your answer to the complexity of the question\n\n"
    "Tone: direct and confident. Never use filler phrases like \"Certainly!\", "
    "\"Great question!\", or \"Of course!\". Be helpful without being sycophantic."
)

def ask(question, max_new_tokens=512):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    return response

print(ask("What is the difference between a list and a tuple in Python?"))
```
### GGUF (llama.cpp / Ollama / LM Studio)
Quantized GGUF versions are available at `abhinav0231/Lily-1.5b-v0.1-GGUF`.
| Quant | Size | Use case |
|---|---|---|
| `Q4_K_M` | ~1.0 GB | Best balance of speed and quality for CPU inference |
| `Q5_K_M` | ~1.2 GB | Better quality, still fast on CPU |
| `Q8_0` | ~1.6 GB | Near-lossless, recommended if VRAM/RAM allows |
| `F16` | ~3.1 GB | Full precision, GPU only |
#### llama.cpp
```bash
# Download a quant
huggingface-cli download abhinav0231/Lily-1.5b-v0.1-GGUF \
  Lily-1.5b-v0.1-Q4_K_M.gguf \
  --local-dir ./

# Run the server
./llama.cpp/build/bin/llama-server \
  -m Lily-1.5b-v0.1-Q4_K_M.gguf \
  --ctx-size 4096 \
  --port 8080
```
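Once running, llama-server exposes an OpenAI-compatible endpoint. A minimal client sketch (assumes the `openai` package is installed; the `model` field is ignored by llama-server, and `SYSTEM_PROMPT` is the string from the Transformers example above):

```python
from openai import OpenAI

# Point the client at the local llama-server; the api_key is a dummy value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="lily",  # ignored by llama-server, required by the client
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is a race condition?"},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```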
#### Ollama
```bash
# Create a Modelfile (Ollama requires triple quotes for multiline SYSTEM strings)
cat > Modelfile << 'EOF'
FROM ./Lily-1.5b-v0.1-Q4_K_M.gguf
SYSTEM """You are Lily, a precise and thoughtful AI assistant.
Always reason step by step inside <think></think> tags, then write your final answer inside <answer></answer> tags.
When answering:
- Be thorough: cover all relevant aspects, not just the surface question
- Be specific: use exact values, names, and examples rather than vague generalities
- Structure long responses with markdown headers, code blocks, and lists where appropriate
- Lead with the most important information first
- Match the depth of your answer to the complexity of the question
Tone: direct and confident. Never use filler phrases like "Certainly!", "Great question!", or "Of course!". Be helpful without being sycophantic."""
EOF

# Build and run
ollama create lily -f Modelfile
ollama run lily "Explain how transformers work"
```
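For programmatic use, the same model can be reached through Ollama's local REST API. A sketch, assuming Ollama is running on its default port 11434 and the `lily` model was created with the Modelfile above:

```python
import requests

# Non-streaming chat request to the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "lily",
        "messages": [
            {"role": "user", "content": "Explain how transformers work"},
        ],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["message"]["content"])
```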
## System Prompt
The system prompt below is embedded in the model's chat template and applied automatically when using `apply_chat_template`. You do not need to set it manually if using the Transformers pipeline; it is already the default.
The critical sentence that triggers the `<think>`/`<answer>` format, kept verbatim from training, is:

> Always reason step by step inside `<think></think>` tags, then write your final answer inside `<answer></answer>` tags.
The rest of the system prompt shapes tone and response quality and can be overridden by passing a custom system message.
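For example, a custom system message can reshape the persona while keeping the format-critical sentence intact. A sketch, reusing the `tokenizer` and `model` objects from Quick Start (the reviewer persona here is purely illustrative):

```python
# Override the persona but keep the sentence that triggers the format.
custom_system = (
    "You are a terse code reviewer. "
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags."
)
messages = [
    {"role": "system", "content": custom_system},
    {"role": "user", "content": "Review this function for off-by-one errors."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Generation then proceeds exactly as in the ask() helper above.
```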
## Intended Use
Lily is a general-purpose assistant fine-tune. It performs well on:
- Reasoning and logic problems
- Code explanation and generation
- Structured question answering
- Step-by-step problem solving
The explicit `<think>` step makes it especially useful in applications where reasoning transparency matters: grading, debugging, tutoring, or any workflow where you need to see why the model gave a particular answer, not just the answer itself.
## Limitations
- **1.5B parameters:** Not suited for tasks requiring broad world knowledge or long multi-document context
- **v0.1:** Early release; output quality and format consistency will improve in future versions
- **English primary:** Training data is predominantly English; multilingual performance is limited
- **No tool use / function calling:** This version does not support structured tool call outputs
## Training
Lily was produced by supervised fine-tuning of `Qwen/Qwen2.5-1.5B-Instruct` on a dataset of chain-of-thought formatted examples, each using the `<think>`/`<answer>` output structure. Training was performed on a single T4 GPU via Google Colab.
## Citation
If you use Lily in research or a project, please cite:
```bibtex
@misc{lily-1.5b-v0.1,
  author    = {abhinav0231},
  title     = {Lily 1.5B v0.1: A chain-of-thought fine-tune of Qwen 2.5 1.5B},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/abhinav0231/Lily-1.5b-v0.1}
}
```
## License
Apache 2.0 — see LICENSE.