---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- qwen2.5
- chain-of-thought
- reasoning
- fine-tuned
- gguf
language:
- en
pipeline_tag: text-generation
---

# Lily 1.5B — v0.1

Lily is a fine-tuned 1.5B parameter language model built on [Qwen 2.5 1.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct). It is trained to reason explicitly before answering — every response includes a visible thinking step inside `<think>` tags, followed by the final answer inside `<answer>` tags.

The model is optimized for precision and structured output. It stays direct, avoids filler phrases, and scales response depth to the complexity of the question.

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Parameters** | 1.5B |
| **Context length** | 4096 tokens |
| **Fine-tuning** | Supervised fine-tuning (SFT) on chain-of-thought formatted data |
| **Output format** | `<think>...</think>` reasoning + `<answer>...</answer>` final response |
| **License** | Apache 2.0 |

---

## Output Format

Every response from Lily follows this structure:

```
<think>
[Step-by-step reasoning, working through the problem before committing to an answer]
</think>
<answer>
[Final response — structured, precise, and direct]
</answer>
```

The `<think>` block is Lily's scratchpad — it plans, evaluates, and drafts before producing the answer. This makes the model's reasoning transparent and auditable.
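
Downstream code usually wants the two blocks separated — the reasoning for logging or auditing, the answer for display. A minimal sketch (a hypothetical helper, assuming well-formed tags; v0.1 may occasionally deviate from the format):

```python
import re

def split_response(text: str) -> dict:
    """Separate a Lily response into its reasoning and answer parts."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "think": think.group(1).strip() if think else "",
        # Fall back to the raw text if the answer tags are missing.
        "answer": answer.group(1).strip() if answer else text.strip(),
    }
```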

---

## Quick Start

### Transformers (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "abhinav0231/Lily-1.5b-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are Lily, a precise and thoughtful AI assistant.\n\n"
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags.\n\n"
    "When answering:\n"
    "- Be thorough: cover all relevant aspects, not just the surface question\n"
    "- Be specific: use exact values, names, and examples rather than vague generalities\n"
    "- Structure long responses with markdown headers, code blocks, and lists where appropriate\n"
    "- Lead with the most important information first\n"
    "- Match the depth of your answer to the complexity of the question\n\n"
    "Tone: direct and confident. Never use filler phrases like \"Certainly!\", "
    "\"Great question!\", or \"Of course!\". Be helpful without being sycophantic."
)

def ask(question, max_new_tokens=512):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    # Render the chat template into a plain prompt string.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    return response

print(ask("What is the difference between a list and a tuple in Python?"))
```

---

## GGUF (llama.cpp / Ollama / LM Studio)

Quantized GGUF versions are available at [abhinav0231/Lily-1.5b-v0.1-GGUF](https://huggingface.co/abhinav0231/Lily-1.5b-v0.1-GGUF).

| Quant | Size | Use case |
|---|---|---|
| `Q4_K_M` | ~1.0 GB | Best balance of speed and quality for CPU inference |
| `Q5_K_M` | ~1.2 GB | Better quality, still fast on CPU |
| `Q8_0` | ~1.6 GB | Near-lossless, recommended if VRAM/RAM allows |
| `F16` | ~3.1 GB | Full precision, GPU only |

### llama.cpp

```bash
# Download a quant
huggingface-cli download abhinav0231/Lily-1.5b-v0.1-GGUF \
  Lily-1.5b-v0.1-Q4_K_M.gguf \
  --local-dir ./

# Run the server
./llama.cpp/build/bin/llama-server \
  -m Lily-1.5b-v0.1-Q4_K_M.gguf \
  --ctx-size 4096 \
  --port 8080
```
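
`llama-server` exposes an OpenAI-compatible HTTP API, so you can query the running server directly. A sketch (if the quantized chat template does not inject the system prompt, pass the one from Quick Start as an explicit `system` message):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the difference between TCP and UDP?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```
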
### Ollama

```bash
# Create a Modelfile (Ollama requires triple quotes for a multi-line SYSTEM string)
cat > Modelfile << 'EOF'
FROM ./Lily-1.5b-v0.1-Q4_K_M.gguf
SYSTEM """You are Lily, a precise and thoughtful AI assistant.

Always reason step by step inside <think></think> tags, then write your final answer inside <answer></answer> tags.

When answering:
- Be thorough: cover all relevant aspects, not just the surface question
- Be specific: use exact values, names, and examples rather than vague generalities
- Structure long responses with markdown headers, code blocks, and lists where appropriate
- Lead with the most important information first
- Match the depth of your answer to the complexity of the question

Tone: direct and confident. Never use filler phrases like "Certainly!", "Great question!", or "Of course!". Be helpful without being sycophantic."""
EOF

# Build and run
ollama create lily -f Modelfile
ollama run lily "Explain how transformers work"
```
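
Recent Ollama versions can also pull a GGUF repo straight from the Hub without a Modelfile (note that template and system prompt handling may then differ from the Modelfile above):

```bash
ollama run hf.co/abhinav0231/Lily-1.5b-v0.1-GGUF:Q4_K_M
```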

---

## System Prompt

The full system prompt (shown in the Quick Start section) is embedded in the model's chat template and applied automatically by `apply_chat_template`, so you do **not** need to set it manually when using Transformers — the Quick Start code passes it explicitly, which is equivalent.

The critical sentence that triggers the `<think>/<answer>` format — kept verbatim from training — is:

> *Always reason step by step inside `<think></think>` tags, then write your final answer inside `<answer></answer>` tags.*

The rest of the system prompt shapes tone and response quality and can be overridden by passing a custom `system` message.
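
For example, to swap the persona while keeping the format intact (an illustrative sketch — the custom text here is hypothetical):

```python
# Any system message works, as long as it keeps the verbatim trigger
# sentence so the <think>/<answer> format survives.
CUSTOM_SYSTEM = (
    "You are a terse code reviewer. "
    "Always reason step by step inside <think></think> tags, "
    "then write your final answer inside <answer></answer> tags."
)

messages = [
    {"role": "system", "content": CUSTOM_SYSTEM},
    {"role": "user", "content": "Review this diff for off-by-one errors."},
]
```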

---

## Intended Use

Lily is a general-purpose assistant fine-tune. It performs well on:

- Reasoning and logic problems
- Code explanation and generation
- Structured question answering
- Step-by-step problem solving

The explicit `<think>` step makes it especially useful in applications where reasoning transparency matters — grading, debugging, tutoring, or any workflow where you need to see *why* the model gave a particular answer, not just the answer itself.

---

## Limitations

- **1.5B parameters**: Not suited for tasks requiring broad world knowledge or long multi-document context
- **v0.1**: Early release — output quality and format consistency will improve in future versions
- **English primary**: Training data is predominantly English; multilingual performance is limited
- **No tool use / function calling**: This version does not support structured tool call outputs

---

## Training

Fine-tuned from `Qwen/Qwen2.5-1.5B-Instruct` using supervised fine-tuning on a dataset of chain-of-thought formatted examples. Each training example uses the `<think>/<answer>` output structure. Training was performed on a single T4 GPU via Google Colab.
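
For reference, a single SFT record in that format might look like the following (an illustrative, hypothetical example — the actual dataset is not published):

```python
# Hypothetical training record; SYSTEM_PROMPT is the prompt from Quick Start.
example = {
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "content": (
                "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>\n"
                "<answer>408</answer>"
            ),
        },
    ]
}
```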

---

## Citation

If you use Lily in research or a project, please cite:

```bibtex
@misc{lily-1.5b-v0.1,
  author    = {abhinav0231},
  title     = {Lily 1.5B v0.1: A chain-of-thought fine-tune of Qwen 2.5 1.5B},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/abhinav0231/Lily-1.5b-v0.1}
}
```

---

## License

Apache 2.0 — see [LICENSE](https://www.apache.org/licenses/LICENSE-2.0).