---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
tags:
  - qwen2
  - causal-lm
  - instruction-tuned
  - distillation
  - sft
  - lora
  - qlora
  - unsloth
  - chatml
  - reasoning

base_model: abhinav0231/Lily-1.5b-v0.1

datasets:
  - abhinav0231/Sarvam-105b-Distill-100k
---

# Lily-1.5b-v0.3

Lily-1.5b-v0.3 is a distilled instruction-tuned language model built by continuing training from `abhinav0231/Lily-1.5b-v0.1` on the `abhinav0231/Sarvam-105b-Distill-100k` dataset using the `chatml` split/configuration.

This version was trained as an offline supervised fine-tuning run focused on high-quality long-form assistant responses in ChatML format, with many examples following an explicit `<think>` and `<answer>` structure.

The model was trained and merged in a single-GPU Modal workflow on an NVIDIA A100-SXM4-40GB system using BF16, QLoRA, and Unsloth.

---

# Model summary

This checkpoint starts from `abhinav0231/Lily-1.5b-v0.1` and applies a distillation-style supervised fine-tuning stage rather than training from scratch.

The base architecture loaded during training is a Qwen2-style causal language model with:

- 28 layers
- hidden size 1536
- 12 attention heads
- 2 key-value heads
- vocabulary size 151,936
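As a quick arithmetic check, the figures above imply the per-head dimension and the grouped-query attention ratio. This is a minimal sketch that assumes the hidden size is split evenly across attention heads, as is usual for Qwen2-style models; the variable names are illustrative and not taken from the training code.

```python
# Architecture figures quoted in this model card.
hidden_size = 1536
num_attention_heads = 12
num_key_value_heads = 2

# Assumption: the hidden size is divided evenly across query heads.
head_dim = hidden_size // num_attention_heads

# Grouped-query attention: several query heads share one key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Output width of the key and value projections under that assumption.
kv_proj_dim = num_key_value_heads * head_dim

print(f"head_dim={head_dim}, queries_per_kv_head={queries_per_kv_head}, kv_proj_dim={kv_proj_dim}")
```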

The training setup targets:

- instruction following
- structured response generation
- distilled reasoning-flavored outputs

rather than pure base-model continuation pretraining.

---

# Training objective

The goal of v0.3 was to improve the model through offline SFT distillation from a synthetic/teacher-style dataset while preserving the usability and compact size of the 1.5B-class base model.

The dataset examples are preformatted as ChatML conversations and frequently instruct the assistant to reason in a `<think>` block before producing a final `<answer>` block.

Because of that training distribution, the model may naturally produce more structured, tutor-like, stepwise outputs than the earlier checkpoint depending on the prompt style.

---

# Base model

- **Base model:** `abhinav0231/Lily-1.5b-v0.1`
- **Final merged model repo:** `abhinav0231/Lily-1.5b-v0.3`
- **GGUF repo:** `abhinav0231/Lily-1.5b-v0.3-GGUF`

---

# Benchmarks

Evaluated with `lm-evaluation-harness`, v0.3 achieved the following results:


![image](https://cdn-uploads.huggingface.co/production/uploads/6617f58b9e7b3dbb1407d934/c6KDfKHIiEPmcwZi5qbBe.png)

---

# Dataset

The main training dataset is:

`abhinav0231/Sarvam-105b-Distill-100k`

using the `chatml` configuration, stored as a single `text` column of preformatted conversations.

The final training notebook loaded:

- 91,457 training examples
- 1,908 validation examples

A separate sanity-check pass over the dataset family showed a very similar distribution, including:

- 92,040 training examples
- 1,917 validation examples
- 1,918 test examples

confirming the same overall ChatML reasoning-style format.

---

## Dataset style

The dataset uses ChatML with:

- `<|im_start|>`
- `<|im_end|>`

delimiters and includes a chat template in the tokenizer setup.

Many examples use a system prompt that explicitly asks the assistant to think through the problem in a `<think>` block and then give the final response in an `<answer>` block.

This means the model was not trained on plain raw instruction-response text alone; it was trained on a formatted conversational distribution with strong structural priors.
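To make the format concrete, here is a minimal sketch of how a conversation might be rendered with the ChatML delimiters. The helper `to_chatml` and the system prompt wording are hypothetical illustrations, and the exact template shipped with the tokenizer may differ in whitespace or defaults, so prefer `tokenizer.apply_chat_template` in practice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Hypothetical helper for illustration only; not the tokenizer's own template.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Illustrative system prompt; the dataset's actual wording is not reproduced here.
example = [
    {
        "role": "system",
        "content": "Think through the problem in a <think> block, "
                   "then give the final response in an <answer> block.",
    },
    {"role": "user", "content": "What is 12 * 7?"},
]

print(to_chatml(example))
```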

---

## Length characteristics

A 5,000-sample sanity slice of the training set had:

- mean length = 1640.72 tokens
- p50 = 1219
- p90 = 3221
- p95 = 4096.15
- p99 = 6883.35

About:

- 5.00% of sampled training examples
- 4.33% of sampled validation examples

exceeded 4096 tokens.

These numbers matter because the training run used a maximum sequence length of 4096 tokens, so the longest examples are subject to truncation or packing effects depending on preprocessing behavior.
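A length check of this kind can be sketched as follows. This is not the notebook's actual code: it assumes the per-example token counts have already been computed with the model's tokenizer, and the `percentile` and `length_report` helpers are hypothetical names using simple linear interpolation.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def length_report(token_counts, max_len=4096):
    """Summarize token lengths and the share of examples above max_len."""
    vals = sorted(token_counts)
    return {
        "mean": sum(vals) / len(vals),
        "p50": percentile(vals, 50),
        "p90": percentile(vals, 90),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "frac_over_max": sum(v > max_len for v in vals) / len(vals),
    }


# Toy input for illustration; real use would pass tokenized lengths.
print(length_report([1000, 2000, 3000, 5000]))
```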

---

# Training setup

Training was run on a single NVIDIA A100-SXM4-40GB GPU in Modal, without:

- DDP
- `accelerate launch`
- multi-process orchestration

The environment used:

- Unsloth 2026.5.2
- TRL 0.22.2
- PyTorch 2.8.0+cu129
- CUDA 12.9
- Triton 3.4.0
- BF16 mixed precision

Flash Attention 2 was auto-enabled by Unsloth because the A100 supports it.

---

## Core hyperparameters

| Parameter | Value |
|---|---|
| Max sequence length | 4096 |
| Num epochs | 2 |
| Learning rate | 2e-5 |
| Warmup steps | 100 |
| Warmup ratio | 0.03 |
| Batch size | 24 |
| Gradient accumulation | 1 |
| Effective batch size | 24 |
| Seed | 42 |

---

## Optimization stack

The model was loaded with QLoRA 4-bit weights during training, while the final merged checkpoint was saved in 16-bit merged form for deployment and inference use.

The W&B config logged the optimizer as `adamw_8bit`, while the trainer config used fused AdamW (`adamw_torch_fused`) in the notebook training arguments.

Sequence packing was enabled, dataset preprocessing used multiprocessing, and periodic evaluation/checkpoint saving was configured during the run.

---

# LoRA / PEFT details

The fine-tuning used:

- LoRA rank = 32
- LoRA alpha = 64

Target modules:

- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
- `gate_proj`
- `up_proj`
- `down_proj`

The run reported approximately:

- 36.9M trainable parameters

which corresponded to around 2.34%–4.0% of total parameters depending on counting conventions.
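The trainable-parameter figure can be roughly reproduced from the LoRA rank and the projection shapes. This is a sketch under stated assumptions: the MLP intermediate size of 8960 is not given in this card and is taken from the published Qwen2-1.5B configuration, the key/value projection width of 256 follows from 2 key-value heads of dimension 128, and no biases or embeddings are adapted.

```python
RANK = 32
NUM_LAYERS = 28
HIDDEN = 1536
KV_DIM = 256          # 2 key-value heads x head dimension 128
INTERMEDIATE = 8960   # Assumption: not stated in this card (Qwen2-1.5B config value)

# (in_features, out_features) for each targeted projection in one layer.
PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A LoRA adapter adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
total_trainable = NUM_LAYERS * per_layer

print(f"{total_trainable:,} trainable LoRA parameters (~{total_trainable / 1e6:.1f}M)")
```

Under these assumptions the total lands close to the reported 36.9M; the percentage of total parameters still depends on whether the denominator counts 4-bit packed weights or the full-precision parameter count.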

---

# Hardware and runtime

Training hardware:

- NVIDIA A100-SXM4-40GB
- ~42.4 GB VRAM exposed
- Compute capability 8.0
- BF16 support
- Flash Attention 2 support

The run specifically targeted A100-native BF16 and Flash Attention 2 optimizations.

Total training runtime was approximately:

- 5 hours 14 minutes

---

# Checkpointing and merge

Intermediate checkpoints were pushed to:

`abhinav0231/Lily-1.5b-distill-v3-checkpoints`

during training.

The workflow included auto-resume logic from the latest Hugging Face checkpoint.

After training, the LoRA adapter was merged back into the base model in BF16/16-bit form and pushed as:

`abhinav0231/Lily-1.5b-v0.3`

The notebook also included GGUF export paths for quantized deployment variants.

---

# Training logs

The trainer log reported:

- 33,297 packed training examples
- 2 epochs
- 2,776 optimization steps

Validation loss decreased from 9.100862 at step 500 to 8.973075 at step 2500.

These values should be interpreted as internal training diagnostics rather than direct end-user quality metrics.
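The reported step count is consistent with the batch size of 24 and gradient accumulation of 1 listed in the hyperparameters. A minimal arithmetic sketch, assuming the last partial batch of each epoch is kept rather than dropped:

```python
import math

packed_examples = 33_297
batch_size = 24
grad_accum = 1
epochs = 2

effective_batch = batch_size * grad_accum

# Assumption: the final, smaller batch in each epoch still counts as a step.
steps_per_epoch = math.ceil(packed_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```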

---

# Intended use

This model is intended for:

- instruction-following chat experiments
- structured answer generation
- research on distilled reasoning-style outputs
- lightweight local or hosted inference in the 1.5B parameter class

It is especially suited to prompts where:

- a user asks for explanations or breakdowns
- the desired answer format is structured
- the prompt resembles the ChatML style used during training

---

# Prompting notes

Because the training data is ChatML-formatted, best results usually come from chat-style prompting rather than plain raw completion prompting.

The model may respond in a more verbose tutor-like style because many training prompts encouraged detailed reasoning followed by a final answer.

If a cleaner direct-answer style is preferred, using a concise system prompt and explicitly requesting short outputs can help steer generation.

---

# Limitations

This model was trained on synthetic/distilled instruction data rather than broad raw web-scale pretraining data.

As a result:

- outputs may reflect teacher-style formatting biases
- responses may become over-structured
- reasoning markup may occasionally appear in generations

The dataset sanity checks also flagged formatting irregularities in sampled rows, including repeated markers and malformed counts, so downstream behavior may inherit some formatting artifacts from the source corpus.
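If reasoning markup is unwanted downstream, it can be stripped in post-processing. The sketch below is one possible approach, not part of the released workflow: the helper `extract_answer` is hypothetical, and it assumes the blocks are closed with `</think>` and `</answer>` tags, which malformed generations may not honor.

```python
import re

# Assumption: blocks are delimited by matching opening and closing tags.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(text):
    """Return the <answer> body if present, else the text without <think> blocks."""
    match = ANSWER_RE.search(text)
    if match:
        return match.group(1).strip()
    # Fall back to removing any reasoning block and returning what remains.
    return THINK_RE.sub("", text).strip()


print(extract_answer("<think>12 * 7 = 84</think>\n<answer>84</answer>"))
```

Unclosed or repeated tags pass through the fallback unchanged, so consequential pipelines may still want an explicit validation step.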

---

# Safety

This model is not designed for fully autonomous use in high-stakes domains such as:

- legal
- medical
- financial
- safety-critical systems

Outputs can still be:

- incorrect
- incomplete
- overconfident

Human review is recommended for consequential use cases.


---

# Usage

## Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "abhinav0231/Lily-1.5b-v0.3"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain overfitting in simple terms."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Suggested prompting

For best results:
- use chat-style prompts,
- keep instructions explicit,
- specify desired format,
- request concise output if you do not want long reasoning-style responses.

## Provenance

- **Base model:** `abhinav0231/Lily-1.5b-v0.1`
- **Training dataset:** `abhinav0231/Sarvam-105b-Distill-100k` (`chatml`)
- **Training framework:** Unsloth + TRL
- **Hardware:** 1x NVIDIA A100-SXM4-40GB
- **Final merged repo:** `abhinav0231/Lily-1.5b-v0.3`

## Acknowledgements

This model was trained with Unsloth, Hugging Face Transformers, TRL, PEFT/LoRA-style fine-tuning, and W&B logging in a Modal-hosted workflow.

This Qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)