---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen2
- causal-lm
- instruction-tuned
- distillation
- sft
- lora
- qlora
- unsloth
- chatml
- reasoning

base_model: abhinav0231/Lily-1.5b-v0.1

datasets:
- abhinav0231/Sarvam-105b-Distill-100k
---

# Lily-1.5b-v0.3

Lily-1.5b-v0.3 is a distilled instruction-tuned language model built by continuing training from `abhinav0231/Lily-1.5b-v0.1` on the `chatml` configuration of the `abhinav0231/Sarvam-105b-Distill-100k` dataset.

This version was trained with offline supervised fine-tuning (SFT) focused on high-quality long-form assistant responses in ChatML format, with many examples following an explicit `<think>` and `<answer>` structure.

The model was trained and merged in a single-GPU Modal workflow on an NVIDIA A100-SXM4-40GB system using BF16, QLoRA, and Unsloth.

---

# Model summary

This checkpoint starts from `abhinav0231/Lily-1.5b-v0.1` and applies a distillation-style supervised fine-tuning stage rather than training from scratch.

The base architecture loaded during training is a Qwen2-style causal language model with:

- 28 layers
- hidden size 1536
- 12 attention heads
- 2 key-value heads
- vocabulary size 151,936
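
The head counts listed above imply grouped-query attention. As a rough sketch, assuming the hidden size is split evenly across the attention heads (the usual Qwen2 convention, which this card does not state explicitly), the per-head width and the key/value projection width can be derived as follows. `ArchSummary` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSummary:
    """Architecture numbers quoted in this card."""

    num_layers: int = 28
    hidden_size: int = 1536
    num_attention_heads: int = 12
    num_key_value_heads: int = 2
    vocab_size: int = 151_936

    @property
    def head_dim(self) -> int:
        # ASSUMPTION: hidden_size is divided evenly among the attention heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads that share one key/value head.
        return self.num_attention_heads // self.num_key_value_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of k_proj / v_proj under the assumption above.
        return self.num_key_value_heads * self.head_dim


arch = ArchSummary()
print(arch.head_dim, arch.queries_per_kv_head, arch.kv_proj_width)
```

Under that assumption, each key/value head is shared by six query heads, and the key and value projections are much narrower than the query projection.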

Rather than continued base-model pretraining, the training setup targets:

- instruction following
- structured response generation
- distilled reasoning-flavored outputs

---

# Training objective

The goal of v0.3 was to improve the model through offline SFT distillation from a synthetic/teacher-style dataset while preserving the usability and compact size of the 1.5B-class base model.

The dataset examples are preformatted as ChatML conversations and frequently instruct the assistant to reason in a `<think>` block before producing a final `<answer>` block.

Because of that training distribution, the model may produce more structured, tutor-like, stepwise outputs than the earlier checkpoint, depending on the prompt style.
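
Downstream code often needs to separate the reasoning from the final answer. The following is a minimal sketch, assuming the blocks are closed with `</think>` and `</answer>` tags (the card names the opening tags only, so the closing tags are an assumption). `parse_response` and `ParsedResponse` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    thinking: Optional[str]
    answer: str


# ASSUMPTION: blocks are closed with </think> and </answer>.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Split a generation into its reasoning block and final answer."""
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    thinking = think.group(1).strip() if think else None
    if answer:
        final = answer.group(1).strip()
    else:
        # No <answer> block: fall back to the text with any <think> block removed.
        final = _THINK_RE.sub("", text).strip()
    return ParsedResponse(thinking, final)


sample = "<think>2 + 2 is 4.</think>\n<answer>4</answer>"
print(parse_response(sample))
```

The fallback branch matters because the model does not always emit the markup; plain responses pass through unchanged.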

---

# Base model

- **Base model:** `abhinav0231/Lily-1.5b-v0.1`
- **Final merged model repo:** `abhinav0231/Lily-1.5b-v0.3`
- **GGUF repo:** `abhinav0231/Lily-1.5b-v0.3-GGUF`

---

# Benchmarks

Evaluated with `lm-evaluation-harness`, v0.3 achieved the following results:

![Benchmark results for Lily-1.5b-v0.3](https://cdn-uploads.huggingface.co/production/uploads/67221b4d0c51e0a32c5ea9ae/H6Zo8-4L7S3PGoL4c05Pq.png)

---

# Dataset

The main training dataset is `abhinav0231/Sarvam-105b-Distill-100k`, using the `chatml` configuration, stored as a single `text` column of preformatted conversations.

The final training notebook loaded:

- 91,457 training examples
- 1,908 validation examples

A separate sanity-check pass over the dataset family showed very similar split sizes and confirmed the same overall ChatML reasoning-style format:

- 92,040 training examples
- 1,917 validation examples
- 1,918 test examples

---

## Dataset style

The dataset uses ChatML with `<|im_start|>` and `<|im_end|>` delimiters, and the tokenizer setup includes a chat template.

Many examples use a system prompt that explicitly asks the assistant to think through the problem in a `<think>` block and then give the final response in an `<answer>` block.

This means the model was not trained on plain raw instruction-response text alone; it was trained on a formatted conversational distribution with strong structural priors.
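
To make the format concrete, here is a minimal sketch of how a conversation is laid out with these delimiters. `render_chatml` is a hypothetical helper and the system prompt wording is invented for illustration; the tokenizer's bundled chat template remains the authoritative formatter and may differ in details.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render role/content messages as ChatML text."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    # Hypothetical system prompt in the spirit of the dataset description.
    {"role": "system", "content": "Think in a <think> block, then reply in an <answer> block."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

In practice `tokenizer.apply_chat_template` should be preferred, as shown in the usage section of this card.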

---

## Length characteristics

A 5,000-sample sanity slice of the training set had the following token-length statistics:

- mean length = 1640.72 tokens
- p50 = 1219
- p90 = 3221
- p95 = 4096.15
- p99 = 6883.35

About 5.00% of sampled training examples and 4.33% of sampled validation examples exceeded 4096 tokens.

These numbers matter because the training run used a 4096-token maximum sequence length, so the longest examples are subject to truncation or packing effects depending on preprocessing behavior.
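
A report of this kind can be produced from a list of per-example token counts. The sketch below is one way to do it with the standard library, assuming linearly interpolated percentiles (the fractional p95 and p99 values above suggest interpolation, but the card does not say which method was used). The function names and the demo lengths are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted values."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def length_report(token_lengths: Sequence[int], max_seq_len: int = 4096) -> Dict[str, float]:
    """Summarize token lengths and the share exceeding max_seq_len."""
    ordered = sorted(token_lengths)
    over = sum(1 for n in ordered if n > max_seq_len)
    return {
        "mean": mean(ordered),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "pct_over_max": 100.0 * over / len(ordered),
    }


# Hypothetical lengths for illustration only, not taken from the dataset.
demo_lengths = [100, 200, 300, 400, 5000]
report = length_report(demo_lengths)
print(report)
```

The `pct_over_max` entry corresponds to the share of examples that would be affected by the sequence-length limit.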

---

# Training setup

Training was run on a single NVIDIA A100-SXM4-40GB GPU in Modal, without:

- DDP
- `accelerate launch`
- multi-process orchestration

The environment used:

- Unsloth 2026.5.2
- TRL 0.22.2
- PyTorch 2.8.0+cu129
- CUDA 12.9
- Triton 3.4.0
- BF16 mixed precision

Flash Attention 2 was auto-enabled by Unsloth because the A100 supports it.

---

## Core hyperparameters

| Parameter | Value |
|---|---|
| Max sequence length | 4096 |
| Num epochs | 2 |
| Learning rate | 2e-5 |
| Warmup steps | 100 |
| Warmup ratio | 0.03 |
| Batch size | 24 |
| Gradient accumulation | 1 |
| Effective batch size | 24 |
| Seed | 42 |

---

## Optimization stack

The base model was loaded in 4-bit (QLoRA) during training, while the final merged checkpoint was saved in 16-bit form for deployment and inference.

The W&B config logged the optimizer as `adamw_8bit`, whereas the notebook's training arguments specified fused AdamW (`adamw_torch_fused`); the two records disagree on which optimizer was in effect.

Sequence packing was enabled, dataset preprocessing used multiprocessing, and periodic evaluation and checkpoint saving were configured during the run.

---

# LoRA / PEFT details

The fine-tuning used:

- LoRA rank = 32
- LoRA alpha = 64

Target modules:

- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
- `gate_proj`
- `up_proj`
- `down_proj`

The run reported approximately 36.9M trainable parameters, which corresponded to around 2.34%–4.0% of total parameters depending on counting conventions.
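
The reported trainable-parameter count can be approximately reproduced from the architecture and LoRA settings. The sketch below is a back-of-the-envelope check, not the trainer's own accounting. It assumes an MLP intermediate size of 8960 (the usual value for Qwen2-1.5B-class models, which this card does not state), a head dimension of 1536 / 12 = 128, and bias-free LoRA adapters on every listed projection in all 28 layers.

```python
HIDDEN = 1536
NUM_LAYERS = 28
RANK = 32
HEAD_DIM = HIDDEN // 12   # ASSUMPTION: hidden size split evenly over 12 heads
KV_WIDTH = 2 * HEAD_DIM   # 2 key-value heads
INTERMEDIATE = 8960       # ASSUMPTION: not stated in this card

# (in_features, out_features) for each LoRA target module in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # A has shape (rank, in_features); B has shape (out_features, rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total comes to 36,929,536, in line with the roughly 36.9M reported by the run.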

---

# Hardware and runtime

Training hardware:

- NVIDIA A100-SXM4-40GB
- ~42.4 GB VRAM exposed
- Compute capability 8.0
- BF16 support
- Flash Attention 2 support

The run specifically targeted A100-native BF16 and Flash Attention 2 optimizations.

Total training runtime was approximately 5 hours 14 minutes.

---

# Checkpointing and merge

Intermediate checkpoints were pushed to `abhinav0231/Lily-1.5b-distill-v3-checkpoints` during training.

The workflow included logic to auto-resume from the latest checkpoint on the Hugging Face Hub.
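
The card does not show the resume logic itself. As an illustration only, picking the latest checkpoint could look like the sketch below, assuming checkpoint folders follow the `checkpoint-<step>` naming that the Hugging Face `Trainer` uses. `latest_checkpoint` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint folder with the highest step, or None if there is none."""
    best_step, best_name = -1, None
    for name in names:
        match = _CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


folders = ["checkpoint-500", "checkpoint-2500", "checkpoint-1000", "README.md"]
print(latest_checkpoint(folders))
```

Comparing the step numerically matters here: a plain lexicographic sort would rank `checkpoint-500` after `checkpoint-2500`.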

After training, the LoRA adapter was merged back into the base model in BF16 (16-bit) form and pushed as `abhinav0231/Lily-1.5b-v0.3`.

The notebook also included GGUF export paths for quantized deployment variants.
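
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: `W_merged = W + (alpha / rank) * B @ A`. The toy sketch below illustrates that formula with plain Python lists, assuming standard LoRA scaling (which would give a factor of 64 / 32 = 2 for this run). It is not the Unsloth merge routine, which is not shown in this card, and the matrix values are invented.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter; values are illustrative only.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (rank, in_features)
B = [[0.5], [0.25]]    # shape (out_features, rank)
merged = merge_lora(W, A, B, alpha=2.0, rank=1)
print(merged)
```

After merging, inference needs only the single merged matrix per layer, which is why the published checkpoint loads like an ordinary model without PEFT.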

---

# Training logs

The trainer log reported:

- 33,297 packed training examples
- 2 epochs
- 2,776 optimization steps

Validation loss decreased from 9.100862 at step 500 to 8.973075 at step 2500.

These values should be interpreted as internal training diagnostics rather than direct end-user quality metrics.
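
The step count in the trainer log is consistent with the batch settings in the hyperparameter table. The short check below reproduces it, assuming the final partial batch of each epoch counts as one step (that is, it is not dropped). The packing ratio is derived from the 91,457 training examples the notebook loaded.

```python
import math

packed_examples = 33_297
raw_examples = 91_457
effective_batch = 24 * 1  # batch size x gradient accumulation on a single GPU
epochs = 2

# ASSUMPTION: the last, partially filled batch of each epoch is kept.
steps_per_epoch = math.ceil(packed_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Average number of original examples packed into one training sequence.
packing_ratio = raw_examples / packed_examples

print(steps_per_epoch, total_steps, round(packing_ratio, 2))
```

The result matches the 2,776 optimization steps in the log, and the ratio suggests that packing combined a little under three examples per 4096-token sequence on average.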

---

# Intended use

This model is intended for:

- instruction-following chat experiments
- structured answer generation
- research on distilled reasoning-style outputs
- lightweight local or hosted inference in the 1.5B parameter class

It is especially suited to prompts where:

- a user asks for explanations or breakdowns
- the desired answer format is structured
- the prompt resembles the ChatML style used during training

---

# Prompting notes

Because the training data is ChatML-formatted, best results usually come from chat-style prompting rather than raw completion prompting.

The model may respond in a more verbose, tutor-like style because many training prompts encouraged detailed reasoning followed by a final answer.

If a cleaner direct-answer style is preferred, using a concise system prompt and explicitly requesting short outputs can help steer generation.

---

# Limitations

This model was trained on synthetic/distilled instruction data rather than broad raw web-scale pretraining data.

As a result:

- outputs may reflect teacher-style formatting biases
- responses may become over-structured
- reasoning markup may occasionally appear in generations

The dataset sanity checks also flagged formatting irregularities in sampled rows, including repeated markers and malformed counts, so downstream behavior may inherit some formatting artifacts from the source corpus.

---

# Safety

This model is not designed for fully autonomous use in high-stakes domains such as:

- legal
- medical
- financial
- safety-critical systems

Outputs can still be:

- incorrect
- incomplete
- overconfident

Human review is recommended for consequential use cases.

---

# Usage

## Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "abhinav0231/Lily-1.5b-v0.3"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain overfitting in simple terms."},
]

# Render the conversation with the tokenizer's ChatML chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Suggested prompting

For best results:

- use chat-style prompts,
- keep instructions explicit,
- specify desired format,
- request concise output if you do not want long reasoning-style responses.

## Provenance

- **Base model:** `abhinav0231/Lily-1.5b-v0.1`
- **Training dataset:** `abhinav0231/Sarvam-105b-Distill-100k` (`chatml`)
- **Training framework:** Unsloth + TRL
- **Hardware:** 1x NVIDIA A100-SXM4-40GB
- **Final merged repo:** `abhinav0231/Lily-1.5b-v0.3`

## Acknowledgements

This model was trained with Unsloth, Hugging Face Transformers, TRL, PEFT/LoRA-style fine-tuning, and W&B logging in a Modal-hosted workflow.

This Qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)