---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- qwen2
- causal-lm
- instruction-tuned
- distillation
- sft
- lora
- qlora
- unsloth
- chatml
- reasoning

base_model: abhinav0231/Lily-1.5b-v0.1

datasets:
- abhinav0231/Sarvam-105b-Distill-100k
---

# Lily-1.5b-v0.3

Lily-1.5b-v0.3 is a distilled instruction-tuned language model built by continuing training from `abhinav0231/Lily-1.5b-v0.1` on the `abhinav0231/Sarvam-105b-Distill-100k` dataset using the `chatml` split/configuration.

This version was trained as an offline supervised fine-tuning run focused on high-quality long-form assistant responses in ChatML format, with many examples following an explicit `<think>` and `<answer>` structure.

The model was trained and merged in a single-GPU Modal workflow on an NVIDIA A100-SXM4-40GB system using BF16, QLoRA, and Unsloth.

---

# Model summary

This checkpoint starts from `abhinav0231/Lily-1.5b-v0.1` and applies a distillation-style supervised fine-tuning stage rather than training from scratch.

The base architecture loaded during training is a Qwen2-style causal language model with:

- 28 layers
- hidden size 1536
- 12 attention heads
- 2 key-value heads
- vocabulary size 151,936
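
As a rough illustration of what these dimensions imply at inference time, the sketch below derives the per-head dimension, the grouped-query ratio, and the BF16 KV-cache footprint of one 4096-token sequence. It assumes the head dimension equals the hidden size divided by the number of attention heads, which is the usual Qwen2 convention but is not stated in this card. `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
# Architecture figures from the list above.
NUM_LAYERS = 28
HIDDEN_SIZE = 1536
NUM_ATTENTION_HEADS = 12
NUM_KV_HEADS = 2

# Assumption: head_dim = hidden_size / num_attention_heads (not stated in the card).
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS
QUERY_HEADS_PER_KV_HEAD = NUM_ATTENTION_HEADS // NUM_KV_HEADS


def kv_cache_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (2 bytes per value for BF16)."""
    values_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM  # 2 = one key + one value
    return num_tokens * values_per_token * bytes_per_value


if __name__ == "__main__":
    print(HEAD_DIM)                      # 128
    print(QUERY_HEADS_PER_KV_HEAD)       # 6
    print(kv_cache_bytes(4096) / 2**20)  # 112.0 (MiB)
```

With only 2 key-value heads shared across 12 query heads, the cache stays small relative to the model weights, which suits the lightweight inference use case.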

The training setup targets the following, rather than pure base-model continuation pretraining:

- instruction following
- structured response generation
- distilled reasoning-flavored outputs

---

# Training objective

The goal of v0.3 was to improve the model through offline SFT distillation from a synthetic/teacher-style dataset while preserving the usability and compact size of the 1.5B-class base model.

The dataset examples are preformatted as ChatML conversations and frequently instruct the assistant to reason in a `<think>` block before producing a final `<answer>` block.

Because of that training distribution, the model may produce more structured, tutor-like, stepwise outputs than the earlier checkpoint, depending on the prompt style.

---

# Base model

- **Base model:** `abhinav0231/Lily-1.5b-v0.1`
- **Final merged model repo:** `abhinav0231/Lily-1.5b-v0.3`
- **GGUF repo:** `abhinav0231/Lily-1.5b-v0.3-GGUF`

---

# Benchmarks

v0.3 was evaluated with `lm-evaluation-harness`.

[Figure: benchmark results for v0.3; the image is not recoverable from this copy.]

---

# Dataset

The main training dataset is `abhinav0231/Sarvam-105b-Distill-100k`, using the `chatml` configuration, stored as a single `text` column of preformatted conversations.

The final training notebook loaded:

- 91,457 training examples
- 1,908 validation examples

A separate sanity-check pass over the dataset family showed a very similar distribution, confirming the same overall ChatML reasoning-style format:

- 92,040 training examples
- 1,917 validation examples
- 1,918 test examples

---

## Dataset style

The dataset uses ChatML with `<|im_start|>` and `<|im_end|>` delimiters, and the tokenizer setup includes a chat template.

Many examples use a system prompt that explicitly asks the assistant to think through the problem in a `<think>` block and then give the final response in an `<answer>` block.

This means the model was not trained on plain raw instruction-response text alone; it was trained on a formatted conversational distribution with strong structural priors.
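
To make the format concrete, here is a minimal sketch of how one such conversation could be rendered as ChatML text. The system prompt wording and the sample turn are hypothetical stand-ins, not rows copied from the dataset, and `to_chatml` is an illustrative helper; in practice the tokenizer's chat template does this rendering.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render messages as ChatML text using <|im_start|> / <|im_end|> delimiters."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Hypothetical example row; the wording is illustrative, not copied from the dataset.
example = [
    {"role": "system", "content": "Reason inside <think> tags, then reply inside <answer> tags."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>2 + 2 = 4.</think>\n<answer>4</answer>"},
]

if __name__ == "__main__":
    print(to_chatml(example))
```

Each turn is wrapped in its own delimiter pair, so the structural markers appear in every training example regardless of topic.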

---

## Length characteristics

A 5,000-sample sanity slice of the training set had:

- mean length = 1640.72 tokens
- p50 = 1219
- p90 = 3221
- p95 = 4096.15
- p99 = 6883.35

About 5.00% of sampled training examples and 4.33% of sampled validation examples exceeded 4096 tokens.

These numbers matter because the training run used a 4096-token max sequence length, so the longest examples are subject to truncation or packing effects depending on preprocessing behavior.
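
A check of this kind can be sketched as follows, given precomputed token counts. The sample lengths are made up for illustration, and the plain right-truncation shown is only one possible preprocessing behavior, assumed here rather than confirmed for this run.

```python
from statistics import mean


def summarize_lengths(token_lengths: list[int], max_len: int = 4096) -> dict[str, float]:
    """Mean length and share of examples longer than max_len."""
    over = sum(1 for n in token_lengths if n > max_len)
    return {
        "mean": mean(token_lengths),
        "share_over_max": over / len(token_lengths),
    }


def truncate(token_ids: list[int], max_len: int = 4096) -> list[int]:
    """Keep only the first max_len tokens, as simple right-truncation would."""
    return token_ids[:max_len]


if __name__ == "__main__":
    # Hypothetical token counts, not taken from the real dataset.
    lengths = [800, 1200, 1600, 3000, 5000]
    print(summarize_lengths(lengths))
    print(len(truncate(list(range(5000)))))
```

Under right-truncation, an over-length example loses its tail, which in a `<think>` then `<answer>` layout is where the final answer sits.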

---

# Training setup

Training was run on a single NVIDIA A100-SXM4-40GB GPU in Modal, without:

- DDP
- `accelerate launch`
- multi-process orchestration

The environment used:

- Unsloth 2026.5.2
- TRL 0.22.2
- PyTorch 2.8.0+cu129
- CUDA 12.9
- Triton 3.4.0
- BF16 mixed precision

Flash Attention 2 was auto-enabled by Unsloth because the A100 supports it.

---

## Core hyperparameters

| Parameter | Value |
|---|---|
| Max sequence length | 4096 |
| Num epochs | 2 |
| Learning rate | 2e-5 |
| Warmup steps | 100 |
| Warmup ratio | 0.03 |
| Batch size | 24 |
| Gradient accumulation | 1 |
| Effective batch size | 24 |
| Seed | 42 |

---

## Optimization stack

The model was loaded with QLoRA 4-bit weights during training, while the final merged checkpoint was saved in 16-bit merged form for deployment and inference use.

The W&B config logged the optimizer as `adamw_8bit`, while the trainer config used fused AdamW (`adamw_torch_fused`) in the notebook training arguments.

Sequence packing was enabled, dataset preprocessing used multiprocessing, and periodic evaluation/checkpoint saving was configured during the run.

---

# LoRA / PEFT details

The fine-tuning used:

- LoRA rank = 32
- LoRA alpha = 64

Target modules:

- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
- `gate_proj`
- `up_proj`
- `down_proj`

The run reported approximately 36.9M trainable parameters, which corresponds to around 2.34%–4.0% of total parameters depending on counting conventions.
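
The trainable-parameter figure can be sanity-checked from the adapter shapes. The sketch below is a back-of-the-envelope count under two assumptions not stated in this card: a head dimension of 128 (hidden size divided by attention heads) and an MLP intermediate size of 8960, the value used by standard Qwen2 1.5B configurations. `lora_trainable_params` is a hypothetical helper.

```python
LORA_RANK = 32
NUM_LAYERS = 28
HIDDEN_SIZE = 1536
KV_SIZE = 2 * (1536 // 12)  # 2 KV heads x assumed head_dim of 128 = 256
INTERMEDIATE_SIZE = 8960    # Assumption: standard Qwen2 1.5B MLP width, not stated in this card.

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, KV_SIZE),
    "v_proj": (HIDDEN_SIZE, KV_SIZE),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_trainable_params(rank: int = LORA_RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out) weights."""
    per_layer = sum(rank * (i + o) for i, o in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    print(lora_trainable_params())  # 36929536
```

Under these assumptions the count is 36,929,536, in line with the roughly 36.9M reported. The percentage then depends on the denominator; one plausible source of the spread is whether 4-bit packed base weights are counted as stored or as logical parameters.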

---

# Hardware and runtime

Training hardware:

- NVIDIA A100-SXM4-40GB
- ~42.4 GB VRAM exposed
- Compute capability 8.0
- BF16 support
- Flash Attention 2 support

The run specifically targeted A100-native BF16 and Flash Attention 2 optimizations.

Total training runtime was approximately 5 hours 14 minutes.

---

# Checkpointing and merge

Intermediate checkpoints were pushed to `abhinav0231/Lily-1.5b-distill-v3-checkpoints` during training.

The workflow included auto-resume logic from the latest Hugging Face checkpoint.
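
The card does not describe how the resume logic was implemented. As a minimal sketch, assuming checkpoint folders follow the `checkpoint-<step>` naming used by the Hugging Face Trainer, picking the latest one from a folder listing could look like this. `latest_checkpoint` is a hypothetical helper, and listing or downloading the checkpoint repo is left out.

```python
import re
from typing import Optional


def latest_checkpoint(names: list[str]) -> Optional[str]:
    """Return the `checkpoint-<step>` name with the highest step, or None if there is none."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = {int(m.group(1)): name for name in names if (m := pattern.match(name))}
    return steps[max(steps)] if steps else None


if __name__ == "__main__":
    # Hypothetical folder listing from a checkpoint repo.
    folders = ["checkpoint-500", "checkpoint-1000", "checkpoint-2500", "README.md"]
    print(latest_checkpoint(folders))  # checkpoint-2500
```

Steps are compared numerically rather than as strings, so `checkpoint-1000` correctly ranks above `checkpoint-500`. The selected folder would then typically be passed to `trainer.train(resume_from_checkpoint=...)`.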

After training, the LoRA adapter was merged back into the base model in BF16/16-bit form and pushed as `abhinav0231/Lily-1.5b-v0.3`.

The notebook also included GGUF export paths for quantized deployment variants.

---

# Training logs

The trainer log reported:

- 33,297 packed training examples
- 2 epochs
- 2,776 optimization steps
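
These figures are consistent with the batch size of 24 and gradient accumulation of 1 listed under the core hyperparameters. The short check below reproduces the step count, assuming the final partial batch of each epoch is kept rather than dropped. `optimizer_steps` is an illustrative helper, not a trainer API.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Optimizer updates for a run, counting a final partial batch in each epoch."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # 33,297 packed examples, batch size 24, no accumulation, 2 epochs.
    print(optimizer_steps(33_297, 24, 1, 2))  # 2776
```

Each epoch takes 1,388 steps (1,387 full batches plus one partial batch of 9 sequences), giving 2,776 over two epochs.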

Validation loss decreased from 9.100862 at step 500 to 8.973075 at step 2500.

These values should be interpreted as internal training diagnostics rather than direct end-user quality metrics.

---

# Intended use

This model is intended for:

- instruction-following chat experiments
- structured answer generation
- research on distilled reasoning-style outputs
- lightweight local or hosted inference in the 1.5B parameter class

It is especially suited to prompts where:

- a user asks for explanations or breakdowns
- the desired answer format is structured
- the prompt resembles the ChatML style used during training

---

# Prompting notes

Because the training data is ChatML-formatted, best results usually come from chat-style prompting rather than plain raw completion prompting.

The model may respond in a more verbose tutor-like style because many training prompts encouraged detailed reasoning followed by a final answer.

If a cleaner direct-answer style is preferred, using a concise system prompt and explicitly requesting short outputs can help steer generation.
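
One way to apply this steering is to switch the system prompt depending on the desired style. The sketch below is illustrative only: `build_messages` is a hypothetical helper and the concise instruction wording is an assumption, not a prompt validated against this model.

```python
def build_messages(user_prompt: str, concise: bool = False) -> list[dict[str, str]]:
    """Build a chat message list, optionally steering toward short direct answers."""
    system = "You are a helpful assistant."
    if concise:
        # Hypothetical wording; adjust to taste.
        system += " Answer directly in at most three sentences."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


if __name__ == "__main__":
    for message in build_messages("Explain overfitting in simple terms.", concise=True):
        print(message["role"], "->", message["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` so the prompt is rendered in the same ChatML format the model saw during training.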

---

# Limitations

This model was trained on synthetic/distilled instruction data rather than broad raw web-scale pretraining data.

As a result:

- outputs may reflect teacher-style formatting biases
- responses may become over-structured
- reasoning markup may occasionally appear in generations

The dataset sanity checks also flagged formatting irregularities in sampled rows, including repeated markers and malformed counts, so downstream behavior may inherit some formatting artifacts from the source corpus.
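
If reasoning markup is unwanted in downstream output, it can be stripped in post-processing. The following is a minimal sketch assuming well-formed `<think>` and `<answer>` tag pairs; `extract_answer` is a hypothetical helper, and malformed or unclosed tags are deliberately left untouched rather than guessed at.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_BLOCK = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(generation: str) -> str:
    """Return the <answer> content if present, else the text with <think> blocks removed."""
    match = ANSWER_BLOCK.search(generation)
    if match:
        return match.group(1).strip()
    return THINK_BLOCK.sub("", generation).strip()


if __name__ == "__main__":
    print(extract_answer("<think>2 + 2 = 4.</think>\n<answer>4</answer>"))  # 4
    print(extract_answer("<think>Short check.</think>Paris."))              # Paris.
```

Generations without any markup pass through unchanged apart from surrounding whitespace, so the helper is safe to apply to every response.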

---

# Safety

This model is not designed for fully autonomous use in high-stakes domains such as:

- legal
- medical
- financial
- safety-critical systems

Outputs can still be:

- incorrect
- incomplete
- overconfident

Human review is recommended for consequential use cases.

---

# Usage

## Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "abhinav0231/Lily-1.5b-v0.3"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain overfitting in simple terms."},
]

# Render the conversation with the tokenizer's ChatML chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Suggested prompting

For best results:
- use chat-style prompts,
- keep instructions explicit,
- specify desired format,
- request concise output if you do not want long reasoning-style responses.

## Provenance

- **Base model:** `abhinav0231/Lily-1.5b-v0.1`
- **Training dataset:** `abhinav0231/Sarvam-105b-Distill-100k` (`chatml`)
- **Training framework:** Unsloth + TRL
- **Hardware:** 1x NVIDIA A100-SXM4-40GB
- **Final merged repo:** `abhinav0231/Lily-1.5b-v0.3`

## Acknowledgements

This model was trained with Unsloth, Hugging Face Transformers, TRL, PEFT/LoRA-style fine-tuning, and W&B logging in a Modal-hosted workflow.

This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)