A 1.1 billion parameter decoder-only language model trained entirely from scratch -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.
Model Details
| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 tokens |
| Vocab Size | 32,003 |
| Precision | BFloat16 |
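The parameter count can be reproduced from the other figures listed under Model Details. The sketch below is only a sanity check and rests on two assumptions that this card does not state: the linear layers carry no bias terms, and the output projection (LM head) is not tied to the input embedding. The function name `count_parameters` is made up for this illustration. Under those assumptions the total matches the stated 1,105,827,840 exactly.

```python
# Figures taken from the Model Details section.
VOCAB = 32_003
HIDDEN = 2048
INTERMEDIATE = 5504
LAYERS = 22
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 64


def count_parameters(tied_embeddings: bool = False) -> int:
    """Count weights for a LLaMA-style decoder (assumes bias-free linear layers)."""
    embed = VOCAB * HIDDEN
    q_proj = HIDDEN * HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM   # separate K and V projections
    o_proj = HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE              # SwiGLU: gate, up, down
    norms = 2 * HIDDEN                           # two RMSNorm weights per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = HIDDEN
    # Assumption: untied LM head unless tied_embeddings is set.
    lm_head = 0 if tied_embeddings else VOCAB * HIDDEN
    return embed + LAYERS * per_layer + final_norm + lm_head


print(f"{count_parameters():,}")  # 1,105,827,840
```

With `tied_embeddings=True` the same function gives 1,040,285,696, which suggests the stated total counts the embedding and output matrices separately.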
Architecture Highlights
RoPE (Rotary Position Embeddings) with theta=10,000
Grouped Query Attention (GQA) -- 4:1 query-to-KV head ratio for efficient inference
SwiGLU Feed-Forward Network
RMSNorm in a pre-norm configuration
FlashAttention-2 via PyTorch scaled_dot_product_attention (SDPA)
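Two of the highlights above can be checked with a few lines of plain Python. The sketch below first estimates how much key/value cache a full 2048-token sequence needs with 8 KV heads versus 32, assuming the cache is stored in BFloat16 (2 bytes per value). It then applies a rotary embedding with theta=10,000 to a toy query and key and confirms that their dot product depends only on the relative offset between positions. The helper names (`kv_cache_bytes`, `rope_rotate`, `dot`) are made up for this example, and the pairing of adjacent dimensions is one common RoPE convention, not necessarily the layout this model uses.

```python
import math

# Figures from the model description; BYTES_PER_VALUE assumes a BFloat16 cache.
LAYERS, HEADS, KV_HEADS, HEAD_DIM, MAX_SEQ = 22, 32, 8, 64, 2048
BYTES_PER_VALUE = 2


def kv_cache_bytes(kv_heads: int, seq_len: int = MAX_SEQ) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * seq_len


def rope_rotate(vec, position, theta=10_000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by position * theta**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


gqa_mib = kv_cache_bytes(KV_HEADS) / 2**20
mha_mib = kv_cache_bytes(HEADS) / 2**20
print(f"GQA cache: {gqa_mib:.0f} MiB, full multi-head cache: {mha_mib:.0f} MiB")
# GQA cache: 88 MiB, full multi-head cache: 352 MiB

# Toy query/key vectors; both position pairs below are 2 tokens apart.
q = [0.1 * (i + 1) for i in range(HEAD_DIM)]
k = [0.05 * (HEAD_DIM - i) for i in range(HEAD_DIM)]
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
print(abs(near - far) < 1e-9)
```

The 4:1 head ratio translates directly into a 4x smaller KV cache per sequence, which is where the inference saving comes from. The second check illustrates why RoPE is described as a relative position encoding: shifting both positions by the same amount leaves the attention score unchanged.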
Training Pipeline
This model was built through a complete 3-stage training pipeline: