--- language: - en license: apache-2.0 library_name: transformers tags: - text-generation - gpt2 - causal-lm - fine-tuned - knowledge-repair pipeline_tag: text-generation model_type: gpt2 --- # FF_3.13 > **Champion model** of the FF-LLM line. A 2.02B GPT-2 architecture model fine-tuned through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering. --- ## Model overview | | | |---|---| | **Architecture** | GPT-2 (causal LM) | | **Parameters** | 2.02B | | **Hidden size** | 2048 | | **Layers** | 38 | | **Attention heads** | 16 | | **Vocab size** | 50,257 | | **Tokenizer** | GPT-2 BPE | | **Context length** | 1024 tokens | | **Precision** | bfloat16 (also fp16/fp32 compatible) | | **License** | Apache 2.0 | | **Author** | francescofiamingo1 | --- ## Benchmark performance ### MMLU (Massive Multitask Language Understanding) Evaluated with `lm-eval-harness v0.4.11`, greedy decoding. | Split | Score | |---|---| | **MMLU full (14,042 items)** | **28.05%** | | MMLU dev (285 items) | 25.61% | ### Macro-domain breakdown (MMLU full) | Macro | Subjects | Accuracy | |---|---|---| | STEM | 19 subjects | 30.70% | | Humanities | 13 subjects | 26.06% | | Social Sciences | 12 subjects | 30.03% | | Other (medicine, law, professional) | 13 subjects | 29.32% | ### 106-bench (custom factual benchmark) Custom 106-prompt benchmark with strict TRUTH-list scoring: | Category | N | Score | |---|---|---| | arithmetic | 5 | 5/5 (100.0%) | | open-ended | 1 | 1/1 (100.0%) | | person | 25 | 22/25 (88.0%) | | science | 25 | 21/25 (84.0%) | | geography | 25 | 15/25 (60.0%) | | format compliance | 25 | 15/25 (60.0%) | | **TOTAL** | **106** | **79/106 (74.5%)** | ### Improvement vs precursors | Model | MMLU full | Δ vs FF_3.13 | |---|---|---| | FF_3 (base, original release) | — | — | | FF_3.1 (post-SFT) | 26.72% | -1.33pp | | FF_3.11 (specialized variant) | 25.20% | -2.85pp | | **FF_3.13 (this model)** | **28.05%** | **— champion** | --- ## Training pipeline — chronological view The model went through **7 distinct stages** of training. Below is the complete history. ### Stage 1 — Pretraining **Architecture chosen:** GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus. | Item | Value | |---|---| | Hardware | 8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total) | | Throughput | **~220,000 tokens/sec sustained** (100% GPU utilization, all 8 GPUs in parallel) | | Framework | PyTorch + DeepSpeed ZeRO-2 | | Precision | bfloat16 | | **Total pretraining tokens** | **~90 billion tokens** | | Wall-clock pretraining time | ~5 days continuous (90B / 220K tok/s ≈ 4.7 days) | #### Pretraining data composition The pretraining corpus was assembled from **8 distinct sources**, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting. | Dataset | Module | Type | Quality weight | |---|---|---|---| | **FineWeb general** | M1 BASE | Web | medium | | **FineWeb 10BT** | M1 Extra25 | Web high quality | medium | | **FineWeb EDU** | M2 BASE | Educational | **high** | | **FineWeb EDU extended** | M2 Extra | Educational reasoning | medium | | **C4 EN** | M1 C4 | Web filtered | medium | | **Wikipedia EN** | M1 BASE | Encyclopedic | low | | **Web Clean custom** | M1 BASE / Extra | Web filtered | low | | **News crawl** | M1 BASE | Journalistic | low | #### Mix proportions (approximate) - **60–65% FineWeb** (various slices: general, 10BT, EDU, EDU extended) - **15–20% C4 EN** - **5–10% Wikipedia EN** - **5–10% Web Clean custom** - **~5% News crawl** This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding. ### Stage 2 — Supervised Fine-Tuning (SFT) **Objective:** general instruction-following + factual knowledge alignment. **Data sources (~860K total examples):** - **OpenHermes** (cleaned) - **UltraChat** (cleaned) - **WildChat** (cleaned) - **Numina** (math reasoning) - **OpenThoughts** (chain-of-thought) - **Eurus** (multi-task) Composition: ~760K core + 100K augmentation examples. Sharded under `s3://ff-llm-datasets/sft/shards_v2/`. ### Stage 3 — Direct Preference Optimization (DPO) — **REJECTED** **Two DPO experiments were attempted and discarded:** | DPO variant | Pairs | Result | |---|---|---| | v1 — WizardLM/Alpaca preferences | 38,863 | **-3pp MMLU** → rejected | | v2 — UltraFeedback (argilla/ultrafeedback-binarized) | 60,917 | **-3pp MMLU** → rejected | **Lesson:** DPO consistently caused MMLU regression (~-3pp) regardless of hyperparameters. **Not used in the final model.** ### Stage 4 — Distillation v3 Knowledge distillation from larger teacher models on a curated question set. | Item | Value | |---|---| | Total questions | 108,779 | | Source mix | hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%) | | S3 path | `s3://ff-llm-datasets/distill_v3/` | ### Stage 5 — LoRA experiments — **REJECTED** Multiple LoRA fine-tuning attempts were tried for surgical improvements: | LoRA experiment | Examples | Result | |---|---|---| | LoRA v4b (synthetic instruction) | 6,000 | marginal, not promoted | | LoRA format-only v1/v3 | 1,779–2,092 | **catastrophic forgetting** (-3 to -4pp MMLU full) | **Lesson:** LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. **Not used in the final model.** ### Stage 6 — Surgical Fine-Tuning Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only). | Item | Value | |---|---| | Examples | 3,000 | | Path | `D:\ff_llm\ff31_surgical.jsonl` | ### Stage 7 — Knowledge Repair Training (produces FF_3.13) The decisive stage that turned FF_3.11 into FF_3.13. **Dataset composition (16,006 total):** | Block | Description | Examples | |---|---|---| | Block A | MMLU-style MCQ (multiple choice questions across diverse subjects) | 10,714 | | Block B | Factual concise (TruthfulQA-like, <100 char answers) | 929 | | Block D | Numeric microreasoning (arithmetic word problems with step solutions) | 3,562 | | Validation set | held-out for monitoring | 801 | **Training configuration:** | Item | Value | |---|---| | Hardware | 8× NVIDIA RTX 5090 | | Framework | DeepSpeed ZeRO-2 | | Precision | bfloat16 | | Optimizer | AdamW | | Learning rate | 2.5e-6 (cosine schedule) | | Epochs | 3 (early-stopped at step 200/357) | | Effective batch size | (configured for 8-GPU DDP) | | Wall-clock | ~30 min total | **Checkpoint sweep & selection:** | Checkpoint | MMLU full | Status | |---|---|---| | 1-epoch ckpt-100 | 27.47% | not selected | | 3-epoch ckpt-50 | 27.21% | not selected | | 3-epoch ckpt-100 | 27.86% | not selected | | **3-epoch ckpt-150** | **28.05%** | **CHAMPION → FF_3.13** ✅ | | 3-epoch ckpt-200 | 28.17% | rejected (+0.12pp marginal, regressed 5/6 weak domains) | --- ## Compute infrastructure summary | Resource | Specification | |---|---| | GPU | 8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total) | | GPU utilization | ~100% sustained during training | | Throughput (pretraining) | ~220,000 tokens/sec | | Distributed training | DeepSpeed ZeRO-2 | | Numerical precision | bfloat16 (training and inference) | | Cloud provider | Vast.ai | --- ## Recommended usage ### Prompt template (Alpaca-style) ``` ### System: You are FF-LLM, a helpful assistant. ### Instruction: ### Response: ``` ### Decoding settings - **Always use greedy decoding** (`do_sample=False`). - Sampling has been shown to degrade factual accuracy by ~5pp on this model family. ### Quick start (transformers) ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch model_id = "francescofiamingo1/FF_3.13" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda") prompt = """### System: You are FF-LLM, a helpful assistant. ### Instruction: What is the capital of France? ### Response: """ inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` --- ## Limitations and known weaknesses - **2B parameters** — knowledge ceiling lower than 7B+ models - **Format compliance moderate**: 60% on strict format-discipline bench (yes/no, exact-N, single-letter) - **Entity disambiguation weakness**: occasional "anchor entity" over-attribution (e.g., default to Edison for inventor questions) - **Weak domains** (per qualitative analysis): mathematics, literature, music, art - **Strong domains**: biology, geography, basic science, factual short-form QA --- ## Variant lineage | Variant | Status | Notes | |---|---|---| | FF_3 | base | initial release | | FF_3.1 | published | post-SFT, MMLU 26.72% | | FF_3.2 | **discontinued** | early experiment, not maintained | | FF_3.11 | published | specialized variant, MMLU 25.20%, 106-bench 71% | | **FF_3.13** | **current champion** | knowledge repair on FF_3.11 base, MMLU 28.05% | | FF_3.14 | **rejected** | full SFT with humanities focus, MMLU flat (no improvement) | | SLERP t=0.10 (FF_3.13 + FF_3.11) | candidate backup | MMLU 29.10% (+0.41pp), 106-bench tie | --- ## Reproducibility All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage: - Datasets: `s3://ff-llm-datasets/` - Champion model: `s3://ff-llm-datasets/champions/latest/` - Build scripts: `s3://ff-llm-datasets/ff314/build/` (includes Block E/F builders, anti-anchoring tables, philosophy seeds) For reproduction support, contact the author. --- ## Citation ```bibtex @misc{ff_3_13_2026, author = {francescofiamingo1}, title = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning}, year = {2026}, publisher = {Hugging Face}, howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}} } ``` --- *Last updated: 2026-04-18*