FF_3.13 — Release Notes ========================= Model: FF_3.13 Source: ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448) Source path: /workspace/ff311_repair_output/checkpoint-150 Status: CHAMPION (supersedes FF_3.11 as primary FF-LLM release) Date: 2026-04-17 Architecture ------------ GPT-2 decoder-only, 2.02B parameters n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048 Vocabulary: GPT-2 BPE, 50257 tokens Precision: bf16 Training summary ---------------- Base: FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20) Hardware: 8x RTX 5090 (DeepSpeed ZeRO-2) Dataset: 15,205 train + 801 val examples (A=10714 MCQ / B=929 factual / D=3562 numeric) Hyperparams: lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing Epochs: 3 (early-stopped at step 200/357 via no-improvement rule) Selected step: 150 Main benchmark notes -------------------- MMLU full (lm-eval-harness v0.4.11): 28.05% - vs FF_3.11 baseline (25.20%): +2.85pp - vs FF_3.1 baseline (26.72%): +1.33pp - social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48% 106-bench total (greedy, rep_penalty=1.0): 74.5% - arith: 100.0% (vs FF_3.11 80.0%) - science: 84.0% (vs FF_3.11 80.0%) - geo: 64.0% (vs FF_3.11 56.0%) - person: 88.0% (tied) - format: 56.0% (vs FF_3.11 72.0% — known regression) Strengths --------- - arithmetic / science / geo factual recall - damaged MMLU domains (prof_medicine 43.38%, hs_statistics 38.43%, security 39.59%, hs_macroeconomics 34.62%, hs_government 32.64%, medical_genetics 26.00%) Known gaps ---------- - Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts) - Humanities / art / entity disambiguation (e.g., Edison over-anchoring) - Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples Rejected alternative -------------------- ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected: - gain below 0.15pp stability threshold - degraded 5 of 6 weak domains Prompt template (required) -------------------------- ### System: You are FF-LLM, a helpful assistant. ### Instruction: {question} ### Response: Decoding recommendations ------------------------ Use greedy (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0). Sampling at temperature 0.7 underperforms greedy on factual tests (~29% vs ~34%). Storage ------- S3 primary: s3://ff-llm-datasets/ff313/final/ S3 alias: s3://ff-llm-datasets/champions/latest/ HuggingFace: francescofiamingo1/FF_3.13 Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts) Local infer: C:\Users\f_fia\FF_3.13_inference\ (inference-only subset) Excluded from master -------------------- /workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2 optimizer shards). Not preserved — useful only for resuming DeepSpeed training from this exact step; weights are intact in model.safetensors. Conservative exclusion to avoid disproportionate storage cost.