FF_3.13 — Release Notes
=========================

Model:   FF_3.13
Source:  ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448)
          Source path: /workspace/ff311_repair_output/checkpoint-150
Status:  CHAMPION (supersedes FF_3.11 as primary FF-LLM release)
Date:    2026-04-17

Architecture
------------
GPT-2-style decoder-only transformer, 2.02B parameters
n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048
Vocabulary: GPT-2 BPE, 50257 tokens
Precision: bf16
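
The 2.02B figure can be cross-checked from the dimensions above. The sketch below is
an illustration, not the project's own tooling. It assumes the standard GPT-2 layout
(fused QKV projection with biases, learned position embeddings, two layer norms per
block plus a final one, output head tied to the token embedding) and an unpadded
50257-token vocabulary. The function name gpt2_param_count is hypothetical. n_heads
does not appear because it does not change the parameter count.

```python
def gpt2_param_count(n_layer, d_model, n_inner, n_ctx, vocab):
    """Count parameters of a GPT-2-style decoder with a tied output head."""
    # Token embedding table plus learned position embedding table.
    embeddings = vocab * d_model + n_ctx * d_model
    # One layer norm holds a scale vector and a bias vector.
    layer_norm = 2 * d_model
    # Fused QKV projection (weights + biases) and attention output projection.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward up-projection and down-projection, each with biases.
    mlp = (d_model * n_inner + n_inner) + (n_inner * d_model + d_model)
    per_layer = 2 * layer_norm + attention + mlp
    # Final layer norm after the last block; the tied head adds nothing.
    return embeddings + n_layer * per_layer + layer_norm


total = gpt2_param_count(n_layer=38, d_model=2048, n_inner=8192, n_ctx=2048, vocab=50257)
print(f"{total:,} parameters ({total / 1e9:.2f}B)")
```

Under these assumptions the count rounds to 2.02B, matching the figure stated above.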

Training summary
----------------
Base:           FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20)
Hardware:       8x RTX 5090 (DeepSpeed ZeRO-2)
Dataset:        15,205 train + 801 val examples (A=10714 MCQ / B=929 factual / D=3562 numeric)
Hyperparams:    lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing
Epochs:         3 planned (357 steps); early-stopped at step 200 by the no-improvement rule
Selected step:  150
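
For readers unfamiliar with the SLERP merge named in the Base line, the sketch below
shows spherical linear interpolation on a single flattened weight vector. It is a
generic illustration. The merge tool actually used is not stated in these notes, and
two points are assumptions: that the interpolation is applied per tensor, and that
t=0 returns the first model (FF_3.1), so t=0.20 moves 20% of the angle toward the
surgical fine-tune. The function name slerp is hypothetical.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between two flat weight vectors (lists of floats)."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    linear = [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return linear  # the angle is undefined for a zero vector
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return linear  # nearly parallel vectors: fall back to plain interpolation
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Toy example with two orthogonal unit vectors and the t value from the notes.
merged = slerp(0.20, [1.0, 0.0], [0.0, 1.0])
print(merged)
```

For unit-norm inputs the result stays on the unit sphere, which is the usual reason
to prefer SLERP over plain averaging when the two weight sets point in different
directions.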

Main benchmark notes
--------------------
MMLU full (lm-eval-harness v0.4.11):   28.05%
  - vs FF_3.11 baseline (25.20%):      +2.85pp
  - vs FF_3.1 baseline (26.72%):       +1.33pp
  - social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48%

106-bench total (greedy, rep_penalty=1.0):  74.5%
  - arith:    100.0% (vs FF_3.11 80.0%)
  - science:  84.0%  (vs FF_3.11 80.0%)
  - geo:      64.0%  (vs FF_3.11 56.0%)
  - person:   88.0%  (tied with FF_3.11)
  - format:   56.0%  (vs FF_3.11 72.0%; known regression, see Known gaps)

Strengths
---------
- Arithmetic, science, and geography factual recall
- Scores on the six previously damaged MMLU domains (prof_medicine 43.38%,
  hs_statistics 38.43%, security 39.59%, hs_macroeconomics 34.62%,
  hs_government 32.64%, medical_genetics 26.00%)

Known gaps
----------
- Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts)
- Humanities / art / entity disambiguation (e.g., Edison over-anchoring)
- Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples

Rejected alternative
--------------------
ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected for two reasons:
- its gain is below the 0.15pp stability threshold
- it degraded 5 of the 6 weak MMLU domains
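
The rejection criteria above can be sketched as a small decision function. This is
an illustration only. The 0.15pp threshold and the scores come from these notes, but
how the two criteria are combined is an assumption: here a later checkpoint is
accepted only if it clears the threshold and degrades no more than half of the weak
domains. The function name accept_candidate and the max_degraded_fraction parameter
are hypothetical.

```python
STABILITY_THRESHOLD_PP = 0.15  # from the notes: minimum MMLU gain that counts as real


def accept_candidate(current_mmlu, candidate_mmlu, degraded, weak_total,
                     threshold_pp=STABILITY_THRESHOLD_PP,
                     max_degraded_fraction=0.5):
    """Return True if the candidate checkpoint should replace the current one.

    max_degraded_fraction is an assumed cutoff, not a documented project rule.
    """
    gain_pp = round(candidate_mmlu - current_mmlu, 2)  # avoid float noise in the diff
    clears_threshold = gain_pp >= threshold_pp
    keeps_weak_domains = degraded / weak_total <= max_degraded_fraction
    return clears_threshold and keeps_weak_domains


# ckpt-200 vs ckpt-150, using the figures from this section.
decision = accept_candidate(28.05, 28.17, degraded=5, weak_total=6)
print("promote ckpt-200" if decision else "keep ckpt-150")
```

With the figures in this section, both checks fail for ckpt-200, so ckpt-150 stays
selected.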

Prompt template (required)
--------------------------
### System:
You are FF-LLM, a helpful assistant.

### Instruction:
{question}

### Response:
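
A minimal sketch of filling the template programmatically, so that callers cannot
drift from the required layout. The names PROMPT_TEMPLATE and build_prompt are
hypothetical. The single trailing newline after "### Response:" is an assumption,
since the notes do not say whether the training data ended the header with one.

```python
PROMPT_TEMPLATE = (
    "### System:\n"
    "You are FF-LLM, a helpful assistant.\n"
    "\n"
    "### Instruction:\n"
    "{question}\n"
    "\n"
    "### Response:\n"  # trailing newline is an assumption, see lead-in
)


def build_prompt(question: str) -> str:
    """Insert a user question into the required FF-LLM prompt template."""
    # Strip surrounding whitespace so the blank-line layout stays exact.
    return PROMPT_TEMPLATE.format(question=question.strip())


prompt = build_prompt("  What is the capital of France?  ")
print(prompt)
```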

Decoding recommendations
------------------------
Use greedy decoding (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0).
Sampling at temperature 0.7 underperforms greedy on factual tests (~29% vs ~34%).
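
The recommended settings map directly onto keyword arguments of the Hugging Face
transformers generate() method. The sketch below is one way to apply them. The names
GREEDY_KWARGS and generate_answer, the max_new_tokens default of 64, and the use of
the EOS token as padding are assumptions that are not part of these notes.

```python
# Decoding settings copied from the recommendation above.
GREEDY_KWARGS = {
    "do_sample": False,
    "num_beams": 1,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.0,
}


def generate_answer(model, tokenizer, prompt, max_new_tokens=64):
    """Greedy-decode a completion for an already formatted prompt.

    `model` and `tokenizer` are expected to be a transformers causal LM and its
    tokenizer; max_new_tokens=64 is an assumed default, not a documented value.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
        **GREEDY_KWARGS,
    )
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Depending on the transformers version, passing sampling-only flags such as top_k
together with do_sample=False may produce a warning that they are ignored. The
decoding itself remains greedy either way.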

Storage
-------
S3 primary:   s3://ff-llm-datasets/ff313/final/
S3 alias:     s3://ff-llm-datasets/champions/latest/
HuggingFace:  francescofiamingo1/FF_3.13
Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts)
Local infer:  C:\Users\f_fia\FF_3.13_inference\ (inference-only subset)

Excluded from master
--------------------
/workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2
optimizer shards). These shards were not preserved: they are useful only for resuming
DeepSpeed training from this exact step, and the weights are intact in
model.safetensors. They were excluded to avoid disproportionate storage cost.