FF_3.13 — Release Notes
=========================

Model: FF_3.13
Source: ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448)
Source path: /workspace/ff311_repair_output/checkpoint-150
Status: CHAMPION (supersedes FF_3.11 as primary FF-LLM release)
Date: 2026-04-17

Architecture
------------
GPT-2 decoder-only, 2.02B parameters
n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048
Vocabulary: GPT-2 BPE, 50257 tokens
Precision: bf16

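As a rough cross-check of the 2.02B figure, the parameter count can be recomputed from the dimensions listed in the Architecture section. This is a sketch that assumes the standard GPT-2 layout (learned position embeddings, biases on every linear layer, two LayerNorms per block plus a final one, output head tied to the token embedding); these notes do not state any of those details, and the function name is illustrative only.

```python
def gpt2_param_count(n_layer, d_model, n_inner, vocab_size, context):
    """Count parameters for an assumed standard GPT-2 layout."""
    # Attention: fused QKV projection plus output projection, each with a bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection to n_inner and down-projection back, each with a bias.
    mlp = (d_model * n_inner + n_inner) + (n_inner * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attn + mlp + layer_norms
    # Token and position embeddings; the output head is assumed tied (no extra weights).
    embeddings = vocab_size * d_model + context * d_model
    final_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_norm


total = gpt2_param_count(n_layer=38, d_model=2048, n_inner=8192,
                         vocab_size=50257, context=2048)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under those assumptions the count comes to 2,020,739,072, which rounds to the 2.02B stated above.
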
Training summary
----------------
Base: FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20)
Hardware: 8x RTX 5090 (DeepSpeed ZeRO-2)
Dataset: 15,205 train + 801 val examples (A=10,714 MCQ / B=929 factual / D=3,562 numeric)
Hyperparams: lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing
Epochs: 3 scheduled (357 steps); early-stopped at step 200 via no-improvement rule
Selected step: 150

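To get a feel for the learning rate in effect at the selected step, the schedule can be sketched in a few lines. This assumes that warmup=0.05 is a fraction of the 357 scheduled steps, that warmup is linear from zero, and that the cosine decays to zero at step 357 (the shape of the commonly used cosine-with-warmup schedule). The notes confirm none of these, so treat the numbers as indicative only.

```python
import math

PEAK_LR = 2.5e-6
TOTAL_STEPS = 357
# Assumption: warmup=0.05 is a ratio of total steps, rounded up (18 steps).
WARMUP_STEPS = math.ceil(0.05 * TOTAL_STEPS)


def lr_at(step):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (WARMUP_STEPS, 150, 200, TOTAL_STEPS):
    print(step, f"{lr_at(step):.3e}")
```
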
Main benchmark notes
--------------------
MMLU full (lm-eval-harness v0.4.11): 28.05%
- vs FF_3.11 baseline (25.20%): +2.85pp
- vs FF_3.1 baseline (26.72%): +1.33pp
- social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48%

106-bench total (greedy, rep_penalty=1.0): 74.5%
- arith: 100.0% (vs FF_3.11 80.0%)
- science: 84.0% (vs FF_3.11 80.0%)
- geo: 64.0% (vs FF_3.11 56.0%)
- person: 88.0% (tied with FF_3.11)
- format: 56.0% (vs FF_3.11 72.0%; known regression)

Strengths
---------
- arithmetic / science / geo factual recall
- damaged MMLU domains (prof_medicine 43.38%, hs_statistics 38.43%, security 39.59%,
  hs_macroeconomics 34.62%, hs_government 32.64%, medical_genetics 26.00%)

Known gaps
----------
- Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts)
- Humanities / art / entity disambiguation (e.g., Edison over-anchoring)
- Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples

Rejected alternative
--------------------
ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected:
- gain below 0.15pp stability threshold
- degraded 5 of 6 weak domains

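The rejection of ckpt-200 can be read as a small decision rule. The sketch below is one interpretation of the two criteria listed in this section, not the script actually used: the function name and the `max_degraded_weak` cutoff are hypothetical, since the notes say only that 5 of 6 weak domains degraded, not how many would have been tolerated.

```python
def should_replace(current_mmlu, candidate_mmlu, degraded_weak,
                   min_gain_pp=0.15, max_degraded_weak=0):
    """Return True if a later checkpoint should replace the current pick.

    min_gain_pp is the 0.15pp stability threshold from the notes;
    max_degraded_weak is a hypothetical cutoff on degraded weak domains.
    """
    # Round to two decimals so float noise does not flip a borderline comparison.
    gain_pp = round(candidate_mmlu - current_mmlu, 2)
    return gain_pp >= min_gain_pp and degraded_weak <= max_degraded_weak


# ckpt-150 (28.05%) vs ckpt-200 (28.17%), with 5 of 6 weak domains degraded.
print(should_replace(28.05, 28.17, degraded_weak=5))  # False
```

With the figures above, ckpt-200 fails both checks: the 0.12pp gain is under the threshold, and weak domains regressed.
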
Prompt template (required)
--------------------------
### System:
You are FF-LLM, a helpful assistant.

### Instruction:
{question}

### Response:

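A minimal sketch of filling the required template in code. The exact whitespace is an assumption (one blank line between blocks and a single newline after "### Response:"); check it against the training data format before relying on it. The names `PROMPT_TEMPLATE` and `build_prompt` are illustrative.

```python
# Assumed whitespace: one blank line between blocks, newline after "### Response:".
PROMPT_TEMPLATE = (
    "### System:\n"
    "You are FF-LLM, a helpful assistant.\n"
    "\n"
    "### Instruction:\n"
    "{question}\n"
    "\n"
    "### Response:\n"
)


def build_prompt(question: str) -> str:
    """Insert the user question into the required FF-LLM prompt template."""
    return PROMPT_TEMPLATE.format(question=question.strip())


print(build_prompt("What is the capital of France?"))
```
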
Decoding recommendations
------------------------
Use greedy decoding (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0).
Sampling at temperature 0.7 underperforms greedy on factual tests (~29% vs ~34%).

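The recommended settings map directly onto Hugging Face `generate()` keyword arguments. The sketch below assumes the HuggingFace repo named under Storage loads with `AutoModelForCausalLM` and `AutoTokenizer`, which these notes do not confirm; `max_new_tokens=64` and the function name are arbitrary illustrative choices. Some transformers versions may warn that `top_k` is set while sampling is disabled; that does not affect greedy output.

```python
# Decoding settings exactly as recommended in the notes.
GREEDY_KWARGS = dict(
    do_sample=False,
    num_beams=1,
    top_p=1.0,
    top_k=0,
    repetition_penalty=1.0,
)


def generate_greedy(prompt, model_id="francescofiamingo1/FF_3.13", max_new_tokens=64):
    """Greedy-decode a completion for an already formatted prompt string."""
    # Imported lazily so the settings above can be inspected without these libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
            **GREEDY_KWARGS,
        )
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Pass a prompt that already follows the required template; the function returns only the text generated after "### Response:".
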
Storage
-------
S3 primary: s3://ff-llm-datasets/ff313/final/
S3 alias: s3://ff-llm-datasets/champions/latest/
HuggingFace: francescofiamingo1/FF_3.13
Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts)
Local infer: C:\Users\f_fia\FF_3.13_inference\ (inference-only subset)

Excluded from master
--------------------
/workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2
optimizer shards). Not preserved: the shards are useful only for resuming DeepSpeed
training from this exact step, and the weights are intact in model.safetensors.
Excluded to avoid disproportionate storage cost.