FF_3.13/RELEASE_NOTES.txt
ModelHub XC d72c5d9225 Initialize project; model provided by the ModelHub XC community
Model: francescofiamingo1/FF_3.13
Source: Original Platform
2026-04-27 07:28:11 +08:00

FF_3.13 — Release Notes
=======================
Model: FF_3.13
Source: ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448)
Source path: /workspace/ff311_repair_output/checkpoint-150
Status: CHAMPION (supersedes FF_3.11 as primary FF-LLM release)
Date: 2026-04-17

Architecture
------------
GPT-2 decoder-only, 2.02B parameters
n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048
Vocabulary: GPT-2 BPE, 50257 tokens
Precision: bf16
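
As a sanity check on the stated size, the parameter count can be estimated from the listed dimensions. This is a rough sketch, not the author's accounting: it assumes the standard GPT-2 layout (biases on every linear layer, learned position embeddings, and an output head tied to the token embedding), none of which is stated in these notes.

```python
# Rough parameter count for a GPT-2 style decoder with the dimensions listed above.
# Assumptions (not stated in the notes): biases on every linear layer, learned
# position embeddings, and an output head tied to the token embedding.

def gpt2_param_count(n_layer, d_model, n_inner, vocab_size, n_positions):
    layer_norm = 2 * d_model                          # scale + shift
    attn_qkv = d_model * 3 * d_model + 3 * d_model    # fused Q, K, V projection
    attn_out = d_model * d_model + d_model            # attention output projection
    mlp_in = d_model * n_inner + n_inner              # first feed-forward layer
    mlp_out = n_inner * d_model + d_model             # second feed-forward layer
    per_layer = 2 * layer_norm + attn_qkv + attn_out + mlp_in + mlp_out

    embeddings = vocab_size * d_model + n_positions * d_model
    final_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_norm


total = gpt2_param_count(n_layer=38, d_model=2048, n_inner=8192,
                         vocab_size=50257, n_positions=2048)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes out at about 2.02B, consistent with the stated size. The head count (n_heads=16) does not change the total; it only fixes the per-head width at 2048 / 16 = 128.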

Training summary
----------------
Base: FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20)
Hardware: 8x RTX 5090 (DeepSpeed ZeRO-2)
Dataset: 15,205 train + 801 val examples (train by type: A=10,714 MCQ / B=929 factual / D=3,562 numeric)
Hyperparams: lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing
Epochs: 3 planned (357 steps); early-stopped at step 200 via no-improvement rule
Selected step: 150
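
The schedule figures above can be related to each other with a short calculation. This is an illustrative sketch under two assumptions that the notes do not confirm: that "warmup=0.05" is a warmup ratio over the 357 planned steps, and that the cosine schedule is a linear warmup followed by a half-cosine decay to zero. The names `lr_at`, `WARMUP_STEPS` and `implied_batch` are introduced here for illustration only.

```python
import math

# Values taken from the notes above.
PEAK_LR = 2.5e-6
TOTAL_STEPS = 357        # planned length of the 3-epoch run
EPOCHS = 3
TRAIN_EXAMPLES = 15_205

# Assumption: "warmup=0.05" is a warmup ratio over the planned steps.
WARMUP_STEPS = math.ceil(0.05 * TOTAL_STEPS)


def lr_at(step):
    """Linear warmup to PEAK_LR, then half-cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps_per_epoch = TOTAL_STEPS / EPOCHS              # optimizer steps per epoch
implied_batch = TRAIN_EXAMPLES / steps_per_epoch    # examples consumed per step

for step in (150, 200):
    print(f"step {step}: epoch {step / steps_per_epoch:.2f}, lr {lr_at(step):.3e}")
print(f"implied examples per step: {implied_batch:.1f}")
```

With 119 steps per epoch, the selected step 150 falls about 1.26 epochs in and the early stop at step 200 about 1.68 epochs in, so the run never completed a second epoch. The implied figure of roughly 128 examples per step suggests an effective batch size of 128, although the notes do not state one.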

Main benchmark notes
--------------------
MMLU full (lm-eval-harness v0.4.11): 28.05%
- vs FF_3.11 baseline (25.20%): +2.85pp
- vs FF_3.1 baseline (26.72%): +1.33pp
- social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48%
106-bench total (greedy, rep_penalty=1.0): 74.5%
- arith: 100.0% (vs FF_3.11 80.0%)
- science: 84.0% (vs FF_3.11 80.0%)
- geo: 64.0% (vs FF_3.11 56.0%)
- person: 88.0% (tied)
- format: 56.0% (vs FF_3.11 72.0% — known regression)
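
The percentage-point deltas quoted in this section can be recomputed directly from the listed scores. The snippet below is a small bookkeeping sketch; the dictionary layout and the helper name `delta_pp` are illustrative, and the FF_3.11 "person" score is taken to be 88.0% because the notes mark it as tied.

```python
# Scores copied from the notes above: (FF_3.13, FF_3.11), in percent.
scores = {
    "MMLU full": (28.05, 25.20),
    "arith": (100.0, 80.0),
    "science": (84.0, 80.0),
    "geo": (64.0, 56.0),
    "person": (88.0, 88.0),   # "tied" in the notes
    "format": (56.0, 72.0),
}


def delta_pp(new, old):
    """Difference in percentage points, rounded to two decimals."""
    return round(new - old, 2)


deltas = {name: delta_pp(new, old) for name, (new, old) in scores.items()}
regressions = [name for name, d in deltas.items() if d < 0]

for name, d in deltas.items():
    print(f"{name:>10}: {d:+.2f}pp")
print("regressions:", regressions)
```

Only the format category comes out negative (-16pp), matching the regression flagged above.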

Strengths
---------
- arithmetic / science / geo factual recall
- recovery in previously damaged MMLU domains (prof_medicine 43.38%, hs_statistics 38.43%,
  security 39.59%, hs_macroeconomics 34.62%, hs_government 32.64%, medical_genetics 26.00%)

Known gaps
----------
- Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts)
- Humanities / art / entity disambiguation (e.g., Edison over-anchoring)
- Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples

Rejected alternative
--------------------
ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected:
- gain below 0.15pp stability threshold
- degraded 5 of 6 weak domains
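
The two rejection reasons above can be written as a simple acceptance check. This is a hypothetical sketch, not the selection script used for this release: the 0.15pp threshold comes from the notes, but the rule that a checkpoint fails when more than half of the weak domains degrade, and the choice to treat either reason as sufficient, are assumptions made for illustration.

```python
def accept_later_checkpoint(current_mmlu, candidate_mmlu,
                            weak_domains_degraded, weak_domains_total,
                            min_gain_pp=0.15):
    """Return (accepted, reasons) for replacing the current checkpoint."""
    reasons = []
    gain = candidate_mmlu - current_mmlu
    if gain < min_gain_pp:
        reasons.append(f"gain {gain:+.2f}pp below {min_gain_pp}pp stability threshold")
    # Hypothetical rule: reject when more than half of the weak domains degrade.
    if weak_domains_degraded * 2 > weak_domains_total:
        reasons.append(
            f"degraded {weak_domains_degraded} of {weak_domains_total} weak domains"
        )
    return (not reasons, reasons)


# ckpt-200 vs ckpt-150, using the figures from the notes.
accepted, reasons = accept_later_checkpoint(28.05, 28.17, 5, 6)
print(accepted, reasons)
```

With the figures from the notes, both checks fail for ckpt-200, so ckpt-150 stays selected.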

Prompt template (required)
--------------------------
### System:
You are FF-LLM, a helpful assistant.
### Instruction:
{question}
### Response:
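
A small helper can fill this template consistently. The sketch below assumes single newlines between the lines shown and a trailing newline after "### Response:"; the notes do not specify the exact whitespace, so it is worth checking against the training data format. The names `PROMPT_TEMPLATE` and `build_prompt` are illustrative.

```python
# Assumption: sections are separated by single newlines, with a trailing newline
# after "### Response:". The exact whitespace is not specified in the notes.
PROMPT_TEMPLATE = (
    "### System:\n"
    "You are FF-LLM, a helpful assistant.\n"
    "### Instruction:\n"
    "{question}\n"
    "### Response:\n"
)


def build_prompt(question: str) -> str:
    """Fill the required template with a single user question."""
    return PROMPT_TEMPLATE.format(question=question.strip())


print(build_prompt("What is the capital of France?"))
```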

Decoding recommendations
------------------------
Use greedy (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0).
Sampling at temperature 0.7 underperforms greedy on factual tests (~29% for sampling vs ~34% for greedy).
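
The greedy settings can be passed as keyword arguments to the Hugging Face transformers `generate` method. The following is a minimal sketch that assumes the model and tokenizer have already been loaded as transformers objects (for example with AutoModelForCausalLM and AutoTokenizer) and that the prompt is already formatted with the required template. The helper name `generate_answer` and the `max_new_tokens=64` limit are illustrative and do not come from these notes.

```python
# Greedy decoding settings as listed above, in the keyword form accepted by
# the Hugging Face transformers `generate` method.
GREEDY_KWARGS = {
    "do_sample": False,
    "num_beams": 1,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.0,
}


def generate_answer(model, tokenizer, prompt, max_new_tokens=64):
    """Greedy-decode a completion for an already formatted prompt.

    `model` and `tokenizer` are expected to be transformers objects.
    max_new_tokens=64 is an arbitrary illustrative limit.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        **GREEDY_KWARGS,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
    )
    prompt_len = inputs["input_ids"].shape[1]
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

With do_sample=False the top_p and top_k values have no effect on the output; some transformers versions may log a warning about sampling-only parameters being set, in which case they can be dropped from the dictionary.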

Storage
-------
S3 primary: s3://ff-llm-datasets/ff313/final/
S3 alias: s3://ff-llm-datasets/champions/latest/
HuggingFace: francescofiamingo1/FF_3.13
Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts)
Local infer: C:\Users\f_fia\FF_3.13_inference\ (inference-only subset)

Excluded from master
--------------------
/workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2
optimizer shards). Not preserved — useful only for resuming DeepSpeed training from this
exact step; weights are intact in model.safetensors. Conservative exclusion to avoid
disproportionate storage cost.