Initialize project; model provided by the ModelHub XC community

Model: francescofiamingo1/FF_3.13 Source: Original Platform
2026-04-27 07:28:11 +08:00
commit d72c5d9225
11 changed files with 351125 additions and 0 deletions
--- a/RELEASE_NOTES.txt
+++ b/RELEASE_NOTES.txt
@@ -0,0 +1,86 @@
+FF_3.13 — Release Notes
+=========================
+
+Model:   FF_3.13
+Source:  ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448)
+          Source path: /workspace/ff311_repair_output/checkpoint-150
+Status:  CHAMPION (supersedes FF_3.11 as primary FF-LLM release)
+Date:    2026-04-17
+
+Architecture
+------------
+GPT-2 decoder-only, 2.02B parameters
+n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048
+Vocabulary: GPT-2 BPE, 50257 tokens
+Precision: bf16
+
+Training summary
+----------------
+Base:           FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20)
+Hardware:       8x RTX 5090 (DeepSpeed ZeRO-2)
+Dataset:        15,205 train + 801 val examples (A=10714 MCQ / B=929 factual / D=3562 numeric)
+Hyperparams:    lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing
+Epochs:         3 planned (early-stopped at step 200 of 357 by the no-improvement rule)
+Selected step:  150
+
+Main benchmark notes
+--------------------
+MMLU full (lm-eval-harness v0.4.11):   28.05%
+  - vs FF_3.11 baseline (25.20%):      +2.85pp
+  - vs FF_3.1 baseline (26.72%):       +1.33pp
+  - social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48%
+
+106-bench total (greedy, rep_penalty=1.0):  74.5%
+  - arith:    100.0% (vs FF_3.11 80.0%)
+  - science:  84.0%  (vs FF_3.11 80.0%)
+  - geo:      64.0%  (vs FF_3.11 56.0%)
+  - person:   88.0%  (tied)
+  - format:   56.0%  (vs FF_3.11 72.0% — known regression)
+
+Strengths
+---------
+- arithmetic / science / geo factual recall
+- recovery on damaged MMLU domains (prof_medicine 43.38%, hs_statistics 38.43%,
+  security 39.59%, hs_macroeconomics 34.62%, hs_government 32.64%, medical_genetics 26.00%)
+
+Known gaps
+----------
+- Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts)
+- Humanities / art / entity disambiguation (e.g., Edison over-anchoring)
+- Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples
+
+Rejected alternative
+--------------------
+ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected:
+- gain of 0.12pp is below the 0.15pp stability threshold
+- degraded 5 of 6 weak domains
+
+Prompt template (required)
+--------------------------
+### System:
+You are FF-LLM, a helpful assistant.
+
+### Instruction:
+{question}
+
+### Response:
+
+Decoding recommendations
+------------------------
+Use greedy (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0).
+Sampling at temperature 0.7 underperforms greedy on factual tests (~29% vs ~34% for greedy).
+
+Storage
+-------
+S3 primary:   s3://ff-llm-datasets/ff313/final/
+S3 alias:     s3://ff-llm-datasets/champions/latest/
+HuggingFace:  francescofiamingo1/FF_3.13
+Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts)
+Local infer:  C:\Users\f_fia\FF_3.13_inference\ (inference-only subset)
+
+Excluded from master
+--------------------
+/workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2
+optimizer shards) is not preserved. It is useful only for resuming DeepSpeed training from
+this exact step; the weights are intact in model.safetensors. It was excluded to avoid
+disproportionate storage cost.
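
The prompt template and decoding recommendations in the release notes above can be sketched in Python roughly as follows. This is an illustrative sketch, not code shipped with the release: `build_prompt`, `GREEDY_KWARGS`, `generate_answer`, and the `max_new_tokens=64` default are hypothetical names and values, and the exact whitespace after `### Response:` is an assumption. The `generate_answer` helper assumes the Hugging Face `transformers` and `torch` packages are installed and that the checkpoint loads through `AutoModelForCausalLM`.

```python
# Sketch only: names below are illustrative, not part of the FF_3.13 release.

# Prompt template copied from the release notes. The single trailing newline
# after "### Response:" is an assumption.
PROMPT_TEMPLATE = (
    "### System:\n"
    "You are FF-LLM, a helpful assistant.\n"
    "\n"
    "### Instruction:\n"
    "{question}\n"
    "\n"
    "### Response:\n"
)

# Decoding settings as listed in the release notes. With do_sample=False,
# top_p and top_k have no effect on the output; they are kept here only to
# mirror the notes.
GREEDY_KWARGS = {
    "do_sample": False,
    "num_beams": 1,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.0,
}


def build_prompt(question: str) -> str:
    """Wrap a user question in the required prompt template."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def generate_answer(
    question: str,
    model_id: str = "francescofiamingo1/FF_3.13",
    max_new_tokens: int = 64,  # assumed value, not from the release notes
) -> str:
    """Greedy-decode one answer. Heavy imports are local so that
    build_prompt can be used without transformers installed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
            **GREEDY_KWARGS,
        )
    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


if __name__ == "__main__":
    print(build_prompt("What is the capital of France?"))
```

Calling `generate_answer("What is the capital of France?")` would download the weights and run a single greedy pass. Depending on the installed `transformers` version, passing `top_k=0` alongside `do_sample=False` may produce a warning; dropping the two sampling-only keys from `GREEDY_KWARGS` should not change greedy output.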