Initialize project; model provided by the ModelHub XC community

Model: francescofiamingo1/FF_3.13 Source: Original Platform
2026-04-27 07:28:11 +08:00
commit d72c5d9225
11 changed files with 351125 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,328 @@
 ---
 language:
 - en
 license: apache-2.0
 library_name: transformers
 tags:
 - text-generation
 - gpt2
 - causal-lm
 - fine-tuned
 - knowledge-repair
 pipeline_tag: text-generation
 model_type: gpt2
 ---
 # FF_3.13
 > **Champion model** of the FF-LLM line. A 2.02B-parameter GPT-2-architecture model trained through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.
 ---
 ## Model overview
 | | |
 |---|---|
 | **Architecture** | GPT-2 (causal LM) |
 | **Parameters** | 2.02B |
 | **Hidden size** | 2048 |
 | **Layers** | 38 |
 | **Attention heads** | 16 |
 | **Vocab size** | 50,257 |
 | **Tokenizer** | GPT-2 BPE |
 | **Context length** | 1024 tokens (tokenizer `model_max_length`; `config.json` sets `n_positions` to 2048) |
 | **Precision** | bfloat16 (also fp16/fp32 compatible) |
 | **License** | Apache 2.0 |
 | **Author** | francescofiamingo1 |
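
The 2.02B figure can be cross-checked against the shape values in the table. The sketch below is a back-of-the-envelope count that assumes a standard GPT-2 block layout (tied input/output embeddings, 4× MLP width, biases on every linear layer) and 2048 learned positions, as set in this repository's `config.json`. The function name `gpt2_param_count` is illustrative and is not part of any library.

```python
def gpt2_param_count(n_layer: int, d_model: int, vocab_size: int, n_positions: int) -> int:
    """Approximate parameter count for a standard GPT-2 stack with tied embeddings."""
    # Token embeddings plus learned position embeddings
    # (the LM head reuses the token embedding matrix, so it adds nothing).
    embeddings = vocab_size * d_model + n_positions * d_model
    # Per block: QKV projection (3d^2 + 3d), attention output (d^2 + d),
    # MLP up-projection (4d^2 + 4d), MLP down-projection (4d^2 + d),
    # and two LayerNorms (2d each) -> 12d^2 + 13d.
    per_block = 12 * d_model**2 + 13 * d_model
    # Final LayerNorm (weight + bias).
    final_ln = 2 * d_model
    return embeddings + n_layer * per_block + final_ln


total = gpt2_param_count(n_layer=38, d_model=2048, vocab_size=50257, n_positions=2048)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```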
 ---
 ## Benchmark performance
 ### MMLU (Massive Multitask Language Understanding)
 Evaluated with `lm-eval-harness v0.4.11`, greedy decoding.
 | Split | Score |
 |---|---|
 | **MMLU full (14,042 items)** | **28.05%** |
 | MMLU dev (285 items) | 25.61% |
 ### Macro-domain breakdown (MMLU full)
 | Macro | Subjects | Accuracy |
 |---|---|---|
 | STEM | 19 subjects | 30.70% |
 | Humanities | 13 subjects | 26.06% |
 | Social Sciences | 12 subjects | 30.03% |
 | Other (medicine, law, professional) | 13 subjects | 29.32% |
 ### 106-bench (custom factual benchmark)
 Custom 106-prompt benchmark with strict TRUTH-list scoring:
 | Category | N | Score |
 |---|---|---|
 | arithmetic | 5 | 5/5 (100.0%) |
 | open-ended | 1 | 1/1 (100.0%) |
 | person | 25 | 22/25 (88.0%) |
 | science | 25 | 21/25 (84.0%) |
 | geography | 25 | 15/25 (60.0%) |
 | format compliance | 25 | 15/25 (60.0%) |
 | **TOTAL** | **106** | **79/106 (74.5%)** |
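
The TOTAL row pools all 106 prompts, so the four 25-prompt categories dominate it. As a quick check, the sketch below recomputes that pooled (micro) score from the per-category rows and, for contrast, the unweighted mean of the category percentages. The variable names are illustrative.

```python
# (correct, total) per category, copied from the table above.
results = {
    "arithmetic": (5, 5),
    "open-ended": (1, 1),
    "person": (22, 25),
    "science": (21, 25),
    "geography": (15, 25),
    "format compliance": (15, 25),
}

correct = sum(c for c, _ in results.values())
total = sum(n for _, n in results.values())

# Micro-average: every prompt counts equally (this is the TOTAL row).
micro = 100 * correct / total
# Macro-average: every category counts equally, regardless of its size.
macro = sum(100 * c / n for c, n in results.values()) / len(results)

print(f"micro: {correct}/{total} = {micro:.1f}%")
print(f"macro: {macro:.1f}%")
```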
 ### Improvement vs precursors
 | Model | MMLU full | Δ vs FF_3.13 |
 |---|---|---|
 | FF_3 (base, original release) | — | — |
 | FF_3.1 (post-SFT) | 26.72% | -1.33pp |
 | FF_3.11 (specialized variant) | 25.20% | -2.85pp |
 | **FF_3.13 (this model)** | **28.05%** | **— champion** |
 ---
 ## Training pipeline — chronological view
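 The pipeline comprises **7 stages**, listed below in chronological order. Two of them (Stage 3, DPO, and Stage 5, LoRA) were tried and rejected, so they did not contribute to the final weights.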
 ### Stage 1 — Pretraining
 **Architecture chosen:** GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.
 | Item | Value |
 |---|---|
 | Hardware | 8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total) |
 | Throughput | **~220,000 tokens/sec sustained** (100% GPU utilization, all 8 GPUs in parallel) |
 | Framework | PyTorch + DeepSpeed ZeRO-2 |
 | Precision | bfloat16 |
 | **Total pretraining tokens** | **~90 billion tokens** |
 | Wall-clock pretraining time | ~5 days continuous (90B / 220K tok/s ≈ 4.7 days) |
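
The wall-clock estimate in the last row follows directly from the token budget and the sustained throughput. A minimal sketch of that arithmetic, assuming the throughput held constant with no restarts or evaluation pauses:

```python
total_tokens = 90e9          # ~90 billion pretraining tokens
tokens_per_second = 220_000  # sustained throughput across all 8 GPUs

seconds = total_tokens / tokens_per_second
days = seconds / 86_400      # seconds per day

print(f"{seconds:,.0f} s is about {days:.2f} days")
```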
 #### Pretraining data composition
 The pretraining corpus was assembled from **8 distinct sources**, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.
 | Dataset | Module | Type | Quality weight |
 |---|---|---|---|
 | **FineWeb general** | M1 BASE | Web | medium |
 | **FineWeb 10BT** | M1 Extra25 | Web high quality | medium |
 | **FineWeb EDU** | M2 BASE | Educational | **high** |
 | **FineWeb EDU extended** | M2 Extra | Educational reasoning | medium |
 | **C4 EN** | M1 C4 | Web filtered | medium |
 | **Wikipedia EN** | M1 BASE | Encyclopedic | low |
 | **Web Clean custom** | M1 BASE / Extra | Web filtered | low |
 | **News crawl** | M1 BASE | Journalistic | low |
 #### Mix proportions (approximate)
 - **60–65% FineWeb** (various slices: general, 10BT, EDU, EDU extended)
 - **15–20% C4 EN**
 - **5–10% Wikipedia EN**
 - **5–10% Web Clean custom**
 - **~5% News crawl**
 This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.
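
One way to picture such a mix is as weighted sampling over sources. The sketch below is only an illustration: it uses the midpoints of the approximate ranges above as weights (an assumption, since exact weights are not given), and the source keys and `sample_sources` helper are hypothetical names, not the actual data loader.

```python
import random

# Midpoints of the approximate ranges listed above (assumed, not the real config).
MIX_WEIGHTS = {
    "fineweb": 62.5,           # 60-65%, all FineWeb slices combined
    "c4_en": 17.5,             # 15-20%
    "wikipedia_en": 7.5,       # 5-10%
    "web_clean_custom": 7.5,   # 5-10%
    "news_crawl": 5.0,         # ~5%
}


def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n source names with probability proportional to MIX_WEIGHTS."""
    rng = random.Random(seed)
    names = list(MIX_WEIGHTS)
    weights = list(MIX_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(10_000)
for name in MIX_WEIGHTS:
    print(f"{name:18s} {draws.count(name) / len(draws):.1%}")
```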
 ### Stage 2 — Supervised Fine-Tuning (SFT)
 **Objective:** general instruction-following + factual knowledge alignment.
 **Data sources (~860K total examples):**
 - **OpenHermes** (cleaned)
 - **UltraChat** (cleaned)
 - **WildChat** (cleaned)
 - **Numina** (math reasoning)
 - **OpenThoughts** (chain-of-thought)
 - **Eurus** (multi-task)
 Composition: ~760K core + 100K augmentation examples. Sharded under `s3://ff-llm-datasets/sft/shards_v2/`.
 ### Stage 3 — Direct Preference Optimization (DPO) — **REJECTED**
 **Two DPO experiments were attempted and discarded:**
 | DPO variant | Pairs | Result |
 |---|---|---|
 | v1 — WizardLM/Alpaca preferences | 38,863 | **-3pp MMLU** → rejected |
 | v2 — UltraFeedback (argilla/ultrafeedback-binarized) | 60,917 | **-3pp MMLU** → rejected |
 **Lesson:** Both DPO runs reduced MMLU by about 3pp, regardless of hyperparameters. **Not used in the final model.**
 ### Stage 4 — Distillation v3
 Knowledge distillation from larger teacher models on a curated question set.
 | Item | Value |
 |---|---|
 | Total questions | 108,779 |
 | Source mix | hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%) |
 | S3 path | `s3://ff-llm-datasets/distill_v3/` |
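
The source-mix percentages can be turned into rough per-source question counts. The sketch below does that conversion; note that the rounded shares sum to 101%, so the derived counts are approximations and will not add up exactly to 108,779.

```python
total_questions = 108_779

# Rounded shares from the table above (they sum to 101 because of rounding).
share_pct = {
    "hellaswag": 37,
    "openhermes": 28,
    "mmlu": 14,
    "math": 12,
    "gsm8k": 7,
    "arc": 2,
    "truthfulqa": 1,
}

approx_counts = {name: round(total_questions * pct / 100) for name, pct in share_pct.items()}

for name, count in approx_counts.items():
    print(f"{name:12s} ~{count:,}")
```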
 ### Stage 5 — LoRA experiments — **REJECTED**
 Multiple LoRA fine-tuning attempts were tried for surgical improvements:
 | LoRA experiment | Examples | Result |
 |---|---|---|
 | LoRA v4b (synthetic instruction) | 6,000 | marginal, not promoted |
 | LoRA format-only v1/v3 | 1,779–2,092 | **catastrophic forgetting** (-3 to -4pp MMLU full) |
 **Lesson:** LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. **Not used in the final model.**
 ### Stage 6 — Surgical Fine-Tuning
 Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).
 | Item | Value |
 |---|---|
 | Examples | 3,000 |
 | Path | `D:\ff_llm\ff31_surgical.jsonl` |
 ### Stage 7 — Knowledge Repair Training (produces FF_3.13)
 The decisive stage that turned FF_3.11 into FF_3.13.
 **Dataset composition (16,006 total):**
 | Block | Description | Examples |
 |---|---|---|
 | Block A | MMLU-style MCQ (multiple choice questions across diverse subjects) | 10,714 |
 | Block B | Factual concise (TruthfulQA-like, <100 char answers) | 929 |
 | Block D | Numeric microreasoning (arithmetic word problems with step solutions) | 3,562 |
 | Validation set | held-out for monitoring | 801 |
 **Training configuration:**
 | Item | Value |
 |---|---|
 | Hardware | 8× NVIDIA RTX 5090 |
 | Framework | DeepSpeed ZeRO-2 |
 | Precision | bfloat16 |
 | Optimizer | AdamW |
 | Learning rate | 2.5e-6 (cosine schedule) |
 | Epochs | 3 (early-stopped at step 200/357) |
 | Effective batch size | not reported (run configured for 8-GPU data parallelism) |
 | Wall-clock | ~30 min total |
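
The effective batch size is not stated, but the step count constrains it. The sketch below is an inference, not a reported setting: it assumes that 357 is the planned number of optimizer steps for all 3 epochs, that training used the 15,205 non-validation examples, and that the last partial batch of each epoch was kept.

```python
import math

# Blocks A + B + D; the 801 validation examples are held out.
train_examples = 10_714 + 929 + 3_562
total_steps = 357   # planned optimizer steps for the full run
epochs = 3

steps_per_epoch = total_steps // epochs

# Effective batch sizes that would give exactly this many steps per epoch.
candidates = [
    b for b in range(1, 1025)
    if math.ceil(train_examples / b) == steps_per_epoch
]

print(f"{train_examples} examples, {steps_per_epoch} steps/epoch -> batch size {candidates}")
```

Under these assumptions only one batch size is consistent with the step count; a different dataloader setting (for example, dropping the last partial batch) would change the result.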
 **Checkpoint sweep & selection:**
 | Checkpoint | MMLU full | Status |
 |---|---|---|
 | 1-epoch ckpt-100 | 27.47% | not selected |
 | 3-epoch ckpt-50 | 27.21% | not selected |
 | 3-epoch ckpt-100 | 27.86% | not selected |
 | **3-epoch ckpt-150** | **28.05%** | **CHAMPION → FF_3.13** ✅ |
 | 3-epoch ckpt-200 | 28.17% | rejected (only +0.12pp over ckpt-150; regressed 5 of 6 weak domains) |
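
The selection among the 3-epoch checkpoints can be sketched as a simple threshold rule. This is a simplified illustration, not the author's actual script: it uses the 0.15pp stability threshold mentioned in the release notes, promotes a later checkpoint only when its MMLU gain over the current pick reaches that threshold, and does not model the weak-domain regression check. `select_champion` is a hypothetical helper name.

```python
STABILITY_THRESHOLD_PP = 0.15  # minimum MMLU gain needed to replace the current pick

# (checkpoint, MMLU full %) for the 3-epoch run, in training order.
sweep = [
    ("ckpt-50", 27.21),
    ("ckpt-100", 27.86),
    ("ckpt-150", 28.05),
    ("ckpt-200", 28.17),
]


def select_champion(checkpoints, threshold_pp=STABILITY_THRESHOLD_PP):
    """Walk the sweep in order; promote a checkpoint only if it beats the
    current pick by at least threshold_pp percentage points."""
    champion_name, champion_score = checkpoints[0]
    for name, score in checkpoints[1:]:
        if score - champion_score >= threshold_pp:
            champion_name, champion_score = name, score
    return champion_name, champion_score


print(select_champion(sweep))
```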
 ---
 ## Compute infrastructure summary
 | Resource | Specification |
 |---|---|
 | GPU | 8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total) |
 | GPU utilization | ~100% sustained during training |
 | Throughput (pretraining) | ~220,000 tokens/sec |
 | Distributed training | DeepSpeed ZeRO-2 |
 | Numerical precision | bfloat16 (training and inference) |
 | Cloud provider | Vast.ai |
 ---
 ## Recommended usage
 ### Prompt template (Alpaca-style)
 ```
 ### System:
 You are FF-LLM, a helpful assistant.
 ### Instruction:
 <your question>
 ### Response:
 ```
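
A small helper can keep the template consistent across calls. The sketch below is one possible way to fill it; `build_prompt` and `DEFAULT_SYSTEM` are illustrative names, and the newline placement simply follows the template as shown above.

```python
DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."


def build_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Fill the Alpaca-style template, ending right after the response header."""
    return (
        f"### System:\n{system}\n"
        f"### Instruction:\n{question.strip()}\n"
        "### Response:\n"
    )


print(build_prompt("What is the capital of France?"))
```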
 ### Decoding settings
 - **Always use greedy decoding** (`do_sample=False`).
 - Sampling degrades factual accuracy by about 5pp on this model family (roughly 29% at temperature 0.7 versus 34% with greedy decoding on factual tests).
 ### Quick start (transformers)
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 model_id = "francescofiamingo1/FF_3.13"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 # Load the weights in bfloat16, the precision the checkpoint was saved in.
 model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")
 # Alpaca-style template; the prompt ends right after the "### Response:" header.
 prompt = """### System:
 You are FF-LLM, a helpful assistant.
 ### Instruction:
 What is the capital of France?
 ### Response:
 """
 inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
 # Greedy decoding, as recommended above.
 outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
 # The decoded string contains the prompt followed by the completion.
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
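
Because the decoded output repeats the prompt, callers usually want only the text after the response header. The sketch below shows one way to do that with plain string handling. `extract_response` is a hypothetical helper, and cutting at the next `###` header is an assumption about how to handle a model that keeps generating further template sections.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return the text after the first response header, dropping any
    further '###' sections the model may have generated."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    if not found:
        # No template marker present: return the text unchanged (trimmed).
        return decoded.strip()
    # Stop at the next template header, if the model produced one.
    return tail.split("\n###", 1)[0].strip()


decoded = (
    "### System:\nYou are FF-LLM, a helpful assistant.\n"
    "### Instruction:\nWhat is the capital of France?\n"
    "### Response:\nParis\n### Instruction:\nNext question"
)
print(extract_response(decoded))
```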
 ---
 ## Limitations and known weaknesses
 - **2B parameters** — knowledge ceiling lower than 7B+ models
 - **Format compliance moderate**: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
 - **Entity disambiguation weakness**: occasional "anchor entity" over-attribution (e.g., defaulting to Edison for inventor questions)
 - **Weak domains** (per qualitative analysis): mathematics, literature, music, art
 - **Strong domains**: biology, geography, basic science, factual short-form QA
 ---
 ## Variant lineage
 | Variant | Status | Notes |
 |---|---|---|
 | FF_3 | base | initial release |
 | FF_3.1 | published | post-SFT, MMLU 26.72% |
 | FF_3.2 | **discontinued** | early experiment, not maintained |
 | FF_3.11 | published | specialized variant, MMLU 25.20%, 106-bench 71% |
 | **FF_3.13** | **current champion** | knowledge repair on FF_3.11 base, MMLU 28.05% |
 | FF_3.14 | **rejected** | full SFT with humanities focus, MMLU flat (no improvement) |
 | SLERP t=0.10 (FF_3.13 + FF_3.11) | candidate backup | MMLU 29.10% (+0.41pp), 106-bench tie |
 ---
 ## Reproducibility
 All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:
 - Datasets: `s3://ff-llm-datasets/`
 - Champion model: `s3://ff-llm-datasets/champions/latest/`
 - Build scripts: `s3://ff-llm-datasets/ff314/build/` (includes Block E/F builders, anti-anchoring tables, philosophy seeds)
 For reproduction support, contact the author.
 ---
 ## Citation
 ```bibtex
@misc{ff_3_13_2026,
  author       = {francescofiamingo1},
  title        = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
 }
 ```
 ---
 *Last updated: 2026-04-18*
--- a/RELEASE_NOTES.txt
+++ b/RELEASE_NOTES.txt
@@ -0,0 +1,86 @@
 FF_3.13 — Release Notes
 =========================
 Model:   FF_3.13
 Source:  ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448)
          Source path: /workspace/ff311_repair_output/checkpoint-150
 Status:  CHAMPION (supersedes FF_3.11 as primary FF-LLM release)
 Date:    2026-04-17
 Architecture
 ------------
 GPT-2 decoder-only, 2.02B parameters
 n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048
 Vocabulary: GPT-2 BPE, 50257 tokens
 Precision: bf16
 Training summary
 ----------------
 Base:           FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20)
 Hardware:       8x RTX 5090 (DeepSpeed ZeRO-2)
 Dataset:        15,205 train + 801 val examples (A=10714 MCQ / B=929 factual / D=3562 numeric)
 Hyperparams:    lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing
 Epochs:         3 (early-stopped at step 200/357 via no-improvement rule)
 Selected step:  150
 Main benchmark notes
 --------------------
 MMLU full (lm-eval-harness v0.4.11):   28.05%
  - vs FF_3.11 baseline (25.20%):      +2.85pp
  - vs FF_3.1 baseline (26.72%):       +1.33pp
  - social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48%
 106-bench total (greedy, rep_penalty=1.0):  74.5%
  - arith:    100.0% (vs FF_3.11 80.0%)
  - science:  84.0%  (vs FF_3.11 80.0%)
  - geo:      64.0%  (vs FF_3.11 56.0%)
  - person:   88.0%  (tied)
  - format:   56.0%  (vs FF_3.11 72.0% — known regression)
 Strengths
 ---------
 - arithmetic / science / geo factual recall
 - recovery of previously damaged MMLU domains (prof_medicine 43.38%, hs_statistics 38.43%, security 39.59%,
  hs_macroeconomics 34.62%, hs_government 32.64%, medical_genetics 26.00%)
 Known gaps
 ----------
 - Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts)
 - Humanities / art / entity disambiguation (e.g., Edison over-anchoring)
 - Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples
 Rejected alternative
 --------------------
 ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected:
 - gain below 0.15pp stability threshold
 - degraded 5 of 6 weak domains
 Prompt template (required)
 --------------------------
 ### System:
 You are FF-LLM, a helpful assistant.
 ### Instruction:
 {question}
 ### Response:
 Decoding recommendations
 ------------------------
 Use greedy (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0).
 Sampling at temperature 0.7 underperforms greedy on factual tests (~29% vs ~34%).
 Storage
 -------
 S3 primary:   s3://ff-llm-datasets/ff313/final/
 S3 alias:     s3://ff-llm-datasets/champions/latest/
 HuggingFace:  francescofiamingo1/FF_3.13
 Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts)
 Local infer:  C:\Users\f_fia\FF_3.13_inference\ (inference-only subset)
 Excluded from master
 --------------------
 /workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2
 optimizer shards). Not preserved — useful only for resuming DeepSpeed training from this
 exact step; weights are intact in model.safetensors. Conservative exclusion to avoid
 disproportionate storage cost.
--- a/config.json
+++ b/config.json
@@ -0,0 +1,35 @@
 {
  "activation_function": "gelu_new",
  "add_cross_attention": false,
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "bfloat16",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 2048,
  "n_embd": 2048,
  "n_head": 16,
  "n_inner": 8192,
  "n_layer": 38,
  "n_positions": 2048,
  "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "tie_word_embeddings": true,
  "transformers_version": "5.5.4",
  "use_cache": false,
  "vocab_size": 50257
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,9 @@
 {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": [
    50256
  ],
  "pad_token_id": 50256,
  "transformers_version": "5.5.4"
 }
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:ec7e2c09a7fcb7d400d84d126b3a95a3eb6c40e229be0cece611ba267944a9aa
 size 4247379664
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,30 @@
 {
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
 }
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,13 @@
 {
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "is_local": true,
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
 }
--- a/vocab.json
+++ b/vocab.json