Initialize project; model provided by the ModelHub XC community

Model: francescofiamingo1/FF_3.13 Source: Original Platform
2026-04-27 07:28:11 +08:00
commit d72c5d9225
11 changed files with 351125 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,328 @@
+---
+language:
+- en
+license: apache-2.0
+library_name: transformers
+tags:
+- text-generation
+- gpt2
+- causal-lm
+- fine-tuned
+- knowledge-repair
+pipeline_tag: text-generation
+model_type: gpt2
+---
+
+# FF_3.13
+
+> **Champion model** of the FF-LLM line. A 2.02B-parameter GPT-2 architecture model trained through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.
+
+---
+
+## Model overview
+
+| | |
+|---|---|
+| **Architecture** | GPT-2 (causal LM) |
+| **Parameters** | 2.02B |
+| **Hidden size** | 2048 |
+| **Layers** | 38 |
+| **Attention heads** | 16 |
+| **Vocab size** | 50,257 |
+| **Tokenizer** | GPT-2 BPE |
+| **Context length** | 1024 tokens |
+| **Precision** | bfloat16 (also fp16/fp32 compatible) |
+| **License** | Apache 2.0 |
+| **Author** | francescofiamingo1 |
+
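+The 2.02B figure can be cross-checked against the hidden size, layer count, vocabulary, and context length in the table. The sketch below assumes the standard GPT-2 block layout (4x MLP expansion, biases, two LayerNorms per block) and tied input/output embeddings; neither is stated in this card, and the variable names are illustrative.
+
+```python
+# Estimate the parameter count of a GPT-2-style model from the overview table.
+# Assumption: standard GPT-2 layout with tied input/output embeddings.
+vocab_size, n_ctx, d_model, n_layer = 50_257, 1_024, 2_048, 38
+
+embeddings = vocab_size * d_model + n_ctx * d_model   # token + position tables
+attention = 4 * d_model * d_model + 4 * d_model       # QKV + output projection, with biases
+mlp = 8 * d_model * d_model + 5 * d_model             # up (d -> 4d) and down (4d -> d), with biases
+layer_norms = 2 * 2 * d_model                         # two LayerNorms (weight + bias) per block
+per_block = attention + mlp + layer_norms
+
+total = embeddings + n_layer * per_block + 2 * d_model  # plus the final LayerNorm
+print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
+```
+
+Under these assumptions the estimate rounds to 2.02B, matching the table.
+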
+---
+
+## Benchmark performance
+
+### MMLU (Massive Multitask Language Understanding)
+
+Evaluated with `lm-eval-harness v0.4.11`, greedy decoding.
+
+| Split | Score |
+|---|---|
+| **MMLU full (14,042 items)** | **28.05%** |
+| MMLU dev (285 items) | 25.61% |
+
+### Macro-domain breakdown (MMLU full)
+
+| Macro | Subjects | Accuracy |
+|---|---|---|
+| STEM | 19 subjects | 30.70% |
+| Humanities | 13 subjects | 26.06% |
+| Social Sciences | 12 subjects | 30.03% |
+| Other (medicine, law, professional) | 13 subjects | 29.32% |
+
+### 106-bench (custom factual benchmark)
+
+Custom 106-prompt benchmark with strict TRUTH-list scoring:
+
+| Category | N | Score |
+|---|---|---|
+| arithmetic | 5 | 5/5 (100.0%) |
+| open-ended | 1 | 1/1 (100.0%) |
+| person | 25 | 22/25 (88.0%) |
+| science | 25 | 21/25 (84.0%) |
+| geography | 25 | 15/25 (60.0%) |
+| format compliance | 25 | 15/25 (60.0%) |
+| **TOTAL** | **106** | **79/106 (74.5%)** |
+
+### Improvement vs precursors
+
+| Model | MMLU full | Δ vs FF_3.13 |
+|---|---|---|
+| FF_3 (base, original release) | — | — |
+| FF_3.1 (post-SFT) | 26.72% | -1.33pp |
+| FF_3.11 (specialized variant) | 25.20% | -2.85pp |
+| **FF_3.13 (this model)** | **28.05%** | **— champion** |
+
+---
+
+## Training pipeline — chronological view
+
+The pipeline comprised **7 distinct stages**, two of which (DPO and LoRA) were evaluated and rejected. The complete history follows.
+
+### Stage 1 — Pretraining
+
+**Architecture chosen:** GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.
+
+| Item | Value |
+|---|---|
+| Hardware | 8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total) |
+| Throughput | **~220,000 tokens/sec sustained** (100% GPU utilization, all 8 GPUs in parallel) |
+| Framework | PyTorch + DeepSpeed ZeRO-2 |
+| Precision | bfloat16 |
+| **Total pretraining tokens** | **~90 billion tokens** |
+| Wall-clock pretraining time | ~5 days continuous (90B / 220K tok/s ≈ 4.7 days) |
+
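+As a sanity check on the wall-clock figure, the arithmetic in the table can be reproduced in a few lines. The inputs are the token count and throughput quoted above; the variable names are illustrative, and the per-GPU figure assumes the load was split evenly across the 8 GPUs.
+
+```python
+# Reproduce the pretraining wall-clock estimate from the table above.
+total_tokens = 90e9        # ~90 billion pretraining tokens
+tokens_per_sec = 220_000   # sustained throughput across all 8 GPUs
+num_gpus = 8
+
+seconds = total_tokens / tokens_per_sec
+days = seconds / 86_400                # 86,400 seconds per day
+per_gpu = tokens_per_sec / num_gpus    # assumes an even split across GPUs
+
+print(f"{days:.2f} days, {per_gpu:,.0f} tokens/sec per GPU")
+```
+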
+#### Pretraining data composition
+
+The pretraining corpus was assembled from **8 distinct sources**, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.
+
+| Dataset | Module | Type | Quality weight |
+|---|---|---|---|
+| **FineWeb general** | M1 BASE | Web | medium |
+| **FineWeb 10BT** | M1 Extra25 | Web high quality | medium |
+| **FineWeb EDU** | M2 BASE | Educational | **high** |
+| **FineWeb EDU extended** | M2 Extra | Educational reasoning | medium |
+| **C4 EN** | M1 C4 | Web filtered | medium |
+| **Wikipedia EN** | M1 BASE | Encyclopedic | low |
+| **Web Clean custom** | M1 BASE / Extra | Web filtered | low |
+| **News crawl** | M1 BASE | Journalistic | low |
+
+#### Mix proportions (approximate)
+
+- **60–65% FineWeb** (various slices: general, 10BT, EDU, EDU extended)
+- **15–20% C4 EN**
+- **5–10% Wikipedia EN**
+- **5–10% Web Clean custom**
+- **~5% News crawl**
+
+This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.
+
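+The card does not say how the mix was enforced during training (per-document sampling, token quotas, or something else). As a hypothetical illustration only, the sketch below draws source names with probability proportional to the midpoints of the approximate ranges listed above; the dictionary keys and the function name are invented for this example.
+
+```python
+import random
+
+# Midpoints of the approximate ranges listed above, in percent (illustrative only).
+MIX = {
+    "fineweb": 62.5,
+    "c4_en": 17.5,
+    "wikipedia_en": 7.5,
+    "web_clean_custom": 7.5,
+    "news_crawl": 5.0,
+}
+
+def sample_sources(n, seed=0):
+    """Draw n source names with probability proportional to the mix weights."""
+    rng = random.Random(seed)
+    names = list(MIX)
+    weights = list(MIX.values())
+    return rng.choices(names, weights=weights, k=n)
+
+draws = sample_sources(10_000)
+for name in MIX:
+    print(f"{name:18s} {draws.count(name) / len(draws):.1%}")
+```
+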
+### Stage 2 — Supervised Fine-Tuning (SFT)
+
+**Objective:** general instruction-following + factual knowledge alignment.
+
+**Data sources (~860K total examples):**
+- **OpenHermes** (cleaned)
+- **UltraChat** (cleaned)
+- **WildChat** (cleaned)
+- **Numina** (math reasoning)
+- **OpenThoughts** (chain-of-thought)
+- **Eurus** (multi-task)
+
+Composition: ~760K core + 100K augmentation examples. Sharded under `s3://ff-llm-datasets/sft/shards_v2/`.
+
+### Stage 3 — Direct Preference Optimization (DPO) — **REJECTED**
+
+**Two DPO experiments were attempted and discarded:**
+
+| DPO variant | Pairs | Result |
+|---|---|---|
+| v1 — WizardLM/Alpaca preferences | 38,863 | **-3pp MMLU** → rejected |
+| v2 — UltraFeedback (argilla/ultrafeedback-binarized) | 60,917 | **-3pp MMLU** → rejected |
+
+**Lesson:** Both DPO variants regressed MMLU by about 3pp, regardless of hyperparameters. **Not used in the final model.**
+
+### Stage 4 — Distillation v3
+
+Knowledge distillation from larger teacher models on a curated question set.
+
+| Item | Value |
+|---|---|
+| Total questions | 108,779 |
+| Source mix | hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%) |
+| S3 path | `s3://ff-llm-datasets/distill_v3/` |
+
+### Stage 5 — LoRA experiments — **REJECTED**
+
+Several LoRA fine-tuning runs were attempted for targeted ("surgical") improvements:
+
+| LoRA experiment | Examples | Result |
+|---|---|---|
+| LoRA v4b (synthetic instruction) | 6,000 | marginal, not promoted |
+| LoRA format-only v1/v3 | 1,779–2,092 | **catastrophic forgetting** (-3 to -4pp MMLU full) |
+
+**Lesson:** LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. **Not used in the final model.**
+
+### Stage 6 — Surgical Fine-Tuning
+
+Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).
+
+| Item | Value |
+|---|---|
+| Examples | 3,000 |
+| Path | `D:\ff_llm\ff31_surgical.jsonl` |
+
+### Stage 7 — Knowledge Repair Training (produces FF_3.13)
+
+This is the decisive stage that turned FF_3.11 into FF_3.13.
+
+**Dataset composition (16,006 total):**
+
+| Block | Description | Examples |
+|---|---|---|
+| Block A | MMLU-style MCQ (multiple choice questions across diverse subjects) | 10,714 |
+| Block B | Factual concise (TruthfulQA-like, <100 char answers) | 929 |
+| Block D | Numeric microreasoning (arithmetic word problems with step solutions) | 3,562 |
+| Validation set | held-out for monitoring | 801 |
+
+**Training configuration:**
+
+| Item | Value |
+|---|---|
+| Hardware | 8× NVIDIA RTX 5090 |
+| Framework | DeepSpeed ZeRO-2 |
+| Precision | bfloat16 |
+| Optimizer | AdamW |
+| Learning rate | 2.5e-6 (cosine schedule) |
+| Epochs | 3 (early-stopped at step 200/357) |
+| Effective batch size | not stated (configured for 8-GPU data-parallel training) |
+| Wall-clock | ~30 min total |
+
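+The effective batch size can be estimated from the other figures in this section. The sketch below is an inference, not a reported value: it assumes the validation set was held out of training, that the 357 scheduled steps cover exactly 3 epochs, that the global batch size is a power of two, and that the last partial batch of each epoch is kept.
+
+```python
+import math
+
+# Counts from the dataset composition table; the validation set is assumed
+# to be held out, so only blocks A, B and D count as training data.
+train_examples = 10_714 + 929 + 3_562      # 15,205
+total_steps, epochs = 357, 3
+steps_per_epoch = total_steps // epochs    # 119
+
+# Power-of-two global batch sizes that yield that many steps per epoch,
+# assuming the last partial batch of each epoch is kept.
+candidates = [
+    b for b in (2 ** k for k in range(3, 12))
+    if math.ceil(train_examples / b) == steps_per_epoch
+]
+print(train_examples, steps_per_epoch, candidates)
+```
+
+Under these assumptions the only consistent value is a global batch of 128 sequences per optimizer step.
+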
+**Checkpoint sweep & selection:**
+
+| Checkpoint | MMLU full | Status |
+|---|---|---|
+| 1-epoch ckpt-100 | 27.47% | not selected |
+| 3-epoch ckpt-50 | 27.21% | not selected |
+| 3-epoch ckpt-100 | 27.86% | not selected |
+| **3-epoch ckpt-150** | **28.05%** | **CHAMPION → FF_3.13** ✅ |
+| 3-epoch ckpt-200 | 28.17% | rejected (marginal +0.12pp gain; regressed on 5 of 6 weak domains) |
+
+---
+
+## Compute infrastructure summary
+
+| Resource | Specification |
+|---|---|
+| GPU | 8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total) |
+| GPU utilization | ~100% sustained during training |
+| Throughput (pretraining) | ~220,000 tokens/sec |
+| Distributed training | DeepSpeed ZeRO-2 |
+| Numerical precision | bfloat16 (training and inference) |
+| Cloud provider | Vast.ai |
+
+---
+
+## Recommended usage
+
+### Prompt template (Alpaca-style)
+
+```
+### System:
+You are FF-LLM, a helpful assistant.
+
+### Instruction:
+<your question>
+
+### Response:
+```
+
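+To avoid retyping the template, a small helper can assemble it. This is a minimal sketch: `build_prompt` and `DEFAULT_SYSTEM` are hypothetical names, not part of the released model or of any library, and the trailing newline after `### Response:` follows the quick-start example in this card.
+
+```python
+DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."
+
+def build_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
+    """Format a question with the Alpaca-style template shown above."""
+    return (
+        f"### System:\n{system}\n\n"
+        f"### Instruction:\n{instruction.strip()}\n\n"
+        "### Response:\n"
+    )
+
+print(build_prompt("What is the capital of France?"))
+```
+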
+### Decoding settings
+
+- **Always use greedy decoding** (`do_sample=False`).
+- Sampling has been shown to degrade factual accuracy by ~5pp on this model family.
+
+### Quick start (transformers)
+
+```python
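+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_id = "francescofiamingo1/FF_3.13"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Load the weights in bfloat16 on the GPU. Older transformers releases spell the
+# `dtype` argument `torch_dtype`; `device_map` requires the `accelerate` package.
+model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")
+
+# Alpaca-style prompt, matching the template above.
+prompt = """### System:
+You are FF-LLM, a helpful assistant.
+
+### Instruction:
+What is the capital of France?
+
+### Response:
+"""
+
+inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+# Greedy decoding (do_sample=False), as recommended above.
+outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
+# The decoded text contains the prompt followed by the generated answer.
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))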
+```
+
+---
+
+## Limitations and known weaknesses
+
+- **2B parameters** — knowledge ceiling lower than 7B+ models
+- **Format compliance moderate**: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
+- **Entity disambiguation weakness**: occasional "anchor entity" over-attribution (e.g., defaulting to Edison for inventor questions)
+- **Weak domains** (per qualitative analysis): mathematics, literature, music, art
+- **Strong domains**: biology, geography, basic science, factual short-form QA
+
+---
+
+## Variant lineage
+
+| Variant | Status | Notes |
+|---|---|---|
+| FF_3 | base | initial release |
+| FF_3.1 | published | post-SFT, MMLU 26.72% |
+| FF_3.2 | **discontinued** | early experiment, not maintained |
+| FF_3.11 | published | specialized variant, MMLU 25.20%, 106-bench 71% |
+| **FF_3.13** | **current champion** | knowledge repair on FF_3.11 base, MMLU 28.05% |
+| FF_3.14 | **rejected** | full SFT with humanities focus, MMLU flat (no improvement) |
+| SLERP t=0.10 (FF_3.13 + FF_3.11) | candidate backup | MMLU 28.46% (+0.41pp), 106-bench tie |
+
+---
+
+## Reproducibility
+
+All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:
+
+- Datasets: `s3://ff-llm-datasets/`
+- Champion model: `s3://ff-llm-datasets/champions/latest/`
+- Build scripts: `s3://ff-llm-datasets/ff314/build/` (includes Block E/F builders, anti-anchoring tables, philosophy seeds)
+
+For reproduction support, contact the author.
+
+---
+
+## Citation
+
+```bibtex
+@misc{ff_3_13_2026,
+  author       = {francescofiamingo1},
+  title        = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
+  year         = {2026},
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
+}
+```
+
+---
+
+*Last updated: 2026-04-18*