---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-generation
- gpt2
- causal-lm
- fine-tuned
- knowledge-repair
pipeline_tag: text-generation
model_type: gpt2
---
# FF_3.13
> **Champion model** of the FF-LLM line. A 2.02B-parameter model with the GPT-2 architecture, trained through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.
---
## Model overview
| | |
|---|---|
| **Architecture** | GPT-2 (causal LM) |
| **Parameters** | 2.02B |
| **Hidden size** | 2048 |
| **Layers** | 38 |
| **Attention heads** | 16 |
| **Vocab size** | 50,257 |
| **Tokenizer** | GPT-2 BPE |
| **Context length** | 1024 tokens |
| **Precision** | bfloat16 (also fp16/fp32 compatible) |
| **License** | Apache 2.0 |
| **Author** | francescofiamingo1 |
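
The parameter count can be sanity-checked from the other rows of the table. The sketch below assumes a standard GPT-2 block layout (4x MLP expansion, biases on every linear layer, LM head tied to the token embedding); none of these details are stated in this card, so treat the result as an estimate rather than the exact checkpoint size.

```python
# Rough parameter count for a standard GPT-2 stack, using the table's values.
# Assumptions (not stated in the card): 4x MLP expansion, biases on all
# linear layers, and an LM head tied to the token embedding.
hidden = 2048
layers = 38
vocab = 50_257
context = 1024


def gpt2_param_count(hidden: int, layers: int, vocab: int, context: int) -> int:
    attn = hidden * 3 * hidden + 3 * hidden   # fused QKV projection (weights + bias)
    attn += hidden * hidden + hidden          # attention output projection
    mlp = hidden * 4 * hidden + 4 * hidden    # MLP up-projection
    mlp += 4 * hidden * hidden + hidden       # MLP down-projection
    norms = 2 * 2 * hidden                    # two LayerNorms per block (scale + bias)
    per_block = attn + mlp + norms
    embeddings = vocab * hidden + context * hidden  # token + position embeddings
    final_norm = 2 * hidden
    return layers * per_block + embeddings + final_norm


total = gpt2_param_count(hidden, layers, vocab, context)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```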
---
## Benchmark performance
### MMLU (Massive Multitask Language Understanding)
Evaluated with `lm-eval-harness v0.4.11`, greedy decoding.
| Split | Score |
|---|---|
| **MMLU full (14,042 items)** | **28.05%** |
| MMLU dev (285 items) | 25.61% |
### Macro-domain breakdown (MMLU full)
| Macro | Subjects | Accuracy |
|---|---|---|
| STEM | 19 subjects | 30.70% |
| Humanities | 13 subjects | 26.06% |
| Social Sciences | 12 subjects | 30.03% |
| Other (medicine, law, professional) | 13 subjects | 29.32% |
### 106-bench (custom factual benchmark)
Custom 106-prompt benchmark with strict TRUTH-list scoring:
| Category | N | Score |
|---|---|---|
| arithmetic | 5 | 5/5 (100.0%) |
| open-ended | 1 | 1/1 (100.0%) |
| person | 25 | 22/25 (88.0%) |
| science | 25 | 21/25 (84.0%) |
| geography | 25 | 15/25 (60.0%) |
| format compliance | 25 | 15/25 (60.0%) |
| **TOTAL** | **106** | **79/106 (74.5%)** |
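
The TOTAL row is a pooled score over all prompts, not a mean of the category percentages. A short sketch that recomputes both from the counts in the table (the dictionary keys are just labels chosen for this example):

```python
# (correct, total) per category, copied from the 106-bench table above.
results = {
    "arithmetic": (5, 5),
    "open-ended": (1, 1),
    "person": (22, 25),
    "science": (21, 25),
    "geography": (15, 25),
    "format compliance": (15, 25),
}

correct = sum(c for c, _ in results.values())
total = sum(n for _, n in results.values())
pooled = 100 * correct / total  # every prompt counts equally
per_category_mean = sum(100 * c / n for c, n in results.values()) / len(results)

for name, (c, n) in results.items():
    print(f"{name:<18} {c}/{n} ({100 * c / n:.1f}%)")
print(f"pooled: {correct}/{total} ({pooled:.1f}%)")
print(f"mean of category scores: {per_category_mean:.1f}%")
```

The two small categories (arithmetic, open-ended) pull the per-category mean well above the pooled score, which is why the pooled figure is the one reported.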
### Improvement vs precursors
| Model | MMLU full | Δ vs FF_3.13 |
|---|---|---|
| FF_3 (base, original release) | — | — |
| FF_3.1 (post-SFT) | 26.72% | -1.33pp |
| FF_3.11 (specialized variant) | 25.20% | -2.85pp |
| **FF_3.13 (this model)** | **28.05%** | **— champion** |
---
## Training pipeline — chronological view
Development went through **7 distinct stages**, two of which (Stage 3, DPO, and Stage 5, LoRA) were rejected and left out of the final model. The complete history follows.
### Stage 1 — Pretraining
**Architecture chosen:** GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.
| Item | Value |
|---|---|
| Hardware | 8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total) |
| Throughput | **~220,000 tokens/sec sustained** (100% GPU utilization, all 8 GPUs in parallel) |
| Framework | PyTorch + DeepSpeed ZeRO-2 |
| Precision | bfloat16 |
| **Total pretraining tokens** | **~90 billion tokens** |
| Wall-clock pretraining time | ~5 days continuous (90B / 220K tok/s ≈ 4.7 days) |
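
The wall-clock figure follows directly from the token budget and the sustained throughput. A minimal check using only the numbers in the table (the per-GPU figure assumes the load is split evenly across the 8 GPUs):

```python
# Values from the pretraining table above.
total_tokens = 90e9          # ~90 billion tokens
tokens_per_second = 220_000  # sustained, all 8 GPUs combined
num_gpus = 8

seconds = total_tokens / tokens_per_second
days = seconds / 86_400
per_gpu = tokens_per_second / num_gpus  # assumes an even split across GPUs

print(f"{seconds:,.0f} s = {days:.1f} days")
print(f"{per_gpu:,.0f} tokens/sec per GPU")
```

The "~5 days continuous" in the table is this 4.7-day figure rounded up.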
#### Pretraining data composition
The pretraining corpus was assembled from **8 distinct sources**, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.
| Dataset | Module | Type | Quality weight |
|---|---|---|---|
| **FineWeb general** | M1 BASE | Web | medium |
| **FineWeb 10BT** | M1 Extra25 | Web high quality | medium |
| **FineWeb EDU** | M2 BASE | Educational | **high** |
| **FineWeb EDU extended** | M2 Extra | Educational reasoning | medium |
| **C4 EN** | M1 C4 | Web filtered | medium |
| **Wikipedia EN** | M1 BASE | Encyclopedic | low |
| **Web Clean custom** | M1 BASE / Extra | Web filtered | low |
| **News crawl** | M1 BASE | Journalistic | low |
#### Mix proportions (approximate)
- **60-65% FineWeb** (various slices: general, 10BT, EDU, EDU extended)
- **15-20% C4 EN**
- **5-10% Wikipedia EN**
- **5-10% Web Clean custom**
- **~5% News crawl**
This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.
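
One way to turn approximate proportions like these into a sampling schedule is to normalize them into weights and draw the source of each batch at random. The sketch below is illustrative only: it uses the midpoint of each listed range (60-65% FineWeb, 15-20% C4 EN, 5-10% Wikipedia EN, 5-10% Web Clean custom, ~5% News crawl) as an assumption, and the source keys and the `sample_sources` helper are hypothetical names, not the author's data loader.

```python
import random

# Midpoints of the approximate ranges listed above (an assumption: the card
# gives ranges, not exact sampling weights).
mix_percent = {
    "fineweb": 62.5,
    "c4_en": 17.5,
    "wikipedia_en": 7.5,
    "web_clean_custom": 7.5,
    "news_crawl": 5.0,
}

total_percent = sum(mix_percent.values())
weights = {name: pct / total_percent for name, pct in mix_percent.items()}


def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n source names according to the normalized mix weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


draws = sample_sources(10_000)
for name, w in weights.items():
    observed = draws.count(name) / len(draws)
    print(f"{name:<18} target={w:.3f} observed={observed:.3f}")
```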
### Stage 2 — Supervised Fine-Tuning (SFT)
**Objective:** general instruction-following + factual knowledge alignment.
**Data sources (~860K total examples):**
- **OpenHermes** (cleaned)
- **UltraChat** (cleaned)
- **WildChat** (cleaned)
- **Numina** (math reasoning)
- **OpenThoughts** (chain-of-thought)
- **Eurus** (multi-task)
Composition: ~760K core + 100K augmentation examples. Sharded under `s3://ff-llm-datasets/sft/shards_v2/`.
### Stage 3 — Direct Preference Optimization (DPO) — **REJECTED**
**Two DPO experiments were attempted and discarded:**
| DPO variant | Pairs | Result |
|---|---|---|
| v1 — WizardLM/Alpaca preferences | 38,863 | **-3pp MMLU** → rejected |
| v2 — UltraFeedback (argilla/ultrafeedback-binarized) | 60,917 | **-3pp MMLU** → rejected |
**Lesson:** Both DPO variants caused an MMLU regression of about 3pp, regardless of hyperparameters. **Not used in the final model.**
### Stage 4 — Distillation v3
Knowledge distillation from larger teacher models on a curated question set.
| Item | Value |
|---|---|
| Total questions | 108,779 |
| Source mix | hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%) |
| S3 path | `s3://ff-llm-datasets/distill_v3/` |
### Stage 5 — LoRA experiments — **REJECTED**
Several LoRA fine-tuning runs were attempted for surgical improvements:
| LoRA experiment | Examples | Result |
|---|---|---|
| LoRA v4b (synthetic instruction) | 6,000 | marginal, not promoted |
| LoRA format-only v1/v3 | 1,779-2,092 | **catastrophic forgetting** (-3 to -4pp MMLU full) |
**Lesson:** LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. **Not used in the final model.**
### Stage 6 — Surgical Fine-Tuning
Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).
| Item | Value |
|---|---|
| Examples | 3,000 |
| Path | `D:\ff_llm\ff31_surgical.jsonl` |
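
The four output disciplines named above can each be checked mechanically. The validators below are a hypothetical sketch of what such checks might look like; they are not the scoring code used for this model, and the exact acceptance rules (for example, allowing a trailing period on yes/no answers) are assumptions.

```python
import re


def is_yes_no(text: str) -> bool:
    """Accept exactly 'yes' or 'no', ignoring case and a trailing period."""
    return text.strip().lower().rstrip(".") in {"yes", "no"}


def is_single_letter(text: str, choices: str = "ABCD") -> bool:
    """Accept a single MCQ letter and nothing else."""
    answer = text.strip()
    return len(answer) == 1 and answer in choices


def is_numeric_only(text: str) -> bool:
    """Accept a bare integer or decimal, optionally signed."""
    return re.fullmatch(r"[-+]?\d+(\.\d+)?", text.strip()) is not None


def is_exact_n_list(text: str, n: int) -> bool:
    """Accept a response with exactly n non-empty lines (one item per line)."""
    items = [line for line in text.strip().splitlines() if line.strip()]
    return len(items) == n


print(is_yes_no("Yes."), is_single_letter("B"), is_numeric_only("42"))
print(is_exact_n_list("red\ngreen\nblue", 3))
```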
### Stage 7 — Knowledge Repair Training (produces FF_3.13)
The decisive stage that turned FF_3.11 into FF_3.13.
**Dataset composition (16,006 total):**
| Block | Description | Examples |
|---|---|---|
| Block A | MMLU-style MCQ (multiple choice questions across diverse subjects) | 10,714 |
| Block B | Factual concise (TruthfulQA-like, <100 char answers) | 929 |
| Block D | Numeric microreasoning (arithmetic word problems with step solutions) | 3,562 |
| Validation set | held-out for monitoring | 801 |
**Training configuration:**
| Item | Value |
|---|---|
| Hardware | 8× NVIDIA RTX 5090 |
| Framework | DeepSpeed ZeRO-2 |
| Precision | bfloat16 |
| Optimizer | AdamW |
| Learning rate | 2.5e-6 (cosine schedule) |
| Epochs | 3 (early-stopped at step 200/357) |
| Effective batch size | (configured for 8-GPU DDP) |
| Wall-clock | ~30 min total |
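
For reference, a cosine schedule with the peak learning rate and step count from the table can be sketched as follows. The card does not say whether warmup was used or what the final learning rate was, so this version assumes no warmup and decay to zero over the full 357 planned steps.

```python
import math

PEAK_LR = 2.5e-6    # from the table
TOTAL_STEPS = 357   # planned steps for 3 epochs, from the table
EPOCHS = 3


def cosine_lr(step: int, peak: float = PEAK_LR, total: int = TOTAL_STEPS,
              min_lr: float = 0.0) -> float:
    """Cosine decay from `peak` to `min_lr` (assumes no warmup)."""
    progress = min(max(step / total, 0.0), 1.0)
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


steps_per_epoch = TOTAL_STEPS // EPOCHS
print("steps per epoch:", steps_per_epoch)
for step in (0, 50, 100, 150, 200, TOTAL_STEPS):
    print(f"step {step:>3}: lr = {cosine_lr(step):.3e}")
```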
**Checkpoint sweep & selection:**
| Checkpoint | MMLU full | Status |
|---|---|---|
| 1-epoch ckpt-100 | 27.47% | not selected |
| 3-epoch ckpt-50 | 27.21% | not selected |
| 3-epoch ckpt-100 | 27.86% | not selected |
| **3-epoch ckpt-150** | **28.05%** | **CHAMPION → FF_3.13** |
| 3-epoch ckpt-200 | 28.17% | rejected (+0.12pp marginal, regressed 5/6 weak domains) |
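
The selection logic in the table (highest MMLU, unless the checkpoint regresses too many weak domains) can be written as a simple veto-then-rank rule. This is a sketch of that reasoning, not the author's script: the `max_regressed` threshold is hypothetical, and weak-domain regression counts are only reported for ckpt-200, so the other checkpoints are marked as not reported.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    name: str
    mmlu_full: float                      # percent, from the table
    weak_regressed: Optional[int] = None  # weak domains regressed (of 6); None = not reported


CANDIDATES = [
    Checkpoint("1-epoch ckpt-100", 27.47),
    Checkpoint("3-epoch ckpt-50", 27.21),
    Checkpoint("3-epoch ckpt-100", 27.86),
    Checkpoint("3-epoch ckpt-150", 28.05),
    Checkpoint("3-epoch ckpt-200", 28.17, weak_regressed=5),
]


def select_champion(candidates, max_regressed: int = 2) -> Checkpoint:
    """Drop checkpoints with too many weak-domain regressions, then take the best MMLU."""
    eligible = [
        c for c in candidates
        if c.weak_regressed is None or c.weak_regressed <= max_regressed
    ]
    return max(eligible, key=lambda c: c.mmlu_full)


champion = select_champion(CANDIDATES)
print(champion.name, f"{champion.mmlu_full:.2f}%")
```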
---
## Compute infrastructure summary
| Resource | Specification |
|---|---|
| GPU | 8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total) |
| GPU utilization | ~100% sustained during training |
| Throughput (pretraining) | ~220,000 tokens/sec |
| Distributed training | DeepSpeed ZeRO-2 |
| Numerical precision | bfloat16 (training and inference) |
| Cloud provider | Vast.ai |
---
## Recommended usage
### Prompt template (Alpaca-style)
```
### System:
You are FF-LLM, a helpful assistant.
### Instruction:
<your question>
### Response:
```
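
A small helper keeps the template consistent across calls. This is a sketch: `build_prompt` is a hypothetical name, and the exact newline layout is assumed to match the prompt string shown in the quick-start example.

```python
DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."


def build_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
    """Fill the Alpaca-style template; the model continues after '### Response:'."""
    return (
        f"### System:\n{system}\n"
        f"### Instruction:\n{instruction.strip()}\n"
        "### Response:\n"
    )


print(build_prompt("What is the capital of France?"))
```

Ending the prompt on the `### Response:` line leaves the model to generate only the answer.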
### Decoding settings
- **Always use greedy decoding** (`do_sample=False`).
- Sampling has been shown to degrade factual accuracy by ~5pp on this model family.
### Quick start (transformers)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "francescofiamingo1/FF_3.13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers releases take `torch_dtype=` instead of `dtype=`.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

# Alpaca-style prompt; generation continues after "### Response:".
prompt = """### System:
You are FF-LLM, a helpful assistant.
### Instruction:
What is the capital of France?
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Greedy decoding, as recommended above.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
# The decoded text contains the prompt followed by the model's answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
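
Because the decoded output contains the prompt as well as the answer, it is often convenient to strip everything up to the response marker. A minimal sketch, assuming the Alpaca-style markers shown above (`extract_response` is a hypothetical helper, not part of transformers):

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text: str) -> str:
    """Return only the answer from a decoded prompt-plus-completion string."""
    # Keep the text after the first response marker (the one from the prompt).
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    answer = tail if found else generated_text
    # Cut at the next section header in case the model keeps generating turns.
    answer = answer.split("\n### ", 1)[0]
    return answer.strip()


decoded = (
    "### System:\nYou are FF-LLM, a helpful assistant.\n"
    "### Instruction:\nWhat is the capital of France?\n"
    "### Response:\nParis.\n### Instruction:\nAnother question"
)
print(extract_response(decoded))
```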
---
## Limitations and known weaknesses
- **2B parameters**: the knowledge ceiling is lower than that of 7B+ models
- **Format compliance moderate**: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
- **Entity disambiguation weakness**: occasional "anchor entity" over-attribution (e.g., default to Edison for inventor questions)
- **Weak domains** (per qualitative analysis): mathematics, literature, music, art
- **Strong domains**: biology, geography, basic science, factual short-form QA
---
## Variant lineage
| Variant | Status | Notes |
|---|---|---|
| FF_3 | base | initial release |
| FF_3.1 | published | post-SFT, MMLU 26.72% |
| FF_3.2 | **discontinued** | early experiment, not maintained |
| FF_3.11 | published | specialized variant, MMLU 25.20%, 106-bench 71% |
| **FF_3.13** | **current champion** | knowledge repair on FF_3.11 base, MMLU 28.05% |
| FF_3.14 | **rejected** | full SFT with humanities focus, MMLU flat (no improvement) |
| SLERP t=0.10 (FF_3.13 + FF_3.11) | candidate backup | MMLU 28.46% (+0.41pp vs FF_3.13), 106-bench tie |
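
For readers unfamiliar with SLERP merging, the sketch below shows spherical linear interpolation between two flattened weight vectors. It illustrates the general technique only: the card does not describe the merge tooling, whether interpolation was applied per tensor, or which model sits at `t = 0`, so those are assumptions here.

```python
import math


def slerp(v0: list[float], v1: list[float], t: float, eps: float = 1e-8) -> list[float]:
    """Spherical linear interpolation from v0 (t=0) to v1 (t=1)."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 < eps or norm1 < eps:
        # Degenerate input: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel (or antiparallel) vectors: linear interpolation is stable.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Toy example: with t = 0.10 the result stays close to the first vector.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.10)
print([round(x, 4) for x in merged])
```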
---
## Reproducibility
All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:
- Datasets: `s3://ff-llm-datasets/`
- Champion model: `s3://ff-llm-datasets/champions/latest/`
- Build scripts: `s3://ff-llm-datasets/ff314/build/` (includes Block E/F builders, anti-anchoring tables, philosophy seeds)
For reproduction support, contact the author.
---
## Citation
```bibtex
@misc{ff_3_13_2026,
author = {francescofiamingo1},
title = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
}
```
---
*Last updated: 2026-04-18*