Initialize project; model provided by the ModelHub XC community
Model: francescofiamingo1/FF_3.13 Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
328
README.md
Normal file
@@ -0,0 +1,328 @@
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-generation
- gpt2
- causal-lm
- fine-tuned
- knowledge-repair
pipeline_tag: text-generation
model_type: gpt2
---

# FF_3.13

> **Champion model** of the FF-LLM line. A 2.02B GPT-2 architecture model fine-tuned through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.

---

## Model overview

| | |
|---|---|
| **Architecture** | GPT-2 (causal LM) |
| **Parameters** | 2.02B |
| **Hidden size** | 2048 |
| **Layers** | 38 |
| **Attention heads** | 16 |
| **Vocab size** | 50,257 |
| **Tokenizer** | GPT-2 BPE |
| **Context length** | 2048 tokens (`n_positions` in `config.json`; the tokenizer's `model_max_length` is set to 1024) |
| **Precision** | bfloat16 (also fp16/fp32 compatible) |
| **License** | Apache 2.0 |
| **Author** | francescofiamingo1 |

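
As a sanity check on the 2.02B figure, the parameter count can be recomputed from the shapes in the table. This is a minimal sketch, assuming the standard GPT-2 block layout with biases and tied input/output embeddings; the MLP width (`n_inner = 8192`) and position count (`n_positions = 2048`) are read from this repository's `config.json`, and the function name `gpt2_param_count` is illustrative rather than part of any library.

```python
def gpt2_param_count(n_layer: int, d_model: int, n_inner: int,
                     vocab_size: int, n_positions: int) -> int:
    """Count parameters of a GPT-2 style decoder with tied embeddings."""
    # Token embeddings (shared with the output head) + learned position embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Attention: fused QKV projection and output projection, each with a bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up projection and down projection, each with a bias.
    mlp = (d_model * n_inner + n_inner) + (n_inner * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d_model
    return embeddings + n_layer * per_layer + final_norm


total = gpt2_param_count(n_layer=38, d_model=2048, n_inner=8192,
                         vocab_size=50257, n_positions=2048)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```
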
---

## Benchmark performance

### MMLU (Massive Multitask Language Understanding)

Evaluated with `lm-eval-harness v0.4.11`, greedy decoding.

| Split | Score |
|---|---|
| **MMLU full (14,042 items)** | **28.05%** |
| MMLU dev (285 items) | 25.61% |

### Macro-domain breakdown (MMLU full)

| Macro | Subjects | Accuracy |
|---|---|---|
| STEM | 19 subjects | 30.70% |
| Humanities | 13 subjects | 26.06% |
| Social Sciences | 12 subjects | 30.03% |
| Other (medicine, law, professional) | 13 subjects | 29.32% |

### 106-bench (custom factual benchmark)

Custom 106-prompt benchmark with strict TRUTH-list scoring:

| Category | N | Score |
|---|---|---|
| arithmetic | 5 | 5/5 (100.0%) |
| open-ended | 1 | 1/1 (100.0%) |
| person | 25 | 22/25 (88.0%) |
| science | 25 | 21/25 (84.0%) |
| geography | 25 | 15/25 (60.0%) |
| format compliance | 25 | 15/25 (60.0%) |
| **TOTAL** | **106** | **79/106 (74.5%)** |

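
The total row is a pooled (micro) average over all 106 prompts, so the two small categories barely affect it. The sketch below recomputes it from the per-category counts in the table and, for comparison, also shows the unweighted (macro) average; the macro figure is an illustration and is not a number reported by the author.

```python
# (correct, total) per category, copied from the 106-bench table.
scores = {
    "arithmetic": (5, 5),
    "open-ended": (1, 1),
    "person": (22, 25),
    "science": (21, 25),
    "geography": (15, 25),
    "format compliance": (15, 25),
}

correct = sum(c for c, _ in scores.values())
total = sum(n for _, n in scores.values())

# Micro average: every prompt counts equally.
micro = 100 * correct / total
# Macro average: every category counts equally, regardless of its size.
macro = sum(100 * c / n for c, n in scores.values()) / len(scores)

print(f"micro: {correct}/{total} = {micro:.1f}%")
print(f"macro: {macro:.1f}%")
```
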
### Improvement vs precursors

| Model | MMLU full | Δ vs FF_3.13 |
|---|---|---|
| FF_3 (base, original release) | — | — |
| FF_3.1 (post-SFT) | 26.72% | -1.33pp |
| FF_3.11 (specialized variant) | 25.20% | -2.85pp |
| **FF_3.13 (this model)** | **28.05%** | **— champion** |

---

## Training pipeline — chronological view

The pipeline comprised **7 distinct stages**, two of which (DPO and LoRA) were rejected and do not contribute to the final model. Below is the complete history.

### Stage 1 — Pretraining

**Architecture chosen:** GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.

| Item | Value |
|---|---|
| Hardware | 8× NVIDIA RTX 5090 (32 GB each = 256 GB VRAM total) |
| Throughput | **~220,000 tokens/sec sustained** (100% GPU utilization, all 8 GPUs in parallel) |
| Framework | PyTorch + DeepSpeed ZeRO-2 |
| Precision | bfloat16 |
| **Total pretraining tokens** | **~90 billion tokens** |
| Wall-clock pretraining time | ~5 days continuous (90B / 220K tok/s ≈ 4.7 days) |

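
The wall-clock estimate follows directly from the token budget and the sustained throughput. A short arithmetic sketch, using only the approximate figures from the table (the per-GPU split assumes the load is spread evenly across the 8 cards):

```python
tokens = 90e9           # ~90 billion pretraining tokens
throughput = 220_000    # ~220,000 tokens/sec across all 8 GPUs
n_gpus = 8

seconds = tokens / throughput
days = seconds / 86_400          # 86,400 seconds per day
per_gpu = throughput / n_gpus    # assumes an even split across GPUs

print(f"{days:.1f} days, {per_gpu:,.0f} tokens/sec per GPU")
```
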
#### Pretraining data composition

The pretraining corpus was assembled from **8 distinct sources**, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.

| Dataset | Module | Type | Quality weight |
|---|---|---|---|
| **FineWeb general** | M1 BASE | Web | medium |
| **FineWeb 10BT** | M1 Extra25 | Web high quality | medium |
| **FineWeb EDU** | M2 BASE | Educational | **high** |
| **FineWeb EDU extended** | M2 Extra | Educational reasoning | medium |
| **C4 EN** | M1 C4 | Web filtered | medium |
| **Wikipedia EN** | M1 BASE | Encyclopedic | low |
| **Web Clean custom** | M1 BASE / Extra | Web filtered | low |
| **News crawl** | M1 BASE | Journalistic | low |

#### Mix proportions (approximate)

- **60–65% FineWeb** (various slices: general, 10BT, EDU, EDU extended)
- **15–20% C4 EN**
- **5–10% Wikipedia EN**
- **5–10% Web Clean custom**
- **~5% News crawl**

This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.

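
One way to read the proportions is as sampling weights over sources. The sketch below takes the midpoint of each listed range (an assumption; only approximate ranges are given) and draws source names in proportion. The source keys and the `sample_sources` helper are hypothetical, and the actual data loader used for pretraining is not described here.

```python
import random

# Midpoints of the approximate ranges listed above (percent of the mix).
mix = {
    "fineweb": 62.5,       # 60-65%
    "c4_en": 17.5,         # 15-20%
    "wikipedia_en": 7.5,   # 5-10%
    "web_clean": 7.5,      # 5-10%
    "news_crawl": 5.0,     # ~5%
}
total = sum(mix.values())
# Normalize so the weights sum to 1 even if the midpoints did not add up to 100.
weights = {name: share / total for name, share in mix.items()}


def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n source names with probability proportional to the mix weights."""
    rng = random.Random(seed)
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


batch = sample_sources(10_000)
for name in weights:
    observed = batch.count(name) / len(batch)
    print(f"{name:14s} target={weights[name]:.3f} observed={observed:.3f}")
```
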
### Stage 2 — Supervised Fine-Tuning (SFT)

**Objective:** general instruction-following + factual knowledge alignment.

**Data sources (~860K total examples):**
- **OpenHermes** (cleaned)
- **UltraChat** (cleaned)
- **WildChat** (cleaned)
- **Numina** (math reasoning)
- **OpenThoughts** (chain-of-thought)
- **Eurus** (multi-task)

Composition: ~760K core + 100K augmentation examples. Sharded under `s3://ff-llm-datasets/sft/shards_v2/`.

### Stage 3 — Direct Preference Optimization (DPO) — **REJECTED**

**Two DPO experiments were attempted and discarded:**

| DPO variant | Pairs | Result |
|---|---|---|
| v1 — WizardLM/Alpaca preferences | 38,863 | **-3pp MMLU** → rejected |
| v2 — UltraFeedback (argilla/ultrafeedback-binarized) | 60,917 | **-3pp MMLU** → rejected |

**Lesson:** DPO caused an MMLU regression of about 3pp in both experiments, regardless of hyperparameters. **Not used in the final model.**

### Stage 4 — Distillation v3

Knowledge distillation from larger teacher models on a curated question set.

| Item | Value |
|---|---|
| Total questions | 108,779 |
| Source mix | hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%) |
| S3 path | `s3://ff-llm-datasets/distill_v3/` |

### Stage 5 — LoRA experiments — **REJECTED**

Several LoRA fine-tuning runs were attempted for targeted improvements:

| LoRA experiment | Examples | Result |
|---|---|---|
| LoRA v4b (synthetic instruction) | 6,000 | marginal, not promoted |
| LoRA format-only v1/v3 | 1,779–2,092 | **catastrophic forgetting** (-3 to -4pp MMLU full) |

**Lesson:** LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. **Not used in the final model.**

### Stage 6 — Surgical Fine-Tuning

Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).

| Item | Value |
|---|---|
| Examples | 3,000 |
| Path | `D:\ff_llm\ff31_surgical.jsonl` |

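
The four output disciplines named above can each be expressed as a simple check. The validators below are a hypothetical sketch of what strict compliance might mean; the exact rules used to build `ff31_surgical.jsonl` or to score the format bench are not documented, so the accepted patterns (for example, tolerating a trailing period) are assumptions.

```python
import re


def is_yes_no(text: str) -> bool:
    """True if the answer is exactly 'yes' or 'no' (any case, optional period)."""
    return re.fullmatch(r"(yes|no)\.?", text.strip(), flags=re.IGNORECASE) is not None


def is_single_letter(text: str, choices: str = "ABCD") -> bool:
    """True if the answer is one MCQ letter, optionally followed by '.' or ')'."""
    return re.fullmatch(rf"[{choices}][.)]?", text.strip()) is not None


def is_exact_n_list(text: str, n: int) -> bool:
    """True if the answer has exactly n non-empty lines (one item per line)."""
    items = [line for line in text.splitlines() if line.strip()]
    return len(items) == n


def is_numeric_only(text: str) -> bool:
    """True if the answer is a single integer or decimal number and nothing else."""
    return re.fullmatch(r"-?\d+(\.\d+)?", text.strip()) is not None


print(is_yes_no("Yes."), is_single_letter("B) Paris"), is_numeric_only("42"))
```

A checker of this kind only measures form, not correctness, so it would complement rather than replace truth-list scoring.
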
### Stage 7 — Knowledge Repair Training (produces FF_3.13)

The decisive stage that turned FF_3.11 into FF_3.13.

**Dataset composition (16,006 total):**

| Block | Description | Examples |
|---|---|---|
| Block A | MMLU-style MCQ (multiple choice questions across diverse subjects) | 10,714 |
| Block B | Factual concise (TruthfulQA-like, <100 char answers) | 929 |
| Block D | Numeric microreasoning (arithmetic word problems with step solutions) | 3,562 |
| Validation set | held-out for monitoring | 801 |

**Training configuration:**

| Item | Value |
|---|---|
| Hardware | 8× NVIDIA RTX 5090 |
| Framework | DeepSpeed ZeRO-2 |
| Precision | bfloat16 |
| Optimizer | AdamW |
| Learning rate | 2.5e-6 (cosine schedule) |
| Epochs | 3 (early-stopped at step 200/357) |
| Effective batch size | (configured for 8-GPU DDP) |
| Wall-clock | ~30 min total |

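
The step counts in the table allow a rough reconstruction of the schedule. The sketch below assumes 15,205 training examples (Blocks A, B and D, with the 801 validation examples held out) and a warmup ratio of 0.05 as listed in `RELEASE_NOTES.txt`; the implied batch size of about 128 is an inference from 357 steps over 3 epochs, not a documented setting, and `lr_at` is an illustrative linear-warmup plus cosine-decay formula rather than the exact scheduler used.

```python
import math

train_examples = 15_205   # 10,714 + 929 + 3,562 (validation set excluded)
epochs = 3
total_steps = 357

steps_per_epoch = total_steps // epochs                       # 119
# Smallest batch size that covers one epoch in that many optimizer steps.
implied_batch = math.ceil(train_examples / steps_per_epoch)   # 128


def lr_at(step: int, peak: float = 2.5e-6, warmup_ratio: float = 0.05) -> float:
    """Linear warmup to `peak`, then cosine decay to zero at `total_steps`."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, implied_batch)
for step in (50, 100, 150, 200):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the learning rate is still well above zero at the checkpoints that were evaluated (steps 50 to 200), since the run stopped before the cosine decay completed.
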
**Checkpoint sweep & selection:**

| Checkpoint | MMLU full | Status |
|---|---|---|
| 1-epoch ckpt-100 | 27.47% | not selected |
| 3-epoch ckpt-50 | 27.21% | not selected |
| 3-epoch ckpt-100 | 27.86% | not selected |
| **3-epoch ckpt-150** | **28.05%** | **CHAMPION → FF_3.13** ✅ |
| 3-epoch ckpt-200 | 28.17% | rejected (+0.12pp marginal, regressed 5/6 weak domains) |

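
The selection can be phrased as a simple rule: prefer a later checkpoint only when it improves MMLU by a meaningful margin. The sketch below uses the 0.15pp stability threshold mentioned in `RELEASE_NOTES.txt`; it does not model the second criterion (regression on weak domains), because per-domain scores for ckpt-200 are not listed, and `select_checkpoint` is an illustrative helper, not the author's selection script.

```python
# MMLU full accuracy (%) per checkpoint of the 3-epoch run, from the table.
sweep = {50: 27.21, 100: 27.86, 150: 28.05, 200: 28.17}


def select_checkpoint(scores: dict[int, float], min_gain_pp: float = 0.15) -> int:
    """Walk checkpoints in step order and move to a later one only if it beats
    the current pick by at least `min_gain_pp` percentage points."""
    steps = sorted(scores)
    best = steps[0]
    for step in steps[1:]:
        if scores[step] - scores[best] >= min_gain_pp:
            best = step
    return best


print(select_checkpoint(sweep))
```
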
---

## Compute infrastructure summary

| Resource | Specification |
|---|---|
| GPU | 8× NVIDIA RTX 5090 (32 GB VRAM each, 256 GB total) |
| GPU utilization | ~100% sustained during training |
| Throughput (pretraining) | ~220,000 tokens/sec |
| Distributed training | DeepSpeed ZeRO-2 |
| Numerical precision | bfloat16 (training and inference) |
| Cloud provider | Vast.ai |

---

## Recommended usage

### Prompt template (Alpaca-style)

```
### System:
You are FF-LLM, a helpful assistant.

### Instruction:
<your question>

### Response:
```

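
Filling the template by hand is error-prone because the blank lines and the trailing `### Response:` line matter. A small helper, sketched here under the assumption that the layout matches the template above exactly (the names `build_prompt` and `DEFAULT_SYSTEM` are illustrative):

```python
DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."


def build_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Fill the Alpaca-style template; the model continues after '### Response:'."""
    return (
        f"### System:\n{system}\n\n"
        f"### Instruction:\n{question.strip()}\n\n"
        "### Response:\n"
    )


print(build_prompt("What is the capital of France?"))
```

The returned string can be passed straight to the tokenizer in place of a hand-written prompt.
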
### Decoding settings

- **Always use greedy decoding** (`do_sample=False`).
- Sampling has been shown to degrade factual accuracy by ~5pp on this model family.

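
For reference, greedy decoding corresponds to a small set of `generate` arguments. The dictionary below is a sketch: `do_sample`, `num_beams`, `repetition_penalty` and `max_new_tokens` are standard `transformers` generation parameters, while the constant name `GREEDY_KWARGS`, the `is_greedy` helper and the value of 50 new tokens are illustrative choices.

```python
# Keyword arguments intended for `model.generate(**inputs, **GREEDY_KWARGS)`.
GREEDY_KWARGS = {
    "do_sample": False,          # deterministic: always take the most likely token
    "num_beams": 1,              # no beam search
    "repetition_penalty": 1.0,   # neutral value, i.e. no penalty applied
    "max_new_tokens": 50,        # illustrative cap on answer length
}


def is_greedy(kwargs: dict) -> bool:
    """True if the settings describe plain greedy decoding."""
    return (
        kwargs.get("do_sample", False) is False
        and kwargs.get("num_beams", 1) == 1
        and kwargs.get("repetition_penalty", 1.0) == 1.0
    )


print(is_greedy(GREEDY_KWARGS))
```

Keeping the settings in one place makes it harder for a sampling flag to slip into an evaluation run unnoticed.
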
### Quick start (transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "francescofiamingo1/FF_3.13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# `dtype` is the keyword in recent transformers releases; older ones use `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

# Alpaca-style prompt; generation continues after "### Response:".
prompt = """### System:
You are FF-LLM, a helpful assistant.

### Instruction:
What is the capital of France?

### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Greedy decoding, as recommended in the decoding settings.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
# The decoded string contains the prompt followed by the model's answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

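
Because the decoded string includes the prompt, it is often convenient to keep only the answer. The helper below is a sketch that assumes the Alpaca-style template shown earlier; `extract_response` is a hypothetical name, and the `decoded` string is a hand-written stand-in for model output, not a real generation.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return only the answer from a decoded prompt + completion string."""
    # Everything after the first response marker; the prompt precedes it.
    _, _, answer = decoded.partition(RESPONSE_MARKER)
    # Drop any further template section the model may have started on its own.
    answer = answer.split("\n###", 1)[0]
    return answer.strip()


# Hand-written example string (not actual model output).
decoded = (
    "### System:\nYou are FF-LLM, a helpful assistant.\n\n"
    "### Instruction:\nWhat is the capital of France?\n\n"
    "### Response:\nParis\n\n### Instruction:\nNext question"
)
print(extract_response(decoded))
```
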
---

## Limitations and known weaknesses

- **2B parameters** — knowledge ceiling lower than 7B+ models
- **Format compliance moderate**: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
- **Entity disambiguation weakness**: occasional "anchor entity" over-attribution (e.g., default to Edison for inventor questions)
- **Weak domains** (per qualitative analysis): mathematics, literature, music, art
- **Strong domains**: biology, geography, basic science, factual short-form QA

---

## Variant lineage

| Variant | Status | Notes |
|---|---|---|
| FF_3 | base | initial release |
| FF_3.1 | published | post-SFT, MMLU 26.72% |
| FF_3.2 | **discontinued** | early experiment, not maintained |
| FF_3.11 | published | specialized variant, MMLU 25.20%, 106-bench 71% |
| **FF_3.13** | **current champion** | knowledge repair on FF_3.11 base, MMLU 28.05% |
| FF_3.14 | **rejected** | full SFT with humanities focus, MMLU flat (no improvement) |
| SLERP t=0.10 (FF_3.13 + FF_3.11) | candidate backup | MMLU 28.46% (+0.41pp), 106-bench tie |

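
SLERP (spherical linear interpolation) blends two weight vectors along the arc between them instead of along a straight line. The sketch below shows the formula on plain Python lists; it is an illustration of the general technique, not the merge script used for this candidate, and it assumes `t=0` returns the first argument. Which of the two models plays that role here is not stated.

```python
import math


def slerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Spherical linear interpolation between vectors a (t=0) and b (t=1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Angle between the two vectors, with the cosine clamped for numerical safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / math.sin(omega)
    weight_b = math.sin(t * omega) / math.sin(omega)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy example with two orthogonal unit vectors and the t value from the table.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.10)
print(merged)
```
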
---

## Reproducibility

All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:

- Datasets: `s3://ff-llm-datasets/`
- Champion model: `s3://ff-llm-datasets/champions/latest/`
- Build scripts: `s3://ff-llm-datasets/ff314/build/` (includes Block E/F builders, anti-anchoring tables, philosophy seeds)

For reproduction support, contact the author.

---

## Citation

```bibtex
@misc{ff_3_13_2026,
author = {francescofiamingo1},
title = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
}
```

---

*Last updated: 2026-04-18*
86
RELEASE_NOTES.txt
Normal file
@@ -0,0 +1,86 @@
FF_3.13 — Release Notes
=========================

Model: FF_3.13
Source: ckpt-150 (from 3-epoch repair training, Vast instance 103.177.249.208:33448)
Source path: /workspace/ff311_repair_output/checkpoint-150
Status: CHAMPION (supersedes FF_3.11 as primary FF-LLM release)
Date: 2026-04-17

Architecture
------------
GPT-2 decoder-only, 2.02B parameters
n_layer=38, d_model=2048, n_heads=16, n_inner=8192, context=2048
Vocabulary: GPT-2 BPE, 50257 tokens
Precision: bf16

Training summary
----------------
Base: FF_3.11 / mix07v4_0.2 (SLERP merge of FF_3.1 + surgical FT, t=0.20)
Hardware: 8x RTX 5090 (DeepSpeed ZeRO-2)
Dataset: 15,205 train + 801 val examples (A=10714 MCQ / B=929 factual / D=3562 numeric)
Hyperparams: lr=2.5e-6 (cosine), warmup=0.05, bf16, gradient_checkpointing
Epochs: 3 (early-stopped at step 200/357 via no-improvement rule)
Selected step: 150

Main benchmark notes
--------------------
MMLU full (lm-eval-harness v0.4.11): 28.05%
- vs FF_3.11 baseline (25.20%): +2.85pp
- vs FF_3.1 baseline (26.72%): +1.33pp
- social sciences: 30.74% / stem: 29.34% / other: 26.84% / humanities: 24.48%

106-bench total (greedy, rep_penalty=1.0): 74.5%
- arith: 100.0% (vs FF_3.11 80.0%)
- science: 84.0% (vs FF_3.11 80.0%)
- geo: 64.0% (vs FF_3.11 56.0%)
- person: 88.0% (tied)
- format: 56.0% (vs FF_3.11 72.0% — known regression)

Strengths
---------
- arithmetic / science / geo factual recall
- the six previously damaged MMLU domains targeted by the repair (prof_medicine 43.38%, hs_statistics 38.43%, security 39.59%,
hs_macroeconomics 34.62%, hs_government 32.64%, medical_genetics 26.00%)

Known gaps
----------
- Strict format compliance (-16pp vs FF_3.11 on yes/no and exact-count prompts)
- Humanities / art / entity disambiguation (e.g., Edison over-anchoring)
- Next repair round should add entity-disambiguation, humanities, arts, and invention-history examples

Rejected alternative
--------------------
ckpt-200 (MMLU 28.17%, +0.12pp over ckpt-150) was rejected:
- gain below 0.15pp stability threshold
- degraded 5 of 6 weak domains

Prompt template (required)
--------------------------
### System:
You are FF-LLM, a helpful assistant.

### Instruction:
{question}

### Response:

Decoding recommendations
------------------------
Use greedy (do_sample=False, num_beams=1, top_p=1.0, top_k=0, repetition_penalty=1.0).
Sampling at temperature 0.7 underperforms greedy on factual tests (~29% vs ~34%).

Storage
-------
S3 primary: s3://ff-llm-datasets/ff313/final/
S3 alias: s3://ff-llm-datasets/champions/latest/
HuggingFace: francescofiamingo1/FF_3.13
Local master: C:\Users\f_fia\FF_3.13_master\ (full, incl. training artifacts)
Local infer: C:\Users\f_fia\FF_3.13_inference\ (inference-only subset)

Excluded from master
--------------------
/workspace/ff311_repair_output/checkpoint-150/global_step150/ (27 GB, DeepSpeed ZeRO-2
optimizer shards). Not preserved — useful only for resuming DeepSpeed training from this
exact step; weights are intact in model.safetensors. Conservative exclusion to avoid
disproportionate storage cost.
35
config.json
Normal file
@@ -0,0 +1,35 @@
{
  "activation_function": "gelu_new",
  "add_cross_attention": false,
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "bfloat16",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 2048,
  "n_embd": 2048,
  "n_head": 16,
  "n_inner": 8192,
  "n_layer": 38,
  "n_positions": 2048,
  "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "tie_word_embeddings": true,
  "transformers_version": "5.5.4",
  "use_cache": false,
  "vocab_size": 50257
}
9
generation_config.json
Normal file
@@ -0,0 +1,9 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": [
    50256
  ],
  "pad_token_id": 50256,
  "transformers_version": "5.5.4"
}
50001
merges.txt
Normal file
File diff suppressed because it is too large
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ec7e2c09a7fcb7d400d84d126b3a95a3eb6c40e229be0cece611ba267944a9aa
size 4247379664
30
special_tokens_map.json
Normal file
@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
250326
tokenizer.json
Normal file
File diff suppressed because it is too large
13
tokenizer_config.json
Normal file
@@ -0,0 +1,13 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "is_local": true,
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
50259
vocab.json
Normal file
File diff suppressed because it is too large