---
library_name: transformers
tags:
- convergent-evolution
- fourier-features
- number-embeddings
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
---

# convergent-llama-300M-adamw-isolate

A 300M-parameter language model trained from scratch on **FineWeb-Edu 10BT** (~9.4B tokens, 1 epoch) as part of the *Convergent Evolution* project, which investigates how Fourier features emerge in LLM number embeddings.

## Model details

| | |
|---|---|
| **Architecture** | LLaMA-style Transformer (12 layers, 1024 hidden, 16 heads, GQA) |
| **Parameters** | ~300M |
| **Optimizer** | AdamW |
| **Data perturbation** | block-diagonal attention mask (numbers cannot attend to context) |
| **Training data** | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) sample-10BT (~9.4B tokens) |
| **Context length** | 1024 |
| **Tokenizer** | Llama 3 (128K vocab) |
| **Batch size** | 512 sequences |

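
The card does not spell out how the block-diagonal mask in the "Data perturbation" row is built. Purely as an illustration, here is a minimal sketch of one possible reading: each contiguous run of number tokens forms its own block and may attend (causally) only within that block, while all other tokens use ordinary causal attention. The function name `isolate_mask`, the `is_number` flags, and the treatment of non-number tokens are assumptions made for this example, not the project's training code.

```python
def isolate_mask(is_number):
    """Build an n x n boolean attention mask (hypothetical helper).

    mask[i][j] is True when token i may attend to token j.
    is_number[i] marks whether token i is part of a number.
    """
    n = len(is_number)

    # Give every contiguous run of number tokens its own span id;
    # non-number tokens keep span id None.
    span_id = [None] * n
    current = -1
    for i, flag in enumerate(is_number):
        if flag:
            if i == 0 or not is_number[i - 1]:
                current += 1  # a new number span starts here
            span_id[i] = current

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions j <= i
            if is_number[i]:
                # Assumption: a number token sees only its own number span.
                mask[i][j] = span_id[j] == span_id[i]
            else:
                # Assumption: other tokens keep plain causal attention.
                mask[i][j] = True
    return mask


# Example: tokens ["The", "year", "19", "84", "was"]
mask = isolate_mask([False, False, True, True, False])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions the two number tokens in the example form a 2x2 block on the diagonal and cannot see "The" or "year", which is one way to read "numbers cannot attend to context".
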
## Training dynamics
Intermediate checkpoints are saved as branches: `tokens-200M`, `tokens-400M`, ..., `tokens-9.6B`.

```python
from transformers import AutoModelForCausalLM

# Load final checkpoint
model = AutoModelForCausalLM.from_pretrained("deqing/convergent-llama-300M-adamw-isolate")

# Load intermediate checkpoint (e.g., at 1B tokens)
model = AutoModelForCausalLM.from_pretrained("deqing/convergent-llama-300M-adamw-isolate", revision="tokens-1B")
```

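
To walk through the checkpoints in training order, it can help to list the branches that actually exist rather than guess their names. The sketch below assumes the branch names follow the `tokens-<number><M|B>` pattern shown above; `parse_token_count` and `list_checkpoint_branches` are hypothetical helpers written for this example, and the Hub lookup relies on `huggingface_hub.list_repo_refs`, which needs network access.

```python
from decimal import Decimal

# Multipliers for the unit suffix in a branch name (assumed naming pattern).
_UNITS = {"M": 10**6, "B": 10**9}


def parse_token_count(branch):
    """Convert a name like 'tokens-200M' or 'tokens-9.6B' to an integer token count."""
    prefix = "tokens-"
    if not branch.startswith(prefix) or branch[-1] not in _UNITS:
        raise ValueError(f"not a checkpoint branch: {branch!r}")
    number = branch[len(prefix):-1]
    # Decimal avoids float rounding for values such as 9.6.
    return int(Decimal(number) * _UNITS[branch[-1]])


def list_checkpoint_branches(repo_id):
    """Return the repo's 'tokens-*' branches sorted by token count."""
    # Imported lazily so the parsing helper works without huggingface_hub installed.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    names = [b.name for b in refs.branches if b.name.startswith("tokens-")]
    return sorted(names, key=parse_token_count)
```

Each name returned by `list_checkpoint_branches("deqing/convergent-llama-300M-adamw-isolate")` can then be passed as `revision=` to `from_pretrained`.
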
## Citation

Paper forthcoming.