---
license: mit
base_model: gpt-oss-20b
tags:
- math
- reasoning
- lora
- mixture-of-experts
- transformers
- peft
datasets:
- AI-MO/NuminaMath-TIR
---

# GPT-OSS-20B NuminaMath

## Overview

This repository provides LoRA adapters for GPT-OSS-20B, fine-tuned on the NuminaMath-TIR dataset (about 70k data points) to improve olympiad-style mathematical reasoning and structured problem solving.

The adapters are designed to be used with the base model gpt-oss-20b, a Mixture-of-Experts (MoE) transformer. Fine-tuning focuses on improving the model’s ability to generate step-by-step reasoning, symbolic manipulation, and detailed explanations when solving math problems.

Instead of updating the full model weights, parameter-efficient fine-tuning (PEFT) was used to train only a small number of parameters attached to the attention layers. This keeps training compute requirements relatively low while still adapting the model's reasoning behavior.

The resulting LoRA adapters can be loaded on top of the base model to enhance its performance on olympiad-style tasks such as algebra, arithmetic, and problem-solving explanations.

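A minimal loading sketch for the adapters described above. It assumes the base model is available on the Hugging Face Hub as `openai/gpt-oss-20b` and that `torch`, `transformers`, `peft`, and `accelerate` are installed. `ADAPTER_ID` is a hypothetical placeholder, not the actual repository name, so replace it with the path or Hub ID where the adapters are stored.

```python
# Assumption: Hub ID of the base model; the model card only names it "gpt-oss-20b".
BASE_MODEL_ID = "openai/gpt-oss-20b"
# Hypothetical placeholder; point this at the real adapter repository or a local folder.
ADAPTER_ID = "your-username/gpt-oss-20b-numinamath-lora"


def load_model_with_adapter(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model in BF16 and attach the LoRA adapters on top of it."""
    # Heavy imports are kept inside the function so the module can be imported cheaply.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,  # matches the BF16 precision listed in this card
        device_map="auto",           # requires `accelerate`
    )
    # Wrap the frozen base model with the trained adapter weights.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `tokenizer, model = load_model_with_adapter()` downloads the base weights on first use, so it needs a GPU with enough memory for a 20B-parameter model.
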
---

# Model Details

| Field                          | Value                          |
| ------------------------------ | ------------------------------ |
| Base Model                     | gpt-oss-20b                    |
| Architecture                   | Mixture-of-Experts Transformer |
| Fine-Tuning Method             | LoRA (PEFT)                    |
| Precision                      | BF16                           |
| Max Sequence Length (training) | 8192 tokens                    |
| Training Hardware              | NVIDIA H100                    |
| Framework                      | PyTorch + Transformers + PEFT  |

---

# Training Data

## Dataset

The model was fine-tuned using the **NuminaMath-TIR dataset**, which contains mathematical problems paired with structured reasoning traces and final answers.

Dataset link:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

The dataset includes problems across several mathematical domains, including:

* arithmetic
* algebra
* number theory
* geometry
* calculus
* reasoning-based problem solving

The dataset emphasizes **step-by-step explanations**, allowing the model to learn how to produce reasoning chains rather than only final answers.

---

## Dataset Processing

The dataset was originally provided as a **CSV file** and processed prior to training.

Processing pipeline:

1. Loaded using **pandas**
2. Columns normalized to:
   * `prompt`
   * `response`
3. Empty rows removed
4. Converted to **Hugging Face Dataset format**
5. Randomized train/validation split

Dataset split:

| Split      | Percentage |
| ---------- | ---------- |
| Train      | 95%        |
| Validation | 5%         |

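The pipeline above could look roughly like the sketch below. The source column names `problem` and `solution`, the seed, and both helper names are assumptions made for illustration; the actual preprocessing script is not reproduced in this card.

```python
import pandas as pd


def clean_frame(
    df: pd.DataFrame,
    prompt_col: str = "problem",     # assumption: original prompt column name
    response_col: str = "solution",  # assumption: original response column name
) -> pd.DataFrame:
    """Steps 2-3: normalize column names to prompt/response and drop empty rows."""
    out = df.rename(columns={prompt_col: "prompt", response_col: "response"})
    out = out[["prompt", "response"]].dropna()
    # Treat whitespace-only cells as empty as well.
    non_empty = (out["prompt"].astype(str).str.strip() != "") & (
        out["response"].astype(str).str.strip() != ""
    )
    return out[non_empty].reset_index(drop=True)


def to_splits(df: pd.DataFrame, seed: int = 42):
    """Steps 4-5: convert to a Hugging Face Dataset and make a random 95/5 split."""
    # Imported lazily so clean_frame can be used without `datasets` installed.
    from datasets import Dataset

    dataset = Dataset.from_pandas(df, preserve_index=False)
    splits = dataset.train_test_split(test_size=0.05, seed=seed)
    return splits["train"], splits["test"]
```

Step 1 would then be `raw = pd.read_csv(path_to_csv)` followed by `train_ds, val_ds = to_splits(clean_frame(raw))`, where `path_to_csv` is wherever the CSV export was saved.
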
---

# Instruction Format

Training samples were converted into the following **chat-style instruction format** compatible with the GPT-OSS tokenizer.

```
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
{response}
<|im_end|>
```

This ChatML-style layout marks the user and assistant turns explicitly, so the model learns **structured conversational reasoning** in the assistant turn. The same layout is used by many modern LLMs.

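A small sketch of how one sample can be rendered into the template above, and how a matching prompt can be built at inference time. The function names are hypothetical, and the exact whitespace between turns is an assumption based on the layout shown; the training notebook may differ in detail.

```python
# Templates mirror the block above; newline placement is an assumption.
USER_TURN = "<|im_start|>user\n{prompt}\n<|im_end|>\n"
ASSISTANT_TURN = "<|im_start|>assistant\n{response}\n<|im_end|>"


def format_example(prompt: str, response: str) -> str:
    """Render one training sample (user turn followed by the assistant turn)."""
    return USER_TURN.format(prompt=prompt.strip()) + ASSISTANT_TURN.format(
        response=response.strip()
    )


def format_inference_prompt(prompt: str) -> str:
    """Render the user turn and open the assistant turn for generation."""
    return USER_TURN.format(prompt=prompt.strip()) + "<|im_start|>assistant\n"
```

Because the prompt and response are passed as arguments rather than spliced into the template string, braces in LaTeX such as `\frac{1}{2}` are left untouched. Using the same layout at inference time as in training is generally the safer choice for a fine-tuned model.
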
---

# Training Procedure

The model was fine-tuned using **LoRA adapters applied only to attention layers**.

Because **gpt-oss-20b is a Mixture-of-Experts (MoE) architecture**, LoRA was intentionally **not applied to expert layers** in order to preserve the routing structure and maintain training stability.

---

## LoRA Target Modules

Adapters were applied to the following projection layers:

```
q_proj
k_proj
v_proj
o_proj
```

These correspond to the **query, key, value, and output projections** within the attention mechanism.

---

## LoRA Configuration

| Parameter | Value |
| --------- | ----- |
| Rank (r)  | 64    |
| Alpha     | 128   |
| Dropout   | 0.05  |
| Bias      | none  |

Only the attention projections carry adapters, so the expert and routing weights of the base model are left unchanged while the adapters still shape the model's behavior.

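The table and target modules above map onto PEFT's `LoraConfig` roughly as follows. This is a sketch rather than the exact training code: `task_type` is an assumption not stated in this card, and `build_lora_config` is a hypothetical helper name.

```python
# Values taken from the LoRA tables in this card.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not listed in the card
}


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return lora_alpha / r


def build_lora_config():
    """Create the PEFT config; imported lazily so `peft` is only needed when called."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

With rank 64 and alpha 128 the low-rank update is scaled by 2. The config would typically be attached with `peft.get_peft_model(base_model, build_lora_config())`.
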
---

# Training Hyperparameters

| Parameter         | Value         |
| ----------------- | ------------- |
| Epochs            | 2             |
| Learning Rate     | 2e-4          |
| Optimizer         | AdamW (fused) |
| Adam β1           | 0.9           |
| Adam β2           | 0.95          |
| Weight Decay      | 0.01          |
| Warmup Ratio      | 0.03          |
| Max Gradient Norm | 1.0           |

Batch configuration:

| Parameter             | Value |
| --------------------- | ----- |
| Per Device Batch Size | 4     |
| Gradient Accumulation | 4     |
| Effective Batch Size  | 16    |

Maximum sequence length:

```
8192 tokens
```

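As a rough illustration, the tables above can be written as keyword arguments for `transformers.TrainingArguments`, together with the arithmetic behind the effective batch size. The mapping to argument names and the schedule estimate are assumptions for illustration, not the exact training script; the helper names are hypothetical.

```python
import math

# Hyperparameters from the tables above, using TrainingArguments field names.
TRAINING_KWARGS = {
    "num_train_epochs": 2,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
}
MAX_SEQ_LEN = 8192  # assumption: enforced when tokenizing, not via TrainingArguments


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def estimate_schedule(train_examples: int, epochs: int, batch: int, warmup_ratio: float):
    """Approximate total optimizer steps and warmup steps (rounding up partial batches)."""
    total_steps = math.ceil(train_examples / batch) * epochs
    return total_steps, math.ceil(total_steps * warmup_ratio)
```

On a single GPU, `effective_batch_size(4, 4)` gives the 16 listed above. With roughly 66,500 training examples (95% of about 70k), `estimate_schedule(66_500, 2, 16, 0.03)` suggests on the order of 8,300 optimizer steps with about 250 warmup steps. Exact counts depend on the true dataset size and on how the trainer rounds partial batches.
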
---

# Training Infrastructure

Training was performed on the following hardware:

```
1× NVIDIA H100 GPU
```

Training optimizations included:

* Flash Attention 2
* BF16 mixed precision
* TF32 enabled
* Gradient checkpointing
* Memory-optimized LoRA configuration

MoE compatibility adjustments included:

* LoRA applied **only to attention layers**
* CPU offloading disabled
* Gradient checkpointing configured with `use_reentrant=False`

Training frameworks used:

* PyTorch
* Hugging Face Transformers
* PEFT
* Hugging Face Datasets

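A hedged sketch of how the precision and checkpointing settings listed above are commonly expressed with PyTorch and `transformers.TrainingArguments`. The attention backend is left out because the exact model-loading arguments are not given in this card, and `enable_tf32` is a hypothetical helper name.

```python
# Precision and checkpointing flags, using TrainingArguments field names.
PRECISION_KWARGS = {
    "bf16": True,                    # BF16 mixed precision
    "tf32": True,                    # TF32 matmuls on Ampere/Hopper GPUs
    "gradient_checkpointing": True,
    # Non-reentrant checkpointing, as listed under the MoE adjustments.
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
}


def enable_tf32() -> None:
    """Enable TF32 directly in PyTorch, for scripts that do not use TrainingArguments."""
    import torch  # imported lazily so the constants above work without torch

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
```

These flags would be passed alongside the optimizer settings, for example `TrainingArguments(output_dir="out", **PRECISION_KWARGS)`, where `output_dir` is a placeholder. Constructing the arguments with `bf16=True` needs hardware that supports BF16.
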
---

# Evaluation

The model was evaluated periodically during training by computing the **validation loss**.

Metrics monitored:

* training loss
* validation loss

The model was trained for exactly 2 epochs on the full training split, without automated checkpoint selection. The final validation loss after 2 full epochs is 0.4039.



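For intuition, a validation loss can be converted to perplexity. The sketch below assumes the reported 0.4039 is a mean per-token cross-entropy in nats, which is the usual convention for causal language model losses but is not stated explicitly in this card.

```python
import math

FINAL_VAL_LOSS = 0.4039  # final validation loss reported above


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy_nats)


val_perplexity = perplexity(FINAL_VAL_LOSS)
```

Under that assumption the validation perplexity is about 1.50. This describes how well the model predicts the reference solutions token by token; it is not a measure of final-answer accuracy.
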
---

# Intended Use

This model is intended for:

* mathematical reasoning research
* educational demonstrations
* experimentation with reasoning-focused fine-tuning
* evaluation of math-capable language models

It is **not intended for high-stakes mathematical or scientific applications**.

---

# Limitations

Despite improvements from fine-tuning, the model still has several limitations:

* The model may generate **incorrect reasoning steps**.
* Mathematical derivations may lack **formal rigor**.
* Some areas of mathematics may be **underrepresented in the dataset**.
* Performance depends strongly on the **capabilities of the base model**.

Users should treat model outputs as **assistive suggestions rather than authoritative answers**.

---

# Ethical Considerations

Language models trained for reasoning may produce **confident but incorrect explanations**.

For educational or academic use:

* outputs should be verified independently
* the model should not be treated as an authoritative mathematical source

---

# Acknowledgements

This work builds upon the open-source ecosystem, including:

* Hugging Face Transformers
* the PEFT library for parameter-efficient fine-tuning
* the NuminaMath dataset
* research on Mixture-of-Experts transformer architectures

---

# Citation

Dataset:

https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

Training Notebook:

https://www.kaggle.com/code/tensorhydra/gpt-oss-20b-finetune-numinamath