---
license: mit
base_model: gpt-oss-20b
tags:
- math
- reasoning
- lora
- mixture-of-experts
- transformers
- peft
datasets:
- AI-MO/NuminaMath-TIR
---

# GPT-OSS-20B NuminaMath

## Overview

This repository provides LoRA adapters for GPT-OSS-20B, fine-tuned on the NuminaMath-TIR dataset, which contains 70k data points, to improve mathematical olympiad reasoning and structured problem solving.

The adapters are designed to be used with the base model gpt-oss-20b, a Mixture-of-Experts (MoE) transformer. Fine-tuning focuses on improving the model’s ability to generate step-by-step reasoning, symbolic manipulation, and detailed mathematical explanations when solving math problems.

Instead of updating the full model weights, parameter-efficient fine-tuning (PEFT) was used to train only a small number of parameters attached to the attention layers. This allows the adapters to improve reasoning ability while keeping training compute requirements relatively low.

The resulting LoRA adapters can be loaded on top of the base model to enhance its performance on mathematical olympiad reasoning tasks such as algebra, arithmetic, and problem-solving explanations.

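As a minimal sketch of how LoRA adapters of this kind are usually loaded with Transformers and PEFT: `openai/gpt-oss-20b` is assumed to be the Hub id of the base model, and `ADAPTER_ID` is a hypothetical placeholder, because the adapter repository id is not stated in this card. The heavy imports sit inside the function so that the snippet can be imported without downloading anything.

```python
# Assumed Hub id of the base model; adjust if your copy lives elsewhere.
BASE_MODEL_ID = "openai/gpt-oss-20b"
# Hypothetical placeholder: replace with the actual adapter repository or local path.
ADAPTER_ID = "your-username/gpt-oss-20b-numinamath-lora"


def load_model_with_adapter(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model in BF16 and attach the LoRA adapters on top of it."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,  # matches the BF16 precision listed in this card
        device_map="auto",
    )
    # Wrap the frozen base model with the trained adapter weights.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return tokenizer, model
```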

---

# Model Details

| Field              | Value                          |
| ------------------ | ------------------------------ |
| Base Model         | gpt-oss-20b                    |
| Architecture       | Mixture-of-Experts Transformer |
| Fine-Tuning Method | LoRA (PEFT)                    |
| Precision          | BF16                           |
| Context Length     | 8192 tokens                    |
| Training Hardware  | NVIDIA H100                    |
| Framework          | PyTorch + Transformers + PEFT  |

---

# Training Data

## Dataset

The model was fine-tuned using the **NuminaMath-TIR dataset**, which contains mathematical problems paired with structured reasoning traces and final answers.

Dataset link:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

The dataset includes problems across several mathematical domains including:

* arithmetic
* algebra
* number theory
* geometry
* calculus
* reasoning-based problem solving

The dataset emphasizes **step-by-step explanations**, allowing the model to learn how to produce reasoning chains rather than only final answers.

---

## Dataset Processing

The dataset was originally provided as a **CSV file** and processed prior to training.

Processing pipeline:

1. Loaded using **pandas**
2. Columns normalized to:
   * `prompt`
   * `response`
3. Empty rows removed
4. Converted to **Hugging Face Dataset format**
5. Randomized train/validation split

Dataset split:

| Split      | Percentage |
| ---------- | ---------- |
| Train      | 95%        |
| Validation | 5%         |

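The five processing steps above could be sketched roughly as follows. This is an illustration, not the original training script: the source column names `problem` and `solution`, the CSV path argument, and the seed `42` are assumptions that the card does not state.

```python
VAL_FRACTION = 0.05  # 95% train / 5% validation, as in the split table


def is_nonempty(value) -> bool:
    """True only for strings that contain something other than whitespace."""
    return isinstance(value, str) and value.strip() != ""


def prepare_splits(csv_path, prompt_col="problem", response_col="solution", seed=42):
    """Load a CSV, normalize column names, drop empty rows, and split 95/5."""
    import pandas as pd
    from datasets import Dataset

    # 1. Load with pandas.
    df = pd.read_csv(csv_path)
    # 2. Normalize the column names to `prompt` and `response`.
    df = df.rename(columns={prompt_col: "prompt", response_col: "response"})
    df = df[["prompt", "response"]]
    # 3. Remove rows where either field is missing or blank.
    keep = df["prompt"].map(is_nonempty) & df["response"].map(is_nonempty)
    df = df[keep].reset_index(drop=True)
    # 4. Convert to a Hugging Face Dataset.
    dataset = Dataset.from_pandas(df, preserve_index=False)
    # 5. Shuffle and split; the result has "train" and "test" keys.
    return dataset.train_test_split(test_size=VAL_FRACTION, shuffle=True, seed=seed)
```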

---

# Instruction Format

Training samples were converted into the following **chat-style instruction format** compatible with the GPT-OSS tokenizer.

```
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
{response}
<|im_end|>
```

This format enables the model to learn **structured conversational reasoning** and aligns with the instruction format used in many modern LLMs.

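A small helper that renders one `prompt`/`response` pair into the template shown above might look like this. The function name `format_example` and the whitespace stripping are assumptions made for illustration; the card does not describe how the original script handled surrounding whitespace.

```python
CHAT_TEMPLATE = (
    "<|im_start|>user\n"
    "{prompt}\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{response}\n"
    "<|im_end|>"
)


def format_example(prompt: str, response: str) -> str:
    """Render one training sample in the chat-style instruction format."""
    # Substituted values are inserted verbatim, so LaTeX braces in the
    # prompt or response are not interpreted as format fields.
    return CHAT_TEMPLATE.format(prompt=prompt.strip(), response=response.strip())


sample = format_example("What is 2 + 2?", "2 + 2 = 4")
```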

---

# Training Procedure

The model was fine-tuned using **LoRA adapters applied only to attention layers**.

Because **gpt-oss-20b is a Mixture-of-Experts (MoE) architecture**, LoRA was intentionally **not applied to expert layers** in order to preserve the routing structure and maintain training stability.

---

## LoRA Target Modules

Adapters were applied to the following projection layers:

```
q_proj
k_proj
v_proj
o_proj
```

These correspond to the **query, key, value, and output projections** within the attention mechanism.

---

## LoRA Configuration

| Parameter | Value |
| --------- | ----- |
| Rank (r)  | 64    |
| Alpha     | 128   |
| Dropout   | 0.05  |
| Bias      | none  |

Only attention projections were modified, ensuring minimal disruption to the base model while still enabling meaningful behavioral improvements.

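Expressed with PEFT, the target modules and configuration table above correspond roughly to the following. The `task_type` value is an assumption (the card does not state it), and the scaling factor shown is the standard LoRA ratio `alpha / r`.

```python
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "bias": "none",
    # Attention projections only; no expert or router modules are listed.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: causal language modeling
}


def build_lora_config():
    """Create a peft.LoraConfig from the values in the table."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
```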

---

# Training Hyperparameters

| Parameter         | Value         |
| ----------------- | ------------- |
| Epochs            | 2             |
| Learning Rate     | 2e-4          |
| Optimizer         | AdamW (fused) |
| Adam β1           | 0.9           |
| Adam β2           | 0.95          |
| Weight Decay      | 0.01          |
| Warmup Ratio      | 0.03          |
| Max Gradient Norm | 1.0           |

Batch configuration:

| Parameter             | Value |
| --------------------- | ----- |
| Per Device Batch Size | 4     |
| Gradient Accumulation | 4     |
| Effective Batch Size  | 16    |

Maximum sequence length:

```
8192 tokens
```

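The effective batch size follows from the per-device batch size, the gradient accumulation steps, and the single GPU. The rest of this sketch is a rough estimate only: it assumes the rounded figure of 70k examples and the 95% train split, so the real step counts will differ somewhat.

```python
import math

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # training used a single H100

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Rough estimate, assuming the rounded 70k dataset size from this card.
approx_examples = 70_000
train_examples = approx_examples * 95 // 100
steps_per_epoch = math.ceil(train_examples / effective_batch_size)
total_steps = steps_per_epoch * 2  # 2 epochs
warmup_steps = math.ceil(total_steps * 0.03)  # warmup ratio 0.03

print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps)
```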

---

# Training Infrastructure

Training was performed on the following hardware:

```
1× NVIDIA H100 GPU
```

Training optimizations included:

* Flash Attention 2
* BF16 mixed precision
* TF32 enabled
* Gradient checkpointing
* Memory-optimized LoRA configuration

MoE compatibility adjustments included:

* LoRA applied **only to attention layers**
* CPU offloading disabled
* Gradient checkpointing configured with `use_reentrant=False`

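One way the hyperparameters and optimizations listed in this card could map onto Hugging Face `TrainingArguments` is sketched below. The `output_dir` value is a hypothetical placeholder, and settings the card does not mention (learning-rate scheduler, logging and evaluation cadence) are deliberately left at their defaults.

```python
TRAINING_KWARGS = {
    "output_dir": "outputs",  # hypothetical placeholder
    "num_train_epochs": 2,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "bf16": True,
    "tf32": True,
    "gradient_checkpointing": True,
    # Non-reentrant checkpointing, as listed under the MoE adjustments.
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
}


def build_training_args():
    """Create transformers.TrainingArguments from the values above."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

Keeping the values in a plain dictionary makes them easy to log alongside a run before the arguments object is built.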
Training frameworks used:

* PyTorch
* Hugging Face Transformers
* PEFT
* Hugging Face Datasets

---

# Evaluation

Validation was performed periodically during training using **validation loss**.

Metrics monitored:

* training loss
* validation loss

The model was trained for exactly 2 epochs over the full training set, with no automated checkpoint selection. The final validation loss after the 2 full epochs was 0.4039.

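For intuition, a validation loss can be converted to a perplexity. The sketch below assumes the reported loss is a mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

final_val_loss = 0.4039  # reported final validation loss

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(final_val_loss)
print(f"validation perplexity ~ {perplexity:.2f}")  # about 1.50
```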

|
||
|
||
---
|
||
|
||
# Intended Use
|
||
|
||
This model is intended for:
|
||
|
||
* mathematical reasoning research
|
||
* educational demonstrations
|
||
* experimentation with reasoning-focused fine-tuning
|
||
* evaluation of math-capable language models
|
||
|
||
It is **not intended for high-stakes mathematical or scientific applications**.
|
||
|
||
---
|
||
|
||
# Limitations
|
||
|
||
Despite improvements from fine-tuning, the model still has several limitations:
|
||
|
||
* The model may generate **incorrect reasoning steps**.
|
||
* Mathematical derivations may lack **formal rigor**.
|
||
* Some areas of mathematics may be **underrepresented in the dataset**.
|
||
* Performance depends strongly on the **capabilities of the base model**.
|
||
|
||
Users should treat model outputs as **assistive suggestions rather than authoritative answers**.
|
||
|
||
---
|
||
|
||

# Ethical Considerations

Language models trained for reasoning may produce **confident but incorrect explanations**.

For educational or academic use:

* outputs should be verified independently
* the model should not be treated as an authoritative mathematical source

---

# Acknowledgements

This work builds upon the open-source ecosystem including:

* Hugging Face Transformers
* the PEFT library for parameter-efficient fine-tuning
* the NuminaMath dataset
* research on Mixture-of-Experts transformer architectures

---

# Citation

Dataset:

https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

Training Notebook:

https://www.kaggle.com/code/tensorhydra/gpt-oss-20b-finetune-numinamath