---
license: mit
base_model: gpt-oss-20b
tags:
- math
- reasoning
- lora
- mixture-of-experts
- transformers
- peft
datasets:
- AI-MO/NuminaMath-TIR
---

# GPT-OSS-20B NuminaMath

## Overview

This repository provides LoRA adapters for GPT-OSS-20B, fine-tuned on the NuminaMath-TIR dataset (roughly 70k examples) to improve olympiad-style mathematical reasoning and structured problem solving.

The adapters are designed to be used with the base model gpt-oss-20b, a Mixture-of-Experts (MoE) transformer. Fine-tuning focuses on the model’s ability to produce step-by-step reasoning, symbolic manipulation, and detailed explanations when solving math problems.

Instead of updating the full model weights, parameter-efficient fine-tuning (PEFT) was used to train only a small number of additional parameters attached to the attention layers. This is intended to improve reasoning ability while keeping training compute requirements relatively low.

The resulting LoRA adapters can be loaded on top of the base model to enhance its performance on olympiad-style mathematical reasoning, including algebra, arithmetic, and worked problem-solving explanations.
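As a rough illustration of loading the adapters on top of the base model, the sketch below uses the standard Transformers and PEFT loading calls. The Hub id `openai/gpt-oss-20b` and the adapter id are assumptions (the adapter id is a placeholder, not the actual repository name), and `load_adapted_model` is a hypothetical helper name.

```python
BASE_MODEL_ID = "openai/gpt-oss-20b"  # assumed Hub id of the base model
ADAPTER_ID = "your-username/gpt-oss-20b-numinamath"  # placeholder, replace with the real adapter repo


def load_adapted_model(base_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model in BF16 and attach the LoRA adapters to it."""
    # Heavy dependencies are imported lazily so this file can be imported
    # on a machine without the GPU stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,  # matches the BF16 precision listed in this card
        device_map="auto",
    )
    # Wrap the base model with the LoRA weights; base weights stay unchanged.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `load_adapted_model()` downloads the base weights and needs enough GPU memory for a 20B-parameter model, so it is best tried on hardware comparable to the training setup.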

---

# Model Details

| Field              | Value                          |
| ------------------ | ------------------------------ |
| Base Model         | gpt-oss-20b                    |
| Architecture       | Mixture-of-Experts Transformer |
| Fine-Tuning Method | LoRA (PEFT)                    |
| Precision          | BF16                           |
| Context Length     | 8192 tokens                    |
| Training Hardware  | NVIDIA H100                    |
| Framework          | PyTorch + Transformers + PEFT  |

---

# Training Data

## Dataset

The model was fine-tuned using the **NuminaMath-TIR dataset**, which contains mathematical problems paired with structured reasoning traces and final answers.

Dataset link:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

The dataset covers several mathematical domains:

* arithmetic
* algebra
* number theory
* geometry
* calculus
* reasoning-based problem solving

The dataset emphasizes **step-by-step explanations**, allowing the model to learn how to produce reasoning chains rather than only final answers.

---

## Dataset Processing

The dataset was originally provided as a **CSV file** and processed prior to training.

Processing pipeline:

1. Loaded using **pandas**
2. Columns normalized to:
   * `prompt`
   * `response`
3. Empty rows removed
4. Converted to **Hugging Face Dataset format**
5. Randomized train/validation split

Dataset split:

| Split      | Percentage |
| ---------- | ---------- |
| Train      | 95%        |
| Validation | 5%         |
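The processing steps above could be sketched roughly as follows. This is a minimal sketch, not the original script: the source column names `problem` and `solution`, the seed value, and the helper names `clean_frame` and `build_splits` are all assumptions.

```python
# Assumed names of the raw CSV columns; adjust to the actual file.
COLUMN_MAP = {"problem": "prompt", "solution": "response"}


def clean_frame(df):
    """Normalize columns to prompt/response and drop empty rows.

    Expects a pandas DataFrame and returns a new, re-indexed DataFrame.
    """
    df = df.rename(columns=COLUMN_MAP)[["prompt", "response"]]
    df = df.dropna()  # rows with a missing prompt or response
    # Also drop rows whose text is only whitespace.
    nonempty = (df["prompt"].astype(str).str.strip() != "") & (
        df["response"].astype(str).str.strip() != ""
    )
    return df[nonempty].reset_index(drop=True)


def build_splits(csv_path, val_fraction=0.05, seed=42):
    """Load the CSV, clean it, and return a shuffled (train, validation) pair."""
    import pandas as pd
    from datasets import Dataset

    df = clean_frame(pd.read_csv(csv_path))
    dataset = Dataset.from_pandas(df, preserve_index=False)
    # train_test_split shuffles by default; the seed (an assumption) makes it reproducible.
    split = dataset.train_test_split(test_size=val_fraction, seed=seed)
    return split["train"], split["test"]
```

With `val_fraction=0.05`, the second returned dataset holds about 5% of the cleaned rows, matching the split table above.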

---

# Instruction Format

Training samples were converted into the following **chat-style instruction format** compatible with the GPT-OSS tokenizer.

```
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
{response}
<|im_end|>
```

This ChatML-style format enables the model to learn **structured conversational reasoning** and matches the turn-based layout used by many instruction-tuned LLMs.
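A small helper along these lines could render one `prompt`/`response` pair into the template shown above. The function name `format_example`, the whitespace stripping, and the trailing newline after the final `<|im_end|>` are assumptions made for illustration.

```python
CHAT_TEMPLATE = (
    "<|im_start|>user\n"
    "{prompt}\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{response}\n"
    "<|im_end|>\n"
)


def format_example(prompt: str, response: str) -> str:
    """Render one prompt/response pair in the chat-style training format."""
    # Braces inside the values (common in LaTeX) are safe here: str.format
    # only interprets braces that appear in the template itself.
    return CHAT_TEMPLATE.format(prompt=prompt.strip(), response=response.strip())


if __name__ == "__main__":
    print(format_example("What is 2 + 2?", "2 + 2 = 4"))
```

Applied to every row of the processed dataset, this yields a single text field per sample that can then be tokenized up to the maximum sequence length.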

---

# Training Procedure

The model was fine-tuned using **LoRA adapters applied only to attention layers**.

Because **gpt-oss-20b is a Mixture-of-Experts (MoE) architecture**, LoRA was intentionally **not applied to expert layers** in order to preserve the routing structure and maintain training stability.

---

## LoRA Target Modules

Adapters were applied to the following projection layers:

```
q_proj
k_proj
v_proj
o_proj
```

These correspond to the **query, key, value, and output projections** within the attention mechanism.

---

## LoRA Configuration

| Parameter | Value |
| --------- | ----- |
| Rank (r)  | 64    |
| Alpha     | 128   |
| Dropout   | 0.05  |
| Bias      | none  |

Only attention projections were modified, which leaves the expert layers of the base model untouched while still allowing the adapters to change model behavior.
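The target modules and configuration values above map onto a PEFT `LoraConfig` roughly as sketched below. The `task_type` value is an assumption (it is not stated in this card), and `LORA_KWARGS`, `lora_scaling`, and `make_lora_config` are illustrative names.

```python
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "bias": "none",
    # Attention projections only; expert layers are deliberately left out.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not listed in the tables above
}


def lora_scaling(kwargs=LORA_KWARGS):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return kwargs["lora_alpha"] / kwargs["r"]


def make_lora_config():
    """Build the PEFT config object (requires the peft package)."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

With rank 64 and alpha 128, `lora_scaling()` evaluates to 2.0, so the learned low-rank update is doubled before being added to each frozen projection.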

---

# Training Hyperparameters

| Parameter         | Value         |
| ----------------- | ------------- |
| Epochs            | 2             |
| Learning Rate     | 2e-4          |
| Optimizer         | AdamW (fused) |
| Adam β1           | 0.9           |
| Adam β2           | 0.95          |
| Weight Decay      | 0.01          |
| Warmup Ratio      | 0.03          |
| Max Gradient Norm | 1.0           |

Batch configuration:

| Parameter             | Value |
| --------------------- | ----- |
| Per Device Batch Size | 4     |
| Gradient Accumulation | 4     |
| Effective Batch Size  | 16    |

Maximum sequence length:

```
8192 tokens
```
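To make the batch arithmetic concrete, the sketch below collects the values from the tables as keyword arguments and estimates the schedule length. The argument names follow `TrainingArguments` in Transformers 4.x; the 70,000-example figure comes from the overview and is approximate, and the exact step count reported by a trainer may differ by rounding. `effective_batch_size` and `estimate_schedule` are hypothetical helpers.

```python
import math

TRAINING_KWARGS = {
    "num_train_epochs": 2,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
}


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Samples contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


def estimate_schedule(num_examples, train_percent=95, kwargs=TRAINING_KWARGS):
    """Approximate optimizer steps per epoch, total steps, and warmup steps."""
    batch = effective_batch_size(
        kwargs["per_device_train_batch_size"], kwargs["gradient_accumulation_steps"]
    )
    train_examples = num_examples * train_percent // 100  # integer math avoids float drift
    steps_per_epoch = math.ceil(train_examples / batch)
    total_steps = steps_per_epoch * kwargs["num_train_epochs"]
    warmup_steps = math.ceil(total_steps * kwargs["warmup_ratio"])
    return steps_per_epoch, total_steps, warmup_steps


print(effective_batch_size(4, 4))  # 16
print(estimate_schedule(70_000))   # (4157, 8314, 250)
```

On a single GPU, 4 samples per device times 4 accumulation steps gives the effective batch size of 16 listed above. Under the 70k assumption, that works out to roughly 4,157 optimizer steps per epoch and about 250 warmup steps.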

---

# Training Infrastructure

Training was performed on the following hardware:

```
1× NVIDIA H100 GPU
```

Training optimizations included:

* Flash Attention 2
* BF16 mixed precision
* TF32 enabled
* Gradient checkpointing
* Memory-optimized LoRA configuration

MoE compatibility adjustments included:

* LoRA applied **only to attention layers**
* CPU offloading disabled
* Gradient checkpointing configured with `use_reentrant=False`

Training frameworks used:

* PyTorch
* Hugging Face Transformers
* PEFT
* Hugging Face Datasets
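Several of the optimizations listed above correspond to standard PyTorch and Transformers switches. The sketch below shows one plausible way to set them; it is not the original training script. Selecting Flash Attention 2 happens at model load time and is not shown. `PRECISION_AND_MEMORY_KWARGS` and `enable_memory_optimizations` are illustrative names.

```python
# Flags that could be passed to TrainingArguments alongside the hyperparameters.
PRECISION_AND_MEMORY_KWARGS = {
    "bf16": True,  # BF16 mixed precision
    "tf32": True,  # allow TF32 matmuls on supported NVIDIA GPUs
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
}


def enable_memory_optimizations(model):
    """Apply the same switches by hand to an already loaded model."""
    import torch

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    # Non-reentrant checkpointing, as listed under the MoE adjustments.
    model.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": False}
    )
    # The KV cache is not useful during training and conflicts with checkpointing.
    model.config.use_cache = False
    return model
```

Either route should be enough on its own: passing the flags to the trainer, or calling the helper before a custom training loop.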

---

# Evaluation

Validation was performed periodically during training using **validation loss**.

Metrics monitored:

* training loss
* validation loss

The model was trained for exactly 2 epochs on the entire dataset, without automated checkpoint selection. The final validation loss after both epochs was 0.4039.

![Training Loss](loss_plot.png)
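For readers who prefer perplexity, the reported validation loss can be converted with a one-line calculation. This assumes the loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly here.

```python
import math

FINAL_VAL_LOSS = 0.4039  # final validation loss reported above


def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


print(round(perplexity(FINAL_VAL_LOSS), 3))  # 1.498
```

Under that assumption, a loss of 0.4039 corresponds to a validation perplexity of about 1.50 on the chat-formatted samples.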

---

# Intended Use

This model is intended for:

* mathematical reasoning research
* educational demonstrations
* experimentation with reasoning-focused fine-tuning
* evaluation of math-capable language models

It is **not intended for high-stakes mathematical or scientific applications**.

---

# Limitations

Despite improvements from fine-tuning, the model still has several limitations:

* The model may generate **incorrect reasoning steps**.
* Mathematical derivations may lack **formal rigor**.
* Some areas of mathematics may be **underrepresented in the dataset**.
* Performance depends strongly on the **capabilities of the base model**.

Users should treat model outputs as **assistive suggestions rather than authoritative answers**.

---

# Ethical Considerations

Language models trained for reasoning may produce **confident but incorrect explanations**.

For educational or academic use:

* outputs should be verified independently
* the model should not be treated as an authoritative mathematical source

---

# Acknowledgements

This work builds upon the open-source ecosystem including:

* Hugging Face Transformers
* the PEFT library for parameter-efficient fine-tuning
* the NuminaMath-TIR dataset
* research on Mixture-of-Experts transformer architectures

---

# Citation

Dataset:

https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

Training Notebook:

https://www.kaggle.com/code/tensorhydra/gpt-oss-20b-finetune-numinamath