Initialize project; model provided by the ModelHub XC community
Model: tensorhydra/gpt-oss-20b-numinamath Source: Original Platform
272
README.md
Normal file
@@ -0,0 +1,272 @@
---
license: mit
base_model: gpt-oss-20b
tags:
- math
- reasoning
- lora
- mixture-of-experts
- transformers
- peft
datasets:
- AI-MO/NuminaMath-TIR
---

# GPT-OSS-20B NuminaMath

## Overview

This repository provides a GPT-OSS-20B model fine-tuned on the NuminaMath-TIR dataset, which consists of 70k data points, to improve mathematical olympiad reasoning and structured problem solving.

The LoRA adapters are designed to be used with the base model gpt-oss-20b, a Mixture-of-Experts (MoE) transformer architecture. Fine-tuning focuses on improving the model's ability to generate step-by-step reasoning, symbolic manipulation, and detailed mathematical explanations when solving math problems.

Instead of updating the full model weights, parameter-efficient fine-tuning (PEFT) was used to train only a small number of added parameters in the attention layers. This lets the adapters improve reasoning ability while keeping training compute requirements relatively low.

The resulting LoRA adapters can be loaded on top of the base model to enhance its performance on mathematical olympiad reasoning tasks such as algebra, arithmetic, and problem-solving explanations.
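
A minimal sketch of how the adapters might be loaded. It assumes that the base weights are available on the Hugging Face Hub as `openai/gpt-oss-20b` and that this repository (`tensorhydra/gpt-oss-20b-numinamath`) holds PEFT-format LoRA adapters; neither is confirmed by this README, and the helper name `load_with_adapters` is hypothetical.

```python
def load_with_adapters(
    base_id: str = "openai/gpt-oss-20b",  # assumed Hub ID of the base model
    adapter_id: str = "tensorhydra/gpt-oss-20b-numinamath",  # assumed adapter repo
):
    """Load the base model in BF16 and attach the LoRA adapters on top of it."""
    # Heavy dependencies are imported here so that defining the helper stays cheap.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,  # BF16, matching the training precision
        device_map="auto",  # needs the `accelerate` package installed
    )
    # Wrap the base model with the LoRA weights stored in the adapter repo.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `tokenizer, model = load_with_adapters()` downloads the full base model, so it needs enough GPU memory for a 20B-parameter model in BF16.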

---

# Model Details

| Field              | Value                          |
| ------------------ | ------------------------------ |
| Base Model         | gpt-oss-20b                    |
| Architecture       | Mixture-of-Experts Transformer |
| Fine-Tuning Method | LoRA (PEFT)                    |
| Precision          | BF16                           |
| Context Length     | 8192 tokens                    |
| Training Hardware  | NVIDIA H100                    |
| Framework          | PyTorch + Transformers + PEFT  |

---

# Training Data

## Dataset

The model was fine-tuned using the **NuminaMath-TIR dataset**, which contains mathematical problems paired with structured reasoning traces and final answers.

Dataset link:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

The dataset includes problems across several mathematical domains, including:

* arithmetic
* algebra
* number theory
* geometry
* calculus
* reasoning-based problem solving

The dataset emphasizes **step-by-step explanations**, allowing the model to learn how to produce reasoning chains rather than only final answers.

---

## Dataset Processing

The dataset was originally provided as a **CSV file** and processed prior to training.

Processing pipeline:

1. Loaded using **pandas**
2. Columns normalized to:
   * `prompt`
   * `response`
3. Empty rows removed
4. Converted to **Hugging Face Dataset format**
5. Randomized train/validation split

Dataset split:

| Split      | Percentage |
| ---------- | ---------- |
| Train      | 95%        |
| Validation | 5%         |
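
The processing steps above can be sketched roughly as follows. The original pipeline used pandas and Hugging Face Datasets; this sketch swaps in Python's standard `csv` and `random` modules so that it runs without extra dependencies. The source column names `problem` and `solution`, the seed, and the function names are assumptions, not details taken from the training notebook.

```python
import csv
import io
import random


def load_pairs(csv_text: str) -> list[dict[str, str]]:
    """Read CSV text, rename columns to prompt/response, and drop empty rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        prompt = (row.get("problem") or "").strip()  # assumed source column
        response = (row.get("solution") or "").strip()  # assumed source column
        if prompt and response:
            rows.append({"prompt": prompt, "response": response})
    return rows


def train_val_split(rows, val_fraction=0.05, seed=42):
    """Shuffle with a fixed seed, then hold out `val_fraction` for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    # 100 toy rows plus one empty row that the loader should discard.
    sample = "problem,solution\n" + "".join(f"q{i},a{i}\n" for i in range(100)) + ",\n"
    pairs = load_pairs(sample)
    train, val = train_val_split(pairs)
    print(len(pairs), len(train), len(val))  # prints: 100 95 5
```

With pandas and Datasets installed, the same steps would typically map to `pandas.read_csv`, `DataFrame.rename`, `DataFrame.dropna`, `datasets.Dataset.from_pandas`, and `Dataset.train_test_split(test_size=0.05)`.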

---

# Instruction Format

Training samples were converted into the following **chat-style instruction format** compatible with the GPT-OSS tokenizer.

```
<|im_start|>user
{prompt}
<|im_end|>
<|im_start|>assistant
{response}
<|im_end|>
```

This format enables the model to learn **structured conversational reasoning** and aligns with the instruction format used in many modern LLMs.
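
As an illustration, a prompt/response pair could be rendered into this template with a small helper like the one below. The names `CHAT_TEMPLATE` and `format_example`, the whitespace stripping, and the trailing newline are assumptions; the training notebook may handle whitespace differently.

```python
# Template mirroring the chat-style format shown above.
CHAT_TEMPLATE = (
    "<|im_start|>user\n"
    "{prompt}\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{response}\n"
    "<|im_end|>\n"
)


def format_example(example: dict[str, str]) -> str:
    """Render one {'prompt': ..., 'response': ...} record as a training string."""
    return CHAT_TEMPLATE.format(
        prompt=example["prompt"].strip(),
        response=example["response"].strip(),
    )


if __name__ == "__main__":
    print(format_example({"prompt": "What is 2 + 2?", "response": "2 + 2 = 4"}))
```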

---

# Training Procedure

The model was fine-tuned using **LoRA adapters applied only to attention layers**.

Because **gpt-oss-20b is a Mixture-of-Experts (MoE) architecture**, LoRA was intentionally **not applied to expert layers** in order to preserve the routing structure and maintain training stability.

---

## LoRA Target Modules

Adapters were applied to the following projection layers:

```
q_proj
k_proj
v_proj
o_proj
```

These correspond to the **query, key, value, and output projections** within the attention mechanism.

---

## LoRA Configuration

| Parameter | Value |
| --------- | ----- |
| Rank (r)  | 64    |
| Alpha     | 128   |
| Dropout   | 0.05  |
| Bias      | none  |

Only attention projections were modified, ensuring minimal disruption to the base model while still enabling meaningful behavioral improvements.
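
For reference, the settings above could be expressed as a PEFT `LoraConfig` along these lines. This is a sketch rather than the exact configuration from the training notebook: `task_type` is an assumption, and the names `LORA_KWARGS`, `LORA_SCALING`, and `build_lora_config` are hypothetical.

```python
# Values from the LoRA tables above, keyed by peft.LoraConfig argument names.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in this README
}

# Standard LoRA scales the low-rank update by alpha / r, here 128 / 64 = 2.0.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Create the PEFT config; peft is imported lazily because it is optional here."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

The resulting object would normally be passed to `peft.get_peft_model(base_model, build_lora_config())`, which leaves the expert layers untouched because they are not listed in `target_modules`.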

---

# Training Hyperparameters

| Parameter         | Value         |
| ----------------- | ------------- |
| Epochs            | 2             |
| Learning Rate     | 2e-4          |
| Optimizer         | AdamW (fused) |
| Adam β1           | 0.9           |
| Adam β2           | 0.95          |
| Weight Decay      | 0.01          |
| Warmup Ratio      | 0.03          |
| Max Gradient Norm | 1.0           |

Batch configuration:

| Parameter             | Value |
| --------------------- | ----- |
| Per Device Batch Size | 4     |
| Gradient Accumulation | 4     |
| Effective Batch Size  | 16    |

Maximum sequence length:

```
8192 tokens
```
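
The tables above map fairly directly onto `transformers.TrainingArguments`. The sketch below assumes that mapping; `output_dir`, the helper names, and the single-device default are illustrative assumptions, and the sequence length is shown as a separate constant because truncation happens at tokenization time rather than in `TrainingArguments`.

```python
# Values from the hyperparameter tables, keyed by TrainingArguments argument names.
TRAINING_KWARGS = {
    "num_train_epochs": 2,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
}

MAX_SEQ_LEN = 8192  # applied when tokenizing, e.g. truncation to this many tokens


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Sequences per optimizer step: per-device batch x accumulation x devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir: str = "outputs"):  # output_dir is hypothetical
    """Create TrainingArguments; transformers is imported lazily."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


if __name__ == "__main__":
    print(effective_batch_size(TRAINING_KWARGS))  # prints: 16
```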

---

# Training Infrastructure

Training was performed on the following hardware:

```
1× NVIDIA H100 GPU
```

Training optimizations included:

* Flash Attention 2
* BF16 mixed precision
* TF32 enabled
* Gradient checkpointing
* Memory-optimized LoRA configuration

MoE compatibility adjustments included:

* LoRA applied **only to attention layers**
* CPU offloading disabled
* Gradient checkpointing configured with `use_reentrant=False`
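
In Transformers terms, these optimizations are usually switched on through keyword arguments like the ones below. This is a hedged sketch: the dictionary names are hypothetical, and whether the `flash_attention_2` backend is available for the base model depends on the installed library versions.

```python
# Extra keyword arguments for transformers.TrainingArguments.
PRECISION_AND_MEMORY_KWARGS = {
    "bf16": True,  # BF16 mixed precision
    "tf32": True,  # TF32 matmuls; requires an Ampere-or-newer GPU such as the H100
    "gradient_checkpointing": True,
    # Non-reentrant checkpointing, as listed in the MoE adjustments above.
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
}

# Keyword arguments for AutoModelForCausalLM.from_pretrained.
MODEL_LOAD_KWARGS = {
    "attn_implementation": "flash_attention_2",  # assumed to match "Flash Attention 2"
    # No offload-related arguments are passed, so weights stay on the GPU.
}
```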

Training frameworks used:

* PyTorch
* Hugging Face Transformers
* PEFT
* Hugging Face Datasets

---

# Evaluation

Validation was performed periodically during training using **validation loss**.

Metrics monitored:

* training loss
* validation loss

The model was trained for exactly 2 epochs on the entire dataset, without automated checkpoint selection. The final validation loss after the 2 full epochs was 0.4039.



---

# Intended Use

This model is intended for:

* mathematical reasoning research
* educational demonstrations
* experimentation with reasoning-focused fine-tuning
* evaluation of math-capable language models

It is **not intended for high-stakes mathematical or scientific applications**.

---

# Limitations

Despite improvements from fine-tuning, the model still has several limitations:

* The model may generate **incorrect reasoning steps**.
* Mathematical derivations may lack **formal rigor**.
* Some areas of mathematics may be **underrepresented in the dataset**.
* Performance depends strongly on the **capabilities of the base model**.

Users should treat model outputs as **assistive suggestions rather than authoritative answers**.

---

# Ethical Considerations

Language models trained for reasoning may produce **confident but incorrect explanations**.

For educational or academic use:

* outputs should be verified independently
* the model should not be treated as an authoritative mathematical source

---

# Acknowledgements

This work builds upon the open-source ecosystem including:

* Hugging Face Transformers
* the PEFT library for parameter-efficient fine-tuning
* the NuminaMath dataset
* research on Mixture-of-Experts transformer architectures

---

# Citation

Dataset:

https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

Training Notebook:

https://www.kaggle.com/code/tensorhydra/gpt-oss-20b-finetune-numinamath