Initialize project; model provided by the ModelHub XC community

Model: tensorhydra/gpt-oss-20b-numinamath Source: Original Platform
2026-04-20 11:13:05 +08:00
commit d0dfd3d9ea
20 changed files with 1637 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,272 @@
+---
+license: mit
+base_model: gpt-oss-20b
+tags:
+- math
+- reasoning
+- lora
+- mixture-of-experts
+- transformers
+- peft
+datasets:
+- AI-MO/NuminaMath-TIR
+---
+
+# GPT-OSS-20B NuminaMath
+
+## Overview
+
+This repository provides a GPT-OSS-20B model fine-tuned on the NuminaMath-TIR dataset, which consists of 70k data points, to improve olympiad-style mathematical reasoning and structured problem solving.
+
+The adapters are designed to be used with the base model gpt-oss-20b, a Mixture-of-Experts (MoE) transformer architecture. Fine-tuning focuses on improving the model’s ability to generate step-by-step reasoning, symbolic manipulation, and detailed mathematical explanations when solving math problems.
+
+Instead of updating the full model weights, parameter-efficient fine-tuning (PEFT) was used to train only a small number of low-rank adapter parameters attached to the attention layers. This lets the adapters improve reasoning ability while keeping training compute requirements relatively low.
+
+The resulting LoRA adapters can be loaded on top of the base model to enhance its performance on mathematical olympiad reasoning tasks such as algebra, arithmetic, and problem-solving explanations.
+
+---
+
+# Model Details
+
+| Field              | Value                          |
+| ------------------ | ------------------------------ |
+| Base Model         | gpt-oss-20b                    |
+| Architecture       | Mixture-of-Experts Transformer |
+| Fine-Tuning Method | LoRA (PEFT)                    |
+| Precision          | BF16                           |
+| Context Length     | 8192 tokens                    |
+| Training Hardware  | NVIDIA H100                    |
+| Framework          | PyTorch + Transformers + PEFT  |
+
+---
+
+# Training Data
+
+## Dataset
+
+The model was fine-tuned using the **NuminaMath-TIR dataset**, which contains mathematical problems paired with structured reasoning traces and final answers.
+
+Dataset link:
+https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
+
+The dataset includes problems across several mathematical domains including:
+
+* arithmetic
+* algebra
+* number theory
+* geometry
+* calculus
+* reasoning-based problem solving
+
+The dataset emphasizes **step-by-step explanations**, allowing the model to learn how to produce reasoning chains rather than only final answers.
+
+---
+
+## Dataset Processing
+
+The dataset was originally provided as a **CSV file** and processed prior to training.
+
+Processing pipeline:
+
+1. Loaded using **pandas**
+2. Columns normalized to:
+   * `prompt`
+   * `response`
+3. Empty rows removed
+4. Converted to **Hugging Face Dataset format**
+5. Randomized train/validation split
+
+Dataset split:
+
+| Split      | Percentage |
+| ---------- | ---------- |
+| Train      | 95%        |
+| Validation | 5%         |
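+
+The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: it uses Python's standard `csv` and `random` modules in place of pandas and Hugging Face Datasets so that it runs without extra dependencies. The source column names (`problem`, `solution`), the seed, and the helper name `load_and_split` are assumptions.
+
```python
import csv
import io
import random


def load_and_split(csv_text, prompt_col, response_col, val_fraction=0.05, seed=42):
    """Normalize columns to prompt/response, drop empty rows, shuffle, and split.

    The column names passed in and the seed are illustrative assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        prompt = (record.get(prompt_col) or "").strip()
        response = (record.get(response_col) or "").strip()
        if prompt and response:  # step 3: drop rows with a missing field
            rows.append({"prompt": prompt, "response": response})

    random.Random(seed).shuffle(rows)  # step 5: randomized split
    n_val = max(1, round(len(rows) * val_fraction))
    return rows[n_val:], rows[:n_val]  # (train, validation)


# Tiny synthetic CSV: 100 valid rows plus two rows that should be dropped.
lines = ["problem,solution"]
lines += [f"What is {i} + {i}?,{2 * i}" for i in range(100)]
lines += [",", "Unanswered question,"]

train_rows, val_rows = load_and_split("\n".join(lines), "problem", "solution")
print(len(train_rows), len(val_rows))  # prints: 95 5
```
+
+With the real dataset, the two returned lists would then be converted to the Hugging Face Dataset format mentioned in step 4.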
+
+---
+
+# Instruction Format
+
+Training samples were converted into the following **chat-style instruction format** compatible with the GPT-OSS tokenizer.
+
+```
+<|im_start|>user
+{prompt}
+<|im_end|>
+<|im_start|>assistant
+{response}
+<|im_end|>
+```
+
+This format enables the model to learn **structured conversational reasoning** and follows the ChatML-style layout used by many modern LLMs.
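+
+As a minimal sketch, a prompt/response pair could be rendered into the template above like this. The helper name `format_example` is hypothetical, and whether the original pipeline trimmed whitespace or appended a trailing newline is not stated, so neither is done here.
+
```python
def format_example(prompt: str, response: str) -> str:
    """Render one training sample in the chat-style template shown above."""
    return "\n".join(
        [
            "<|im_start|>user",
            prompt,
            "<|im_end|>",
            "<|im_start|>assistant",
            response,
            "<|im_end|>",
        ]
    )


sample_text = format_example("What is 2 + 3?", "2 + 3 = 5, so the answer is 5.")
print(sample_text)
```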
+
+---
+
+# Training Procedure
+
+The model was fine-tuned using **LoRA adapters applied only to attention layers**.
+
+Because **gpt-oss-20b is a Mixture-of-Experts (MoE) architecture**, LoRA was intentionally **not applied to expert layers** in order to preserve the routing structure and maintain training stability.
+
+---
+
+## LoRA Target Modules
+
+Adapters were applied to the following projection layers:
+
+```
+q_proj
+k_proj
+v_proj
+o_proj
+```
+
+These correspond to the **query, key, value, and output projections** within the attention mechanism.
+
+---
+
+## LoRA Configuration
+
+| Parameter | Value |
+| --------- | ----- |
+| Rank (r)  | 64    |
+| Alpha     | 128   |
+| Dropout   | 0.05  |
+| Bias      | none  |
+
+Only attention projections were modified, ensuring minimal disruption to the base model while still enabling meaningful behavioral improvements.
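+
+To make the configuration concrete, the sketch below collects the values from the table as keyword arguments (the key names match the parameters of `peft.LoraConfig`, to which they could presumably be passed) and works out two quantities that follow from them: the LoRA scaling factor `lora_alpha / r`, and the number of trainable parameters added per adapted projection. The 4096 x 4096 projection shape is a hypothetical example, not the actual gpt-oss-20b dimensions.
+
```python
# Values from the LoRA configuration table and target-module list above.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer.

    LoRA adds A (r x d_in) and B (d_out x r), so r * (d_in + d_out) in total.
    """
    return r * (d_in + d_out)


# Hypothetical projection shape, for illustration only.
d_in, d_out = 4096, 4096
added = lora_param_count(d_in, d_out, lora_kwargs["r"])
full = d_in * d_out

print(f"scaling = {scaling}")
print(f"adapter params per projection = {added:,} ({added / full:.2%} of the full matrix)")
```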
+
+---
+
+# Training Hyperparameters
+
+| Parameter         | Value         |
+| ----------------- | ------------- |
+| Epochs            | 2             |
+| Learning Rate     | 2e-4          |
+| Optimizer         | AdamW (fused) |
+| Adam β1           | 0.9           |
+| Adam β2           | 0.95          |
+| Weight Decay      | 0.01          |
+| Warmup Ratio      | 0.03          |
+| Max Gradient Norm | 1.0           |
+
+Batch configuration:
+
+| Parameter             | Value |
+| --------------------- | ----- |
+| Per Device Batch Size | 4     |
+| Gradient Accumulation | 4     |
+| Effective Batch Size  | 16    |
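+
+The effective batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs (one, per the Training Infrastructure section). The short calculation below also gives a rough estimate of optimizer steps and warmup steps. It assumes the approximate 70k dataset size from the Overview and the 95% train split; the exact counts in the real run may differ, since trainers round these values in slightly different ways.
+
```python
import math

per_device_batch = 4
grad_accum_steps = 4
num_gpus = 1  # single H100, per the Training Infrastructure section

effective_batch = per_device_batch * grad_accum_steps * num_gpus

dataset_size = 70_000  # approximate figure from the Overview
train_examples = dataset_size * 95 // 100  # 95% train split
epochs = 2
warmup_ratio = 0.03

# Rough estimates; real trainers may round differently.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch (estimate): {steps_per_epoch}")
print(f"total optimizer steps (estimate): {total_steps}")
print(f"warmup steps (estimate): {warmup_steps}")
```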
+
+Maximum sequence length:
+
+```
+8192 tokens
+```
+
+---
+
+# Training Infrastructure
+
+Training was performed on the following hardware:
+
+```
+1× NVIDIA H100 GPU
+```
+
+Training optimizations included:
+
+* Flash Attention 2
+* BF16 mixed precision
+* TF32 enabled
+* Gradient checkpointing
+* Memory-optimized LoRA configuration
+
+MoE compatibility adjustments included:
+
+* LoRA applied **only to attention layers**
+* CPU offloading disabled
+* gradient checkpointing configured with `use_reentrant=False`
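+
+As a hedged illustration, the hyperparameters and optimizations listed above map onto keyword arguments of `transformers.TrainingArguments` roughly as follows. This is a sketch assembled from the tables in this README, not the original training script; other settings, such as the output directory and the evaluation and logging schedule, are not documented here and are left out.
+
```python
# Keyword arguments that could be passed to transformers.TrainingArguments.
# Values come from the tables and lists in this README; the mapping itself
# is an assumption about how the run was configured.
training_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",  # AdamW (fused)
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "max_grad_norm": 1.0,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "bf16": True,  # BF16 mixed precision
    "tf32": True,  # TF32 enabled
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
}

# Flash Attention 2 is normally requested when the model is loaded,
# for example via from_pretrained(..., attn_implementation=...).
model_load_kwargs = {"attn_implementation": "flash_attention_2"}

for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```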
+
+Training frameworks used:
+
+* PyTorch
+* Hugging Face Transformers
+* PEFT
+* Hugging Face Datasets
+
+---
+
+# Evaluation
+
+Validation was performed periodically during training using **validation loss**.
+
+Metrics monitored:
+
+* training loss
+* validation loss
+
+The model was trained for exactly 2 epochs on the entire dataset, without automated checkpoint selection. The final validation loss after the 2 full epochs is 0.4039.
+
+![Training Loss](loss_plot.png)
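+
+For intuition, a validation loss can be converted to a perplexity by exponentiating it. The sketch below assumes the reported 0.4039 is a mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly here.
+
```python
import math

final_val_loss = 0.4039  # reported final validation loss

# Perplexity = exp(mean per-token cross-entropy), assuming the loss is in nats.
perplexity = math.exp(final_val_loss)

print(f"validation perplexity (under that assumption): {perplexity:.3f}")
```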
+
+---
+
+# Intended Use
+
+This model is intended for:
+
+* mathematical reasoning research
+* educational demonstrations
+* experimentation with reasoning-focused fine-tuning
+* evaluation of math-capable language models
+
+It is **not intended for high-stakes mathematical or scientific applications**.
+
+---
+
+# Limitations
+
+Despite improvements from fine-tuning, the model still has several limitations:
+
+* The model may generate **incorrect reasoning steps**.
+* Mathematical derivations may lack **formal rigor**.
+* Some areas of mathematics may be **underrepresented in the dataset**.
+* Performance depends strongly on the **capabilities of the base model**.
+
+Users should treat model outputs as **assistive suggestions rather than authoritative answers**.
+
+---
+
+# Ethical Considerations
+
+Language models trained for reasoning may produce **confident but incorrect explanations**.
+
+For educational or academic use:
+
+* outputs should be verified independently
+* the model should not be treated as an authoritative mathematical source
+
+---
+
+# Acknowledgements
+
+This work builds upon the open-source ecosystem including:
+
+* Hugging Face Transformers
+* the PEFT library for parameter-efficient fine-tuning
+* the NuminaMath dataset
+* research on Mixture-of-Experts transformer architectures
+
+---
+
+# Citation
+
+Dataset:
+
+https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
+
+Training Notebook:
+
+https://www.kaggle.com/code/tensorhydra/gpt-oss-20b-finetune-numinamath