---
license: apache-2.0
datasets:
- AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations
language:
- en
pipeline_tag: text-generation
tags:
- mixture-of-experts
- moe
- expert-pruning
- gpt-oss
- openai
- reasoning
- math
- specialized
- efficient
- transformer
- causal-lm
- text-generation
- pytorch
- pruned-model
- domain-specific
---

# Math GPT-OSS Model (20 Experts)

**Project**: https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/

<div align="center">

### 👥 Follow the Authors

**Aman Priyanshu**
[LinkedIn](https://www.linkedin.com/in/aman-priyanshu/)
[X](https://x.com/AmanPriyanshu6)
[Website](https://amanpriyanshu.github.io/)

**Supriti Vijay**
[LinkedIn](https://www.linkedin.com/in/supriti-vijay/)
[X](https://x.com/SupritiVijay)
[Website](https://supritivijay.github.io/)

</div>

## Introduction

This is a pruned variant of OpenAI's GPT-OSS-20B model, reduced from 32 to 20 experts per layer based on activation patterns from the [AmanPriyanshu/GPT-OSS-20B MoE Expert Activations dataset](https://huggingface.co/datasets/AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations). We analyzed router decisions across evaluation benchmarks to identify and retain the experts most relevant to math tasks.

**⚠️ Experimental Model**: This is an experimental pruned model and may not work well. Check the [examples below](#model-examples) to see whether the outputs meet your needs before use.

This pruning approach reduces the model size while attempting to preserve performance on the target domain.

## Model Architecture & Statistics

| Metric | Value |
|--------|-------|
| **Base Model** | openai/gpt-oss-20b |
| **Architecture** | Mixture-of-Experts Transformer |
| **Total Parameters** | ~13.7B (pruned from 21B) |
| **Original Experts per Layer** | 32 |
| **Pruned Experts per Layer** | 20 |
| **Layers** | 24 |
| **Top-k Routing** | 4 |
| **Context Length** | 128K tokens |
| **Attention Heads** | 64 (Query), 8 (Key-Value) |
| **Residual Dimension** | 2880 |
| **Attention Pattern** | Alternating dense & sliding window (128 tokens) |
| **Positional Encoding** | RoPE (Rotary Position Embedding) |
| **Normalization** | RMSNorm |
| **Precision** | BF16 |
| **License** | Apache 2.0 |
| **Specialization** | Math |

## Pruning Methodology

### What is Expert Pruning?
Mixture-of-Experts models contain multiple specialized sub-networks (experts) per layer. During inference, only a subset of experts is activated for each token. Expert pruning involves:

1. **Analyzing Usage Patterns**: Tracking which experts activate most frequently for specific tasks
2. **Removing Underutilized Experts**: Discarding experts with low activation rates for the target domain
3. **Preserving Router Functionality**: Maintaining the routing mechanism with fewer available experts

### Our Approach
- **Data-Driven Selection**: Used activation patterns from math evaluation tasks
- **Systematic Reduction**: Reduced from 32 to 20 experts per layer
- **No Retraining**: Direct removal without additional training steps
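
A minimal sketch of frequency-based expert selection, assuming router decisions are available as per-token lists of activated expert indices for each layer. The `activations` layout and the `select_experts` helper are illustrative assumptions, not the authors' actual pipeline:

```python
from collections import Counter

def select_experts(activations, num_experts=32, keep=20):
    """Return, per layer, the sorted indices of the `keep` most-activated experts.

    activations: dict mapping layer index -> list of per-token lists of
    expert indices chosen by the router (hypothetical layout).
    """
    kept = {}
    for layer, token_choices in activations.items():
        # Start every expert at zero so never-activated experts are ranked too.
        counts = Counter({e: 0 for e in range(num_experts)})
        for chosen in token_choices:
            counts.update(chosen)
        # Rank by activation count (descending), breaking ties by expert index.
        ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
        kept[layer] = sorted(ranked[:keep])
    return kept

# Toy example: one layer, 8 experts, top-2 routing, keep 4 experts.
toy = {0: [[0, 3], [3, 5], [3, 0], [5, 7], [0, 5]]}
kept = select_experts(toy, num_experts=8, keep=4)
print(kept)
```

Selection here is independent per layer, so different layers may retain different expert indices.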

## Performance & Applications

### Pruning Benefits
- **Smaller Memory Footprint**: Retains 20 of 32 experts per layer, i.e. 62.5% of the original expert parameters
- **Reduced Computational Load**: The router scores 20 experts per layer instead of 32; the number of experts activated per token (top-k = 4) is unchanged
- **Focused Capabilities**: Retains experts relevant to math tasks
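
As a back-of-envelope check, the figures in the Model Architecture & Statistics table can be combined as follows. The inputs are the card's rounded values, and the calculation assumes that every removed parameter belonged to a pruned expert, so the per-expert size is only an estimate:

```python
# Figures taken from the Model Architecture & Statistics table (rounded).
total_before = 21e9      # parameters in the original model
total_after = 13.7e9     # parameters after pruning
layers = 24
experts_before, experts_after = 32, 20

# Fraction of expert parameters retained in each layer.
retained_fraction = experts_after / experts_before

# Assumption: every removed parameter belonged to a pruned expert.
removed_experts = (experts_before - experts_after) * layers
params_per_expert = (total_before - total_after) / removed_experts

print(f"retained expert fraction: {retained_fraction:.1%}")
print(f"experts removed: {removed_experts}")
print(f"approx. parameters per expert: {params_per_expert / 1e6:.1f}M")
```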

### Use Cases
- **Speculative Decoding**: Draft model for full GPT-OSS-20B
- **Resource-Constrained Deployment**: Edge devices, mobile applications
- **Research**: Study expert specialization in MoE models
- **Fine-tuning**: Smaller base model for domain adaptation

*Note: Performance may vary depending on how well the pruned experts match your specific use case.*
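
To illustrate the speculative-decoding use case, here is a toy sketch of the greedy variant: a cheap draft model proposes a few tokens and the target model verifies them, keeping its own token at the first mismatch. The two "models" are stand-in functions over integer tokens, and `speculative_decode` is a hypothetical helper, not part of any library. A real system would verify all proposed positions in a single target forward pass rather than one call per token.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    accepted = 0
    while len(tokens) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks each proposed token in order.
        for proposed in proposal:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break  # first mismatch: keep the target's token and re-draft
            accepted += 1
    return tokens[len(prompt):], accepted

# Stand-in "models" over integer tokens (purely illustrative).
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    # Agrees with the target except right after token 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

output, accepted = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(output == greedy_decode(target_next, [0], 12), accepted)
```

Because every emitted token comes from the target, the result matches what the target would produce on its own; the draft only determines how many tokens are confirmed per verification round.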

## Motivation & Expert Selection

This mathematics-focused model uses experts that exhibited strong performance on mathematical reasoning tasks from the MMLU mathematics subjects and quantitative sections. These experts excel at mathematical computation, proof strategies, and logical reasoning.

Expert selection drew on our analysis of router activation patterns across multiple evaluation benchmarks:

- **GPQA**: Graduate-level questions in physics, chemistry, biology (Diamond & Expert subsets)
- **MMLU/MMLU-Pro**: Comprehensive knowledge across 57+ subjects including science, medicine, law
- **SORRY-Bench**: Safety evaluation across harmful content categories
- **Tulu3**: Persona-driven instruction following with verifiable constraints
- **Polyglot-or-Not**: Multilingual factual completion tasks

By identifying experts that consistently activated for math tasks, we created this specialized model, which aims to maintain domain expertise while reducing the expert count from 32 to 20 per layer.

## Dataset & Analysis Foundation

This model is based on analysis from the **GPT-OSS-20B MoE Expert Activations dataset** available at:
🔗 **https://huggingface.co/datasets/AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations**

The dataset contains router activation patterns from OpenAI's GPT-OSS-20B model across diverse evaluation benchmarks, enabling the creation of these domain-optimized models through systematic expert pruning.

### Pruning Methodology
Our approach involves:
1. **Activation Analysis**: Comprehensive evaluation of expert usage patterns across domain-specific tasks
2. **Expert Ranking**: Identification of the most frequently activated experts for target domains
3. **Systematic Pruning**: Reduction from 32 to 20 experts while preserving router functionality
4. **Quality Validation**: Testing whether performance on target tasks is maintained

*This is a direct pruning approach: no additional training was performed, so the retained experts keep the weights they had in the original GPT-OSS-20B.*
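
One way to picture "preserving router functionality" is to drop the router rows of the removed experts and route over what remains. The sketch below assumes a linear router with one weight row and one bias per expert, followed by top-k selection and a softmax over the selected logits; plain Python lists stand in for tensors, and `prune_router` and `route` are hypothetical names used only for illustration:

```python
import math

def prune_router(router_weights, router_bias, kept):
    """Keep only the router rows (one per expert) for the retained experts."""
    return [router_weights[e] for e in kept], [router_bias[e] for e in kept]

def route(hidden, router_weights, router_bias, top_k=4):
    """Return (local expert index, weight) pairs for the top_k experts."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(router_weights, router_bias)]
    top = sorted(range(len(logits)), key=lambda i: -logits[i])[:top_k]
    # Softmax over the selected logits only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy router: 6 experts, hidden size 2; keep experts 0, 2, 3 and 5.
weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [1.0, 1.0]]
bias = [0.0] * 6
kept = [0, 2, 3, 5]

pruned_w, pruned_b = prune_router(weights, bias, kept)
routed = route([1.0, 0.5], pruned_w, pruned_b, top_k=2)

# Map local indices in the pruned router back to original expert ids.
print([(kept[i], round(w, 3)) for i, w in routed])
```

The local indices returned by `route` refer to positions in the pruned router, which is why the example maps them back through `kept`.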

## Usage

### CPU Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the specialized model on CPU
model = AutoModelForCausalLM.from_pretrained(
    "AmanPriyanshu/gpt-oss-13.7b-specialized-math-pruned-moe-only-20-experts",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("AmanPriyanshu/gpt-oss-13.7b-specialized-math-pruned-moe-only-20-experts")

# Generate with the model
messages = [
    {"role": "user", "content": "Solve this equation: 2x + 5 = 17. Show your work step by step."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="medium"
)

# Ensure inputs are on the same device as model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Decode only the generated part
input_length = inputs['input_ids'].shape[1]
response_tokens = outputs[0][input_length:]
response = tokenizer.decode(response_tokens, skip_special_tokens=True)
print(response)
```

### Apple Silicon (MPS) Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Check MPS availability and load model
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "AmanPriyanshu/gpt-oss-13.7b-specialized-math-pruned-moe-only-20-experts",
    torch_dtype=torch.float16,  # Better MPS compatibility
    device_map=device,
    trust_remote_code=True,
    low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained("AmanPriyanshu/gpt-oss-13.7b-specialized-math-pruned-moe-only-20-experts")

# Generate with the model
messages = [
    {"role": "user", "content": "Solve this equation: 2x + 5 = 17. Show your work step by step."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="medium"
)

# Move inputs to model device
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Use torch.no_grad for MPS stability
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.1,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True
    )

# Decode only the generated part
input_length = inputs['input_ids'].shape[1]
response_tokens = outputs[0][input_length:]
response = tokenizer.decode(response_tokens, skip_special_tokens=True)
print(response)
```

### GPU Inference

For CUDA GPU inference, the model works with minimal modifications to the CPU example above. Change these two arguments in the `from_pretrained` call:
```python
# Keyword arguments for AutoModelForCausalLM.from_pretrained(...)
device_map="auto",           # Will automatically use GPU if available
torch_dtype=torch.bfloat16,  # or torch.float16
```

## Model Examples

The following examples demonstrate the model's performance on math tasks:

### Example 1
**Question:** Solve this equation: 2x + 5 = 17. Show your work step by step.

**Response:**
```
<|channel|>analysis<|message|>We need to solve 2x+5=17. Subtract 5: 2x=12. Divide 2: x=6. Provide step-by-step.<|end|><|start|>assistant<|channel|>final<|message|>**Step 1 – Write the equation**

\[
2x + 5 = 17
\]

**Step 2 – Isolate the term with \(x\)**

Subtract 5 from both sides so that the constant term disappears:

\[
2x + 5 - 5 = 17 - 5
\]

\[
2x = 12
\]

**Step 3 – Solve for \(x\)**

Divide both sides by 2 (the coefficient of \(x\)):

\[
\frac{2x}{2} = \frac{12}{2}
\]

\[
x = 6
\]

---

**Answer:** \(x = 6\)

(You can check by plugging back: \(2(6)+5 = 12+5 = 17\), which matches the right‑hand side.)<|return|>
```

---

### Example 2
**Question:** What is the Pythagorean theorem and how is it used?

**Response:**
```
<|channel|>analysis<|message|>Need explain theorem, formula, usage.<|end|><|start|>assistant<|channel|>final<|message|>**The Pythagorean Theorem**

> In a right‑angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) equals the sum of the squares of the lengths of the other two sides.

Mathematically:

\[
a^{2} + b^{2} = c^{2}
\]

where
* \(a\) and \(b\) are the legs (the two sides that form the right angle),
* \(c\) is the hypotenuse.

---

### How It Is Used

| Application | What you do | Example |
|-------------|-------------|---------|
| **Finding a missing side** | Rear‑solve the equation for the unknown side | In a triangle with legs 3 m and 4 m, \(c = \sqrt{3^{2}+4^{2}} = 5\) m. |
| **Checking if a triangle is right‑angled** | Verify if \(a^{2}+b^{2}=c^{2}\) | For sides 6, 8, 10: \(6^{2}+8^{2}=36+64=100=10^{2}\) → right triangle. |
| **Design & construction** | Ensure structural elements meet right‑angle requirements (e.g., building corners, ladders). | A carpenter uses the theorem to calculate the length of a diagonal brace. |
| **Navigation & surveying** | Compute distances when only horizontal and vertical components are known. | A hiker measures 200 m east and 150 m north; the straight‑away distance is \(\sqrt{200^{2}+150^{2}}\). |
| **Computer graphics & animation** | Determine pixel distances, collision detection, or rendering of right‑angled shapes. | A program calculates the distance between two points in a 2‑D grid. |
| **Physics & engineering** | Resolve forces, velocities, or displacements into perpendicular components. | A projectile’s horizontal and vertical speeds combine to give the resultant speed. |
| **Mathematics & education** | Introduce geometry, algebra, and proof techniques. | Students prove that the area of a square built on each side of a right triangle satisfies the theorem. |

---

### Quick “Rule of Thumb”

- **If you know two sides** → compute the third with \(c = \sqrt{a^{
```

---

### Example 3
**Question:** Calculate the area of a circle with radius 7 meters.

**Response:**
```
<|channel|>analysis<|message|>Need area πr^2 = π*49 ≈ 153.938.<|end|><|start|>assistant<|channel|>final<|message|>The area \(A\) of a circle is given by

\[
A = \pi r^{2}
\]

where \(r\) is the radius.

For \(r = 7\) m:

\[
A = \pi (7\ \text{m})^{2}
= \pi \times 49\ \text{m}^{2}
\approx 3.14159 \times 49\ \text{m}^{2}
\approx 153.938\ \text{m}^{2}
\]

So the circle has an area of about **\(154\ \text{m}^{2}\)** (rounded to the nearest square meter).<|return|>
```
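
The responses above include raw channel markers (`<|channel|>analysis`, `<|channel|>final`, `<|end|>`, `<|return|>`). If you decode with the special tokens kept, a small helper along these lines could separate the reasoning from the final answer. It is a sketch based only on the markers visible in these examples; `split_channels` is a hypothetical name, not part of any library:

```python
import re

# Matches "<|channel|>NAME<|message|>TEXT" up to the next control marker
# (or the end of the string, for outputs truncated by max_new_tokens).
CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)

def split_channels(raw):
    """Map each channel name (e.g. 'analysis', 'final') to its text."""
    return {name: text.strip() for name, text in CHANNEL_RE.findall(raw)}

raw = (
    "<|channel|>analysis<|message|>Subtract 5: 2x=12. Divide 2: x=6.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>**Answer:** x = 6<|return|>"
)
channels = split_channels(raw)
print(channels["final"])
```

If a channel appears more than once in the same output, this sketch keeps only its last occurrence.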

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{priyanshu2025gptoss,
  title={{GPT-OSS MoE Expert Fingerprinting: Analyzing Expert Activation Patterns in Mixture of Experts Models}},
  author={Priyanshu, Aman and Vijay, Supriti},
  year={2025},
  howpublished={\url{https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/}},
  note={Interactive analysis tool for expert activation patterns in MoE architectures}
}
```

## References & Resources

- **Original Model**: [OpenAI GPT-OSS Model Card](https://openai.com/index/introducing-gpt-oss/)
- **Model Hub**: [GPT-OSS-20B on Hugging Face](https://huggingface.co/openai/gpt-oss-20b)
- **Expert Analysis Dataset**: [GPT-OSS-20B MoE Expert Activations](https://huggingface.co/datasets/AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations)
- **Project Page**: [GPT-OSS MoE Expert Fingerprinting](https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/)
- **GitHub Repository**: [OpenAI GPT-OSS](https://github.com/openai/gpt-oss)