---
base_model: Qwen/Qwen2.5-7B-Base
library_name: transformers
pipeline_tag: text-generation
datasets:
- OpenDataArena/ODA-Mixture-100k
tags:
- qwen2.5
- sft
- opendataarena
- oda-mixture-100k
license: apache-2.0
language:
- en
metrics:
- accuracy
---
# Qwen2.5-7B-ODA-Mixture-100k
<img src="performance.png" alt="Leaderboard Performance" width="1200" />
Qwen2.5-7B-ODA-Mixture-100k is a supervised fine-tuned (SFT) model built on top of **Qwen2.5-7B-Base** and trained on **[ODA-Mixture-100k](https://huggingface.co/datasets/OpenDataArena/ODA-Mixture-100k)**. The training set is curated by mixing top-performing open corpora selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard, then refined through deduplication and benchmark decontamination. The aim is to improve the model's general capabilities across the **General**, **Math**, **Code**, and **Reasoning** domains under a compact ~100K data budget.
---
## 🧠 Model Summary
- **Base Model**: `Qwen/Qwen2.5-7B-Base`
- **Training Data**: `OpenDataArena/ODA-Mixture-100k`
- **Domain Coverage**: General, Math, Code, Reasoning
- **Scale (selected training set)**: ~**100K** samples
- **Goal**: Achieve significant general-purpose gains with a compact curated dataset, improving multi-domain reasoning and problem-solving ability.
---
## ⚙️ Training Data Curation Pipeline
ODA-Mixture-100k is built by following a single rule: **trust the OpenDataArena leaderboard**.
### 1️⃣ Data Collection
We choose **LIMO** as our foundation because it achieves a high ranking on the ODA overall leaderboard with very few samples. This efficiency allows us to establish a strong reasoning baseline. We then augment this core with **AM-Thinking-v1-Distilled-math** and **AM-Thinking-v1-Distilled-code**, which are top-performing, efficient datasets on the ODA Math and Code leaderboards, to enhance specialized domain capabilities.
### 2️⃣ Deduplication & Decontamination
We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
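The README does not say how either step is implemented, so the following is only a minimal sketch under stated assumptions: strict string matching on the `question` field for deduplication, and word-level 8-gram overlap for decontamination. The function names, the `n=8` window, and the toy benchmark question are all illustrative and are not taken from the ODA pipeline.

```python
import re


def exact_dedup(samples):
    """Keep the first sample for each distinct question string."""
    seen = set()
    unique = []
    for sample in samples:
        key = sample["question"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def word_ngrams(text, n):
    """Return the set of lowercase word-level n-grams in `text`."""
    words = re.sub(r"\s+", " ", text.strip().lower()).split(" ")
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(samples, benchmark_questions, n=8):
    """Drop samples that share any word n-gram with a benchmark question.

    ASSUMPTION: n-gram overlap with n=8 is a common heuristic chosen for
    illustration; the actual ODA decontamination rule is not specified here.
    """
    banned = set()
    for question in benchmark_questions:
        banned |= word_ngrams(question, n)
    return [s for s in samples if not (word_ngrams(s["question"], n) & banned)]


# Toy data (hypothetical): "b" duplicates "a", and "c" overlaps the benchmark item.
samples = [
    {"id": "a", "question": "What is 2 + 2?"},
    {"id": "b", "question": "What is 2 + 2?"},
    {"id": "c", "question": "Find the sum of all positive integers n such that n squared is less than fifty."},
    {"id": "d", "question": "Write a function that reverses a linked list."},
]
benchmark = [
    "Find the sum of all positive integers n such that n squared is less than one hundred.",
]

deduped = exact_dedup(samples)
clean = decontaminate(deduped, benchmark)
print([s["id"] for s in clean])
```

Running deduplication first keeps the decontamination pass cheaper, since each distinct question is checked against the benchmark n-grams only once.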
### 3️⃣ Data Selection
To adhere to our ~100K data budget while maximizing the impact of each sample, we employ semantic clustering to map the overall data distribution. Within each cluster, we preferentially sample the most challenging instances, using sequence length as a practical proxy for reasoning complexity and problem difficulty.
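The selection step can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes cluster labels have already been produced by some semantic clustering method (for example, k-means over sentence embeddings), that each cluster receives a quota proportional to its size, and that sequence length is measured in characters of the `response` field. The function name `select_by_cluster` and all of these choices are assumptions.

```python
from collections import defaultdict


def select_by_cluster(samples, cluster_ids, budget):
    """Pick roughly `budget` samples, preferring long responses in each cluster."""
    clusters = defaultdict(list)
    for sample, cid in zip(samples, cluster_ids):
        clusters[cid].append(sample)

    total = len(samples)
    selected = []
    for members in clusters.values():
        # ASSUMPTION: a proportional quota keeps the selection close to the
        # original data distribution; every cluster keeps at least one sample.
        quota = max(1, round(budget * len(members) / total))
        # Longest responses first: length is the proxy for difficulty.
        ranked = sorted(members, key=lambda s: len(s["response"]), reverse=True)
        selected.extend(ranked[:quota])
    return selected


# Toy data (hypothetical): four samples in cluster 0, two in cluster 1.
samples = [
    {"id": "m1", "response": "x" * 10},
    {"id": "m2", "response": "x" * 50},
    {"id": "m3", "response": "x" * 30},
    {"id": "m4", "response": "x" * 5},
    {"id": "c1", "response": "x" * 20},
    {"id": "c2", "response": "x" * 40},
]
cluster_ids = [0, 0, 0, 0, 1, 1]

picked = select_by_cluster(samples, cluster_ids, budget=3)
print([s["id"] for s in picked])
```

Because quotas are rounded per cluster, the result can differ from the budget by a few samples; a real pipeline would need a final trim or top-up step.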
---
## 📚 Training Data Source Composition
| Source | Count | Percentage |
|---|---:|---:|
| LIMO | 817 | 0.81% |
| AM-Thinking-v1-Distilled-math | 50,244 | 49.59% |
| AM-Thinking-v1-Distilled-code | 50,245 | 49.60% |
---
## 🧩 Data Format
The training data sample format is as follows (aligned with the dataset schema):
```json
{
"id": "unique_identifier",
"source": "data source",
"question": "textual question or instruction",
"response": "textual response"
}
```
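For SFT with a chat template, each record typically has to be turned into a list of role-tagged messages. The helper below is a small sketch of that conversion; the name `to_chat_messages` and the mapping of `question` to the user turn and `response` to the assistant turn are assumptions, not something this README specifies.

```python
REQUIRED_FIELDS = ("id", "source", "question", "response")


def to_chat_messages(record):
    """Convert one dataset record into a two-turn chat conversation."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    # ASSUMPTION: question -> user turn, response -> assistant turn.
    return [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": record["response"]},
    ]


# Hypothetical record following the schema above.
record = {
    "id": "example-0001",
    "source": "LIMO",
    "question": "What is 7 * 6?",
    "response": "7 * 6 = 42.",
}

messages = to_chat_messages(record)
print(messages)
```

The resulting list has the shape expected by `tokenizer.apply_chat_template` in `transformers`.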
---
## 📈 Performance
Qwen2.5-7B-ODA-Mixture-100k is evaluated as an SFT model built on **Qwen2.5-7B-Base** across the full ODA benchmark suite spanning four domains:
- **General (DROP, IFEVAL, AGIEVAL, MMLU-Pro)**
- **Math (GSM8K, MATH500, Omni-Math, OlympiadBench, AIME2024)**
- **Code (HumanEval, MBPP, LCB (V5), HumanEval+)**
- **Reasoning (ARC-C, BBH, CALM, KOR-BENCH)**

The model improves on the base checkpoint in all four domains, raising the average score from 46.0 to 61.0, with the largest gains in Math (39.8 to 71.2) and Code (50.1 to 64.4).
<div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
<table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
<caption style="padding: 10px; font-weight: bold;">
Leaderboard Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>. Eff. denotes Data Efficiency.
</caption>
<thead>
<tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
<th style="text-align: left; padding: 8px;">Model / Training Data</th>
<th>Size</th>
<th>Eff.</th>
<th>General</th>
<th>Math</th>
<th>Code</th>
<th>Reasoning</th>
<th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
</tr>
</thead>
<tbody>
<!-- ================= Qwen2.5-7B-Base ================= -->
<tr style="background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
<td colspan="8" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen2.5-7B-Base</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Qwen2.5-7B-Base</td>
<td>-</td><td>-</td>
<td>51.4</td><td>39.8</td><td>50.1</td><td>42.7</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">46.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">OpenThoughts3-1.2M</td>
<td>1.2M</td><td>+0.011</td>
<td>45.5</td><td>71.8</td><td><u>67.0</u></td><td>54.3</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">59.6</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">OmniThought-0528</td>
<td>365k</td><td>+0.027</td>
<td>47.1</td><td>71.2</td><td>47.6</td><td>57.2</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">55.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">SYNTHETIC-2-SFT-verified</td>
<td>105k</td><td>+0.086</td>
<td>51.3</td><td>69.8</td><td>40.1</td><td><u>58.9</u></td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">55.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">AM-Thinking-v1-Distilled-math</td>
<td>558k</td><td>+0.016</td>
<td>57.7</td><td><b>77.4</b></td><td>39.5</td><td>44.8</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">LIMO</td>
<td>817</td><td><b>+9.920</b></td>
<td><u>60.7</u></td><td>44.0</td><td>57.9</td><td>53.8</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.1</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">MiroMind-M1-SFT-719K</td>
<td>719k</td><td>+0.006</td>
<td>52.0</td><td>71.0</td><td>26.3</td><td>51.5</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">50.2</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">AM-Thinking-v1-Distilled-code</td>
<td>324k</td><td>+0.024</td>
<td>49.9</td><td>52.3</td><td><b>68.7</b></td><td>44.4</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Light-R1-SFTData</td>
<td>79k</td><td>+0.084</td>
<td>55.5</td><td>64.4</td><td>38.8</td><td>51.9</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">52.7</td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold;">
<td style="text-align: left; padding: 8px;">ODA-Mixture-500k</td>
<td>500k</td><td>+0.039</td>
<td><b>63.4</b></td><td><u>72.8</u></td><td>66.7</td><td><b>59.6</b></td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>65.6</b></td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;">
<td style="text-align: left; padding: 8px;">ODA-Mixture-100k</td>
<td>100k</td><td><u>+0.149</u></td>
<td>56.8</td><td>71.2</td><td>64.4</td><td>51.5</td>
<td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>61.0</u></td>
</tr>
</tbody>
</table>
</div>
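The table does not define the Eff. column. The values are consistent with the average-score gain over Qwen2.5-7B-Base divided by the training set size in thousands of samples, but that formula is an inference from the published numbers, not a definition from the source. The sketch below applies it to three rows; the function name `data_efficiency` is hypothetical, and small differences from the table are expected because the published averages are rounded.

```python
BASE_AVG = 46.0  # AVG of Qwen2.5-7B-Base from the table


def data_efficiency(avg_score, num_samples, base_avg=BASE_AVG):
    """Average-score gain over the base model per 1,000 training samples.

    ASSUMPTION: this formula is inferred from the table values.
    """
    return (avg_score - base_avg) / (num_samples / 1000)


# (AVG, training set size) copied from the table rows.
rows = {
    "LIMO": (54.1, 817),
    "ODA-Mixture-500k": (65.6, 500_000),
    "ODA-Mixture-100k": (61.0, 100_000),
}

efficiency = {name: data_efficiency(avg, size) for name, (avg, size) in rows.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:+.3f}")
```

Under this reading, LIMO's very high efficiency reflects its tiny size rather than a high absolute score, which is why the table reports both Eff. and AVG.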
---
## 🌐 About OpenDataArena
[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.
**Key Features:**
- 🏆 **Dataset Leaderboard** — helps researchers identify **the most valuable and high-quality datasets across different domains**
- 📊 **Detailed Evaluation Scores** — provides **comprehensive metrics** to assess data quality, complexity, difficulty, etc.
- 🧰 **Data Processing Toolkit** — [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.
---
## 🚀 Usage
Model repo: `OpenDataArena/Qwen2.5-7B-ODA-Mixture-100k`. Below is a minimal runnable example for loading and inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Mixture-100k"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"},
]

# Render the conversation with the chat template and append the assistant prompt.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
---
## 📚 Citation
If you use this model or its training data (ODA-Mixture-100k), please cite:
```bibtex
@article{gao2025closing,
title={Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets},
author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Cai, Mengzhang and He, Conghui and Wu, Lijun},
journal={arXiv preprint arXiv:2601.09733},
year={2025}
}
```
```bibtex
@article{cai2025opendataarena,
title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
journal={arXiv preprint arXiv:2512.14051},
year={2025}
}
```