初始化项目,由ModelHub XC社区提供模型
Model: togethercomputer/Llama-3.1-8B-Instruct-MoAA-DPO Source: Original Platform
This commit is contained in:
71
README.md
Normal file
71
README.md
Normal file
@@ -0,0 +1,71 @@
|
||||
---
|
||||
library_name: transformers
|
||||
tags: []
|
||||
---
|
||||
|
||||
|
||||
|
||||
|
||||
## Model Description
|
||||
|
||||
This is the DPO model in our Mixture of Agents Alignment (MoAA) pipeline. This model is tuned on the Llama-3.1-8b-Instruct. MoAA is an approach that leverages collective intelligence from open‑source LLMs to advance alignment.
|
||||
|
||||
Two mains stages are involved in our MoAA method. In the first stage, we employ MoA to produce high-quality synthetic data for supervised fine-tuning. In the second stage, we combines multiple LLMs as a reward model to provide preference annotations.
|
||||
|
||||
Some key takeaways of our work:
|
||||
|
||||
|
||||
|
||||
- 📈**Alignment pipeline that actually works** Our MoAA method sends Llama‑3.1‑8B‑Instruct’s Arena‑Hard **19 → 48** and Gemma-2-9B-it **42→56**, handily beating GPT‑4o‑labeled sets at the time.
|
||||
|
||||
- 🏆**Ensembled rewards > single critics** An MoA reward model with dynamic criteria filtering edges out competitive ArmoRM on MT‑Bench & Arena‑Hard—all while staying 100 % open source.
|
||||
|
||||
- 🚀**Self‑improvement unlocked** Fine‑tune the strongest model inside the ensemble on MoAA data and it *surpasses its own teachers*—evidence that open models can push past proprietary ceilings without external supervision.
|
||||
|
||||
|
||||
## Model Sources
|
||||
|
||||
|
||||
For more details refer to
|
||||
|
||||
- **[Paper](https://arxiv.org/abs/2505.03059)**
|
||||
|
||||
<!-- - **[twitter](https://arxiv.org/abs/2505.03059)**
|
||||
- **[blgopost](https://arxiv.org/abs/2505.03059)** -->
|
||||
|
||||
|
||||
|
||||
## How to Get Started with the Model
|
||||
|
||||
Use the code below to get started with the model.
|
||||
|
||||
Run inference like this:
|
||||
```
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Instruct-MoAA-DPO")
|
||||
model = AutoModelForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Instruct-MoAA-DPO")
|
||||
```
|
||||
|
||||
|
||||
## Training Data
|
||||
|
||||
We sample 5 responses from the previously trained SFT model and use a reward model to select the preferred and rejected responses for preference learning. Specifically, we utilize the reward model to identify the highest-scoring response as the "chosen" response and the lowest-scoring response as the "rejected" response for each method, and here we propose a novel technique that leverages MoA as a reward model.
|
||||
|
||||
## Evaluation & Performance
|
||||
|
||||
Refer to [Paper](https://arxiv.org/abs/2505.03059) for metrics.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Citation
|
||||
```
|
||||
@article{wang2025improving,
|
||||
title = {Improving Model Alignment Through Collective Intelligence of Open-Source LLMS},
|
||||
author = {Junlin Wang and Roy Xie and Shang Zhu and Jue Wang and Ben Athiwaratkun and Bhuwan Dhingra and Shuaiwen Leon Song and Ce Zhang and James Zou},
|
||||
year = {2025},
|
||||
journal = {arXiv preprint arXiv: 2505.03059}
|
||||
}
|
||||
```
|
||||
Reference in New Issue
Block a user