初始化项目，由ModelHub XC社区提供模型

Model: togethercomputer/Llama-3.1-8B-Instruct-MoAA-DPO Source: Original Platform
2026-05-29 02:46:13 +08:00
commit 1fcbb2a714
16 changed files with 2588 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,71 @@
+---
+library_name: transformers
+tags: []
+---
+
+
+
+
+## Model Description
+
+This is the DPO model in our Mixture of Agents Alignment (MoAA) pipeline. This model is tuned on the Llama-3.1-8b-Instruct. MoAA is an approach that leverages collective intelligence from open‑source LLMs to advance alignment.
+
+Two mains stages are involved in our MoAA method. In the first stage, we employ MoA  to produce high-quality synthetic data for supervised fine-tuning. In the second stage, we combines multiple LLMs as a reward model to provide preference annotations. 
+
+Some key takeaways of our work:
+
+
+
+- 📈**Alignment pipeline that actually works** Our MoAA method sends Llama‑3.1‑8B‑Instruct’s Arena‑Hard **19 → 48** and Gemma-2-9B-it **42→56**, handily beating GPT‑4o‑labeled sets at the time. 
+
+- 🏆**Ensembled rewards > single critics** An MoA reward model with dynamic criteria filtering edges out competitive ArmoRM on MT‑Bench & Arena‑Hard—all while staying 100 % open source. 
+
+- 🚀**Self‑improvement unlocked** Fine‑tune the strongest model inside the ensemble on MoAA data and it *surpasses its own teachers*—evidence that open models can push past proprietary ceilings without external supervision.
+
+
+## Model Sources
+
+
+For more details refer to 
+
+- **[Paper](https://arxiv.org/abs/2505.03059)**
+
+<!-- - **[twitter](https://arxiv.org/abs/2505.03059)**
+- **[blgopost](https://arxiv.org/abs/2505.03059)** -->
+
+
+
+## How to Get Started with the Model
+
+Use the code below to get started with the model.
+
+Run inference like this:
+```
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Instruct-MoAA-DPO")
+model = AutoModelForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Instruct-MoAA-DPO")
+```
+
+
+## Training Data
+
+We sample 5 responses from the previously trained SFT model and use a reward model to select the preferred and rejected responses for preference learning. Specifically, we utilize the reward model to identify the highest-scoring response as the "chosen" response and the lowest-scoring response as the "rejected" response for each method, and here we propose a novel technique that leverages MoA as a reward model.
+
+## Evaluation & Performance
+
+Refer to [Paper](https://arxiv.org/abs/2505.03059) for metrics.
+
+
+
+
+
+## Citation
+```
+@article{wang2025improving,
+title   = {Improving Model Alignment Through Collective Intelligence of Open-Source LLMS},
+author  = {Junlin Wang and Roy Xie and Shang Zhu and Jue Wang and Ben Athiwaratkun and Bhuwan Dhingra and Shuaiwen Leon Song and Ce Zhang and James Zou},
+year    = {2025},
+journal = {arXiv preprint arXiv: 2505.03059}
+}
+```