---
language:
- en
license: apache-2.0
library_name: transformers
---
# Model Card for Mistral-7B-ReMax-v0.1
The Mistral-7B-ReMax-v0.1 Large Language Model (LLM) is a version of [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) fine-tuned with Reinforcement Learning from Human Feedback (RLHF).
The fine-tuning algorithm is ReMax; see the [paper](https://arxiv.org/abs/2310.10505) for details.
## Model Details
- Developed by: Ziniu Li
- License: apache-2.0
- Finetuned from [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
## Uses
### Direct Use
The instruction format is the same as Mistral-7B-Instruct-v0.2. Specifically, the prompt should be surrounded by \[INST\] and \[/INST\] tokens.
```
text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
```
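The format above can be assembled programmatically. The following is a minimal sketch, not part of the released code: `build_prompt` is a hypothetical helper that mirrors the example string exactly, taking alternating user and assistant turns that end with a user turn.

```python
def build_prompt(turns):
    """Render alternating user/assistant turns in the [INST] format.

    `turns` is a list of strings ordered user, assistant, user, ...
    and must end with a user turn (hypothetical helper for illustration).
    """
    if len(turns) % 2 == 0:
        raise ValueError("expected an odd number of turns ending with a user message")
    parts = ["<s>"]
    for i, text in enumerate(turns):
        if i % 2 == 0:
            # User turn: wrap in [INST] ... [/INST].
            parts.append(f"[INST] {text} [/INST]")
        else:
            # Assistant turn: close with the end-of-sequence marker.
            parts.append(f"{text}</s> ")
    return "".join(parts)


prompt = build_prompt([
    "What is your favourite condiment?",
    "Well, I'm quite partial to a good squeeze of fresh lemon juice.",
    "Do you have mayonnaise recipes?",
])
print(prompt)
```

If the tokenizer in use already prepends a beginning-of-sequence token, the literal `<s>` should probably be dropped from the string to avoid duplicating it.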
## Training Details
### Training Data
10k prompts from the [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset. Note that no responses from this dataset are used in training.
### Reward Model
The reward model is based on [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b).
**Important**: UltraRM-13b uses a different instruction template from Mistral-7B. To address this, we switch the chat template to that of UltraRM-13b when computing reward scores during RLHF fine-tuning.
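As an illustration of the template switch, here is a minimal sketch for the single-turn case. The `Human: ... Assistant: ...` template string is an assumption based on the UltraRM-13b model card and should be checked there; `to_reward_model_input` is a hypothetical helper, not the code used in training.

```python
import re

# Assumed UltraRM-style template; verify against the UltraRM-13b model card.
ULTRARM_TEMPLATE = "Human: {instruction}\n\nAssistant: {completion}"

# Captures the text between [INST] and [/INST] in a Mistral-style prompt.
_INST_PATTERN = re.compile(r"\[INST\]\s*(.*?)\s*\[/INST\]", re.DOTALL)


def to_reward_model_input(mistral_prompt, response):
    """Re-render a single-turn Mistral prompt and a response for the reward model."""
    match = _INST_PATTERN.search(mistral_prompt)
    if match is None:
        raise ValueError("prompt does not contain an [INST] ... [/INST] block")
    instruction = match.group(1)
    # Remove the end-of-sequence marker and surrounding whitespace.
    completion = response.replace("</s>", "").strip()
    return ULTRARM_TEMPLATE.format(instruction=instruction, completion=completion)


print(to_reward_model_input("<s>[INST] Hi there [/INST]", " Hello!</s>"))
```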
### Training Procedure
The training algorithm is ReMax. Details are available in the [paper](https://arxiv.org/abs/2310.10505), and an implementation is available in the [repository](https://github.com/liziniu/ReMax).
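For orientation, the core idea described in the paper is a REINFORCE-style update that uses the reward of the greedy response as a baseline. The sketch below shows that idea for one prompt with plain Python floats; the function names are hypothetical, and it omits batching, KL penalties, and everything else in the actual implementation.

```python
def remax_advantage(reward_sampled, reward_greedy):
    """Reward of the sampled response minus the greedy-response baseline."""
    return reward_sampled - reward_greedy


def remax_surrogate_loss(token_logprobs, reward_sampled, reward_greedy):
    """Surrogate loss whose gradient is the baseline-subtracted REINFORCE estimate.

    `token_logprobs` are the log-probabilities the policy assigns to the
    tokens of the sampled response. The advantage is treated as a constant.
    """
    advantage = remax_advantage(reward_sampled, reward_greedy)
    return -advantage * sum(token_logprobs)


# A sampled response that beats the greedy one gets a positive advantage,
# so minimizing the loss raises the log-probability of its tokens.
loss = remax_surrogate_loss([-0.5, -1.0], reward_sampled=1.0, reward_greedy=0.5)
print(loss)
```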
### Training Hyperparameters
```
max_prompt_seq: 384
max_answer_seq: 384
learning_rate: 5e-7
num_train_epochs: 1
kl_penalty: full_step
kl_ctl: 0.1
reward_clip_value: 1.0
temperature for sampling: 0.7
top_p for sampling: 0.9
```
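The two sampling settings (`temperature` 0.7 and `top_p` 0.9) refer to temperature scaling followed by nucleus sampling. The sketch below shows that standard technique in plain Python on a list of logits; `top_p_filter` and `sample_token` are hypothetical names, and the training code relies on its own generation routine rather than this.

```python
import math
import random


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng, temperature=0.7, top_p=0.9):
    """Draw one token index from the filtered, renormalized distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    indices = list(dist)
    weights = [dist[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_token([2.0, 1.0, 0.5, -1.0], rng))
```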
## Evaluation
### Chat1M
Since we fine-tune Mistral-7B-Instruct-v0.2 on prompts from the lmsys-chat-1m dataset, we first evaluate response quality on that dataset. We randomly select 500 prompts from lmsys-chat-1m and let both models generate responses. We then ask GPT-4 to judge each pair of responses, which gives a binary win rate. The average win rate of Mistral-7B-ReMax-v0.1 is 57.8%.
| | Mistral-7B-ReMax-v0.1 Wins | Mistral-7B-ReMax-v0.1 Loses |
| ------------------------ | -------------------------- | -------------------------- |
| Mistral-7B-Instruct-v0.2 | 289 | 211 |
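The reported win rate follows directly from the two counts in the table, reading 289 as wins and 211 as losses for Mistral-7B-ReMax-v0.1 over the 500 sampled prompts. A short check:

```python
wins, losses = 289, 211
total = wins + losses
win_rate = wins / total
print(f"{total} comparisons, win rate = {win_rate:.1%}")
# prints "500 comparisons, win rate = 57.8%"
```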
Samples of responses are provided below.
```
==================Instruction================
你好
==================Mistral-7B-Instruct-v0.2================
你好!我是一位帮助您解答问题的机器人。如果您有任何问题,请告诉我,我会尽力回答。
你好Hello是中文语言中的一个常用词语常用于问候或表示亲和意思是“你好你很好”。
在我帮助您之前,我想先问一下,您是否知道什么是机器学习?机器学习是人工智能领域中的一种技术,它允许计算机自动学习和改进其行为,以便从数据中提取出有价值的信息,并使计算机能够做出更智能的决策。
如果您已经熟悉机器学习,那么我可以帮助您深入探讨相关主题,如神经网络、支持向量机、决策树等。但是,如果您对机器学习不太熟悉,我可以简单地解释它的基本原理,以及它在现实生活中的应用。
请告诉我您的需求,我会尽力满足您的期望。
再次问候您:你好!
==================Mistral-7B-ReMax-v0.1================
好的欢迎使用我的帮助Hello, welcome to use my help!
是的我可以以中文回答你的问题或者帮你完成一些简单的任务。Yes, I can answer your questions in Chinese or help you with some simple tasks.
请问有什么特定的问题需要我帮忙呢What specific question do you need my help with?
或者你可以告诉我什么方面我可以提供帮助。Or you can tell me which area I can provide help in.
希望我能帮到你I hope I can help you!
```
### AlpacaEval
[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) provides 805 test questions that the model has not been trained on. The table below reports win rates (%) on AlpacaEval and AlpacaEval 2, together with the average response length:
| | AlpacaEval | AlpacaEval 2 | Response Length |
| ------------------------ | ---------- | ------------ | --------------- |
| Mistral-7B-Instruct-v0.2 | 92.78 | 14.72 | 1676 |
| Mistral-7B-ReMax-v0.1 | 94.40 | 16.00 | 1478 |
| LLaMa2 Chat 7B | 71.37 | 4.96 | 1479 |
| LLaMa2 Chat 70B | 92.66 | 13.87 | 1790 |
| GPT-4 | 95.28 | 23.58 | 2049 |
Samples of responses are provided below.
```
==================Instruction================
Marie is at the music store for a store day. She bought two $50 combos. Then, for $25 she bought two t-shirts for $5 each and five hoodies for $2 each. Then, she bought sneakers for $8 each. How much did she spend total?
==================Mistral-7B-Instruct-v0.2================
Marie bought 2 combos at $50 each, so that's $100. She bought 2 t-shirts at $5 each, which is $10. She bought 5 hoodies at $2 each, which is $10. The sneakers cost $8. So Marie spent a total of $118. ($100 + $10 + $8 = $118)
==================Mistral-7B-ReMax-v0.1================
Marie bought two combos at $50 each, so that's $100. She bought two t-shirts for $5 each, which totals $10. She bought five hoodies for $2 each, so that's $10. The sneakers cost $8.
So Marie spent $100 + $10 + $10 + $8 = $128 in total.
```
## Limitations
The Mistral-7B-ReMax-v0.1 model demonstrates that the optimization algorithm ReMax is effective. This model shares the same limitations as the Mistral-7B-Instruct-v0.2 model. In particular, it does not have any moderation mechanisms.
## Citation
**BibTeX:**
```
@article{li2023remax,
title={Remax: A simple, effective, and efficient method for aligning large language models},
author={Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},
journal={arXiv preprint arXiv:2310.10505},
year={2023}
}
```