---
library_name: transformers
base_model: mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332
tags:
- alignment-handbook
- epsilon-dpo
- generated_from_trainer
datasets:
- Anthropic/hh-rlhf
model-index:
- name: mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332
  results: []
---

# mistral-7b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-015332

This model is a fine-tuned version of [mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332](https://huggingface.co/mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332) on the Anthropic/hh-rlhf dataset.
It achieves the following results on the evaluation set:
- Loss: 0.5823
- Epsilon Dpo/beta: 0.0009
- Epsilon Dpo/loss Margin Mean: 478.8302
- Epsilon Dpo/beta Margin Mean: 0.4461
- Epsilon Dpo/beta Margin Std: 0.8791
- Epsilon Dpo/beta Margin Grad Mean: -0.4072
- Epsilon Dpo/beta Margin Grad Std: 0.1767
- Rewards/chosen: -0.6675
- Rewards/rejected: -1.1135
- Rewards/accuracies: 0.7038
- Rewards/margins: 0.4461
- Logps/chosen: -771.9094
- Logps/rejected: -1274.5287
- Logps/ref Chosen: -68.2911
- Logps/ref Rejected: -92.0804
- Logits/chosen: -3.8786
- Logits/rejected: -4.7234
- Kl/p Epsilon Steps: 0.6819
- Kl/n Epsilon Steps: 0.3176

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss | Epsilon Dpo/beta | Epsilon Dpo/loss Margin Mean | Epsilon Dpo/beta Margin Mean | Epsilon Dpo/beta Margin Std | Epsilon Dpo/beta Margin Grad Mean | Epsilon Dpo/beta Margin Grad Std | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
|:-------------:|:------:|:----:|:---------------:|:----------------:|:----------------------------:|:----------------------------:|:---------------------------:|:---------------------------------:|:--------------------------------:|:--------------:|:----------------:|:------------------:|:---------------:|:------------:|:--------------:|:----------------:|:------------------:|:-------------:|:---------------:|:------------------:|:------------------:|
| 0.6663        | 0.1468 | 100  | 0.4936          | 0.0513           | 29.5905                      | 1.5046                       | 2.1537                      | -0.3036                           | 0.2585                           | -1.4541        | -2.9587          | 0.7556             | 1.5046          | -96.5378     | -149.9176      | -68.2911         | -92.0804           | -3.2766       | -3.2684         | 0.7226             | 0.2765             |
| 0.559         | 0.2937 | 200  | 0.4014          | 0.0224           | 78.6564                      | 1.7522                       | 1.9999                      | -0.2627                           | 0.2420                           | -2.4182        | -4.1704          | 0.7984             | 1.7522          | -175.9333    | -278.3791      | -68.2911         | -92.0804           | -2.2068       | -2.1055         | 0.7821             | 0.2175             |
| 0.4996        | 0.4405 | 300  | 0.4131          | 0.0097           | 122.2527                     | 1.1844                       | 1.3051                      | -0.2961                           | 0.2077                           | -1.5383        | -2.7227          | 0.7928             | 1.1844          | -225.6753    | -371.7173      | -68.2911         | -92.0804           | -1.9427       | -1.7790         | 0.7783             | 0.2217             |
| 0.5665        | 0.5874 | 400  | 0.4863          | 0.0042           | 180.0219                     | 0.7587                       | 0.9839                      | -0.3501                           | 0.1877                           | -1.0773        | -1.8360          | 0.7564             | 0.7587          | -320.8791    | -524.6903      | -68.2911         | -92.0804           | -2.2465       | -2.1282         | 0.7419             | 0.2581             |
| 0.8394        | 0.7342 | 500  | 0.5502          | 0.0019           | 226.7985                     | 0.4281                       | 0.6344                      | -0.4041                           | 0.1391                           | -0.6045        | -1.0326          | 0.7453             | 0.4281          | -384.0945    | -634.6824      | -68.2911         | -92.0804           | -2.9733       | -3.1795         | 0.7290             | 0.2701             |
| 0.8895        | 0.8811 | 600  | 0.5823          | 0.0009           | 478.8302                     | 0.4461                       | 0.8791                      | -0.4072                           | 0.1767                           | -0.6675        | -1.1135          | 0.7038             | 0.4461          | -771.9094    | -1274.5287     | -68.2911         | -92.0804           | -3.8786       | -4.7234         | 0.6819             | 0.3176             |

### Framework versions

- Transformers 4.51.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.21.4
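
The logged `Rewards/*` columns follow the usual DPO convention: each reward is beta times the gap between the policy and reference log-probabilities, and the loss is the negative log-sigmoid of the chosen-minus-rejected reward margin. The epsilon-DPO-specific terms (`Epsilon Dpo/*`, `Kl/* Epsilon Steps`) are not documented here, and beta clearly varies over training (0.0513 down to 0.0009 in the table), so a fixed-beta recomputation only roughly reproduces the logged values. A minimal sketch under the standard DPO definitions:

```python
import math

def dpo_rewards_and_loss(beta, logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected):
    """Standard DPO quantities from policy and reference log-probs.

    reward_x = beta * (logp_x - ref_logp_x)
    loss     = -log(sigmoid(reward_chosen - reward_rejected))
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return reward_chosen, reward_rejected, margin, loss

# Final eval-step log-probs from the results table, with the final beta.
# Values come out near, but not equal to, the logged -0.6675 / 0.4461,
# since beta was not constant over the evaluated sequences.
rc, rr, m, loss = dpo_rewards_and_loss(
    0.0009, -771.9094, -1274.5287, -68.2911, -92.0804)
```

The sign pattern matches the card: both rewards are negative (the policy's log-probs drifted far below the reference's), but the chosen reward stays above the rejected one, giving a positive margin and a sub-`log 2` loss.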