ModelHub XC 4ced85fa16 Initialize project; model provided by the ModelHub XC community
Model: W-61/mistral-7b-base-epsilon-dpo-hh-harmless-4xh200-batch-64
Source: Original Platform
2026-04-22 12:12:09 +08:00

  • library_name: transformers
  • base_model: mistral-7b-base-sft-hh-harmless-4xh200-batch-64-20260418-015332
  • tags: alignment-handbook, epsilon-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index name: mistral-7b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-015332; results: (empty)

mistral-7b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-015332

This model is a fine-tuned version of mistral-7b-base-sft-hh-harmless-4xh200-batch-64-20260418-015332 on the Anthropic/hh-rlhf dataset. At the final logged evaluation (step 600, epoch 0.9070), it achieves the following results on the evaluation set:

  • Loss: 0.5935
  • Epsilon Dpo/beta: 0.0044
  • Epsilon Dpo/loss Margin Mean: 65.2805
  • Epsilon Dpo/beta Margin Mean: 0.2810
  • Epsilon Dpo/beta Margin Std: 0.5001
  • Epsilon Dpo/beta Margin Grad Mean: -0.4338
  • Epsilon Dpo/beta Margin Grad Std: 0.1154
  • Rewards/chosen: -0.4127
  • Rewards/rejected: -0.6936
  • Rewards/accuracies: 0.7196
  • Rewards/margins: 0.2810
  • Logps/chosen: -171.7436
  • Logps/rejected: -233.1436
  • Logps/ref Chosen: -77.4087
  • Logps/ref Rejected: -73.5282
  • Logits/chosen: -3.1674
  • Logits/rejected: -2.9705
  • Kl/p Epsilon Steps: 0.7117
  • Kl/n Epsilon Steps: 0.2870
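
As a quick consistency check on the figures above, the reported reward margin should equal the chosen reward minus the rejected reward. The sketch below also compares the reported rewards with the standard DPO implicit reward, beta × (policy log-prob − reference log-prob). Whether this epsilon-DPO trainer logs rewards in exactly that way is an assumption, and the logged beta is rounded and changes during training, so only approximate agreement should be expected.

```python
# Final evaluation metrics copied from the list above.
metrics = {
    "beta": 0.0044,
    "rewards/chosen": -0.4127,
    "rewards/rejected": -0.6936,
    "rewards/margins": 0.2810,
    "logps/chosen": -171.7436,
    "logps/rejected": -233.1436,
    "logps/ref_chosen": -77.4087,
    "logps/ref_rejected": -73.5282,
}


def reward_margin(m):
    """Margin between the chosen and rejected rewards."""
    return m["rewards/chosen"] - m["rewards/rejected"]


def implicit_reward(beta, logp, ref_logp):
    """Standard DPO implicit reward (assumed, not confirmed, for this trainer)."""
    return beta * (logp - ref_logp)


margin = reward_margin(metrics)
est_chosen = implicit_reward(
    metrics["beta"], metrics["logps/chosen"], metrics["logps/ref_chosen"]
)
est_rejected = implicit_reward(
    metrics["beta"], metrics["logps/rejected"], metrics["logps/ref_rejected"]
)

print(f"margin from rewards: {margin:.4f} (reported {metrics['rewards/margins']:.4f})")
print(f"estimated chosen reward: {est_chosen:.4f} (reported {metrics['rewards/chosen']:.4f})")
print(f"estimated rejected reward: {est_rejected:.4f} (reported {metrics['rewards/rejected']:.4f})")
```

The estimates are computed from batch-averaged, rounded values, so small differences from the reported rewards do not by themselves indicate an inconsistency.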

Model description

More information needed

Intended uses & limitations

More information needed
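
The card gives no usage instructions. As a minimal sketch, assuming the checkpoint is reachable under the repository ID shown for this model and follows the standard Transformers causal-LM layout, it could be loaded as below. The `Human:`/`Assistant:` prompt layout is the convention used by the Anthropic/hh-rlhf dataset and is assumed here; the card does not confirm which prompt format was used in training. `build_prompt` and `generate_reply` are hypothetical helper names.

```python
MODEL_ID = "W-61/mistral-7b-base-epsilon-dpo-hh-harmless-4xh200-batch-64"


def build_prompt(user_message: str) -> str:
    """Format a single-turn prompt in the hh-rlhf dialogue style (assumed)."""
    return f"\n\nHuman: {user_message}\n\nAssistant:"


def generate_reply(user_message: str, max_new_tokens: int = 128) -> str:
    """Load the model and generate one reply. Downloads weights on first use."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(user_message), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("How do I boil an egg?")` would run one generation; a 7B model in bfloat16 needs roughly 15 GB of memory for the weights alone.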

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
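
The batch-size totals in the list above follow from the per-device values, and the learning-rate settings describe a linear warmup followed by cosine decay. The sketch below reproduces the arithmetic and a plain-Python version of such a schedule. It is an illustration under assumptions, not the trainer's own code: `cosine_lr_with_warmup` is a hypothetical helper, and `total_steps=1000` is a placeholder, since the card does not state the total number of optimizer steps.

```python
import math

# Values copied from the hyperparameter list above.
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2

# Effective batch sizes: gradient accumulation applies to training only.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = per_device_eval_batch_size * num_devices


def cosine_lr_with_warmup(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)

# Placeholder step count, chosen only to show the shape of the schedule.
total_steps = 1000
for step in (0, 50, 100, 500, 1000):
    print(step, cosine_lr_with_warmup(step, total_steps))
```

With these settings the learning rate starts at zero, reaches 5e-07 at the end of warmup, and decays back to zero by the last step.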

Training results

| Training Loss | Epoch | Step | Validation Loss | Epsilon Dpo/beta | Epsilon Dpo/loss Margin Mean | Epsilon Dpo/beta Margin Mean | Epsilon Dpo/beta Margin Std | Epsilon Dpo/beta Margin Grad Mean | Epsilon Dpo/beta Margin Grad Std | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9822 | 0.1512 | 100 | 0.6127 | 0.0682 | 6.4966 | 0.4365 | 1.0172 | -0.4126 | 0.1971 | -0.5724 | -1.0089 | 0.6646 | 0.4365 | -85.7551 | -88.3712 | -77.4087 | -73.5282 | -3.5599 | -3.5776 | 0.6721 | 0.3275 |
| 0.8262 | 0.3023 | 200 | 0.5655 | 0.0393 | 16.0515 | 0.6241 | 1.1422 | -0.3816 | 0.2119 | -0.8102 | -1.4343 | 0.7007 | 0.6241 | -97.9143 | -110.0853 | -77.4087 | -73.5282 | -3.5815 | -3.5644 | 0.7073 | 0.2918 |
| 0.949 | 0.4535 | 300 | 0.5616 | 0.0224 | 23.5752 | 0.5225 | 0.9256 | -0.3915 | 0.1875 | -0.6680 | -1.1905 | 0.7121 | 0.5225 | -107.0630 | -126.7577 | -77.4087 | -73.5282 | -3.5920 | -3.5350 | 0.7069 | 0.2918 |
| 0.9633 | 0.6047 | 400 | 0.5351 | 0.0130 | 47.5045 | 0.6097 | 0.9587 | -0.3752 | 0.1913 | -0.7854 | -1.3951 | 0.7267 | 0.6097 | -137.7568 | -181.3808 | -77.4087 | -73.5282 | -3.4438 | -3.3090 | 0.7284 | 0.2698 |
| 1.0329 | 0.7559 | 500 | 0.5608 | 0.0073 | 59.4602 | 0.4316 | 0.7150 | -0.4038 | 0.1553 | -0.5800 | -1.0116 | 0.7258 | 0.4316 | -156.1025 | -211.6822 | -77.4087 | -73.5282 | -3.2563 | -3.0824 | 0.7192 | 0.2790 |
| 1.0833 | 0.9070 | 600 | 0.5935 | 0.0044 | 65.2805 | 0.2810 | 0.5001 | -0.4338 | 0.1154 | -0.4127 | -0.6936 | 0.7196 | 0.2810 | -171.7436 | -233.1436 | -77.4087 | -73.5282 | -3.1674 | -2.9705 | 0.7117 | 0.2870 |
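
The training results are easier to compare across checkpoints when loaded into a small structure. The sketch below copies a subset of columns from the rows above and picks out the evaluation step with the lowest validation loss and the one with the highest reward accuracy. The `EvalRow` type and variable names are illustrative, not part of the training code.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    step: int
    epoch: float
    train_loss: float
    val_loss: float
    beta: float
    accuracy: float


# Subset of columns copied from the training results above.
rows = [
    EvalRow(100, 0.1512, 0.9822, 0.6127, 0.0682, 0.6646),
    EvalRow(200, 0.3023, 0.8262, 0.5655, 0.0393, 0.7007),
    EvalRow(300, 0.4535, 0.9490, 0.5616, 0.0224, 0.7121),
    EvalRow(400, 0.6047, 0.9633, 0.5351, 0.0130, 0.7267),
    EvalRow(500, 0.7559, 1.0329, 0.5608, 0.0073, 0.7258),
    EvalRow(600, 0.9070, 1.0833, 0.5935, 0.0044, 0.7196),
]

best_loss_row = min(rows, key=lambda r: r.val_loss)
best_acc_row = max(rows, key=lambda r: r.accuracy)

# True if beta shrinks at every logged evaluation.
beta_decreasing = all(a.beta > b.beta for a, b in zip(rows, rows[1:]))

print("lowest validation loss:", best_loss_row.val_loss, "at step", best_loss_row.step)
print("highest reward accuracy:", best_acc_row.accuracy, "at step", best_acc_row.step)
print("beta decreases at every evaluation:", beta_decreasing)
```

On these rows, both the lowest validation loss and the highest reward accuracy fall at step 400, while the headline metrics on this card come from step 600.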

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4
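
To reproduce results it can help to compare a local environment against the versions listed above. The sketch below uses the standard-library `importlib.metadata` module. The distribution names in `EXPECTED` are assumptions about how the packages are installed (for example, PyTorch is normally distributed as `torch`), and `find_version_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Versions from the list above, keyed by assumed pip distribution name.
EXPECTED = {
    "transformers": "4.51.0",
    "torch": "2.3.1+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.21.4",
}


def find_version_mismatches(expected):
    """Return {package: installed_version_or_None} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = installed
    return mismatches


for package, installed in find_version_mismatches(EXPECTED).items():
    print(f"{package}: expected {EXPECTED[package]}, found {installed}")
```

An exact string comparison is deliberately strict; a different CUDA build suffix on `torch` will be reported as a mismatch even if the release number is the same.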