ModelHub XC eb326fd6d8 Initialize project; model provided by the ModelHub XC community
Model: W-61/mistral-7b-base-beta-dpo-hh-harmless-4xh200-batch-64
Source: Original Platform
2026-05-30 22:35:26 +08:00

  • library_name: transformers
  • base_model: mistral-7b-base-sft-hh-harmless-4xh200-batch-64-20260418-015332
  • tags: alignment-handbook, beta-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index name: mistral-7b-base-beta-dpo-hh-harmless-4xh200-batch-64-20260418-015332
  • model-index results: none listed

mistral-7b-base-beta-dpo-hh-harmless-4xh200-batch-64-20260418-015332

This model is a fine-tuned version of mistral-7b-base-sft-hh-harmless-4xh200-batch-64-20260418-015332 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 1.2926
  • Beta Dpo/beta: 0.1601
  • Beta Dpo/loss Margin Mean: 39.2063
  • Beta Dpo/beta Margin Mean: 8.1315
  • Beta Dpo/beta Margin Std: 10.1723
  • Beta Dpo/beta Margin Grad Mean: -0.3823
  • Beta Dpo/beta Margin Grad Std: 0.1665
  • Beta Dpo/gap Mean: 43.6826
  • Beta Dpo/gap Std: 65.1157
  • Beta Dpo/beta Used Raw: -0.1686
  • Beta Dpo/beta Used: 0.1601
  • Beta Dpo/mask Keep Frac: 1.0
  • Logits/chosen: -2.7393
  • Logits/rejected: -2.7541
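
As a reading aid for the loss and margin figures above, here is a minimal sketch of the standard per-pair DPO loss, -log(sigmoid(beta * margin)), where the margin is the policy-versus-reference log-ratio of the chosen response minus that of the rejected one. This is an assumption about the general shape of the objective only: the beta-dpo trainer used here evidently adjusts beta during training (see Beta Dpo/beta Used Raw and Beta Dpo/beta Used), and this card does not document that rule. The function name `dpo_loss` is hypothetical.

```python
import math


def dpo_loss(beta: float, margin: float) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    `margin` is assumed to be
    (log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected)).
    This is NOT the card's beta-dpo variant, whose beta update is undocumented here.
    """
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z); branch on the sign for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# A zero margin (policy cannot separate chosen from rejected) gives log(2).
baseline = dpo_loss(beta=0.1, margin=0.0)
print(baseline)
```

Under this standard form, a per-pair loss of log(2) ≈ 0.693 corresponds to a zero margin, and the loss falls as the margin grows. Because a mean loss averages over pairs with widely varying margins, it can exceed log(2) even when the mean margin is positive.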

Model description

More information needed

Intended uses & limitations

More information needed
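
Pending that information, the following is a minimal loading sketch rather than documented usage. It assumes the repository id from the header resolves through the Transformers `from_pretrained` API (on the Hugging Face Hub or as a local path), and that prompts follow the "Human:/Assistant:" turn format of the Anthropic/hh-rlhf data; the card confirms neither. The helper names `build_prompt` and `generate_reply` are hypothetical.

```python
MODEL_ID = "W-61/mistral-7b-base-beta-dpo-hh-harmless-4xh200-batch-64"


def build_prompt(user_message: str) -> str:
    """Format one user turn in the Anthropic/hh-rlhf dialogue style (assumed)."""
    return f"\n\nHuman: {user_message}\n\nAssistant:"


def generate_reply(user_message: str, max_new_tokens: int = 128) -> str:
    """Download the checkpoint and generate a reply (needs transformers and torch)."""
    # Imported lazily so the prompt helper stays usable without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(user_message), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_prompt("How do I reset my router?")
print(repr(prompt))
```

If the tokenizer ships its own chat template, that template is probably the better choice than the hand-built prompt above.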

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
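
The batch-size totals and the learning-rate schedule listed above can be sketched as follows. The effective batch size is the per-device size times the device count times the accumulation steps (accumulation does not apply to evaluation). The schedule function is an approximation of linear warmup followed by half-cosine decay, which is how I understand the `cosine` scheduler to behave; the function names are hypothetical, and the total of 661 optimizer steps is only an estimate inferred from the results table (step 100 falls at epoch 0.1512), not a reported value.

```python
import math


def effective_batch_size(per_device: int, num_devices: int, grad_accum_steps: int = 1) -> int:
    """Examples consumed per optimizer step (or per eval step when accumulation is 1)."""
    return per_device * num_devices * grad_accum_steps


def lr_at_step(step: int, total_steps: int, base_lr: float = 5e-07,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero (assumed schedule shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


train_total = effective_batch_size(8, 4, 2)  # train_batch_size * num_devices * accumulation
eval_total = effective_batch_size(8, 4)      # eval_batch_size * num_devices

TOTAL_STEPS = 661  # assumption: estimated from the results table, not stated in the card
for step in (0, 67, 300, 600, TOTAL_STEPS):
    print(step, lr_at_step(step, TOTAL_STEPS))
print(train_total, eval_total)
```

With these inputs the two totals match the reported total_train_batch_size of 64 and total_eval_batch_size of 32.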

Training results

| Training Loss | Epoch | Step | Validation Loss | Beta Dpo/beta | Beta Dpo/loss Margin Mean | Beta Dpo/beta Margin Mean | Beta Dpo/beta Margin Std | Beta Dpo/beta Margin Grad Mean | Beta Dpo/beta Margin Grad Std | Beta Dpo/gap Mean | Beta Dpo/gap Std | Beta Dpo/beta Used Raw | Beta Dpo/beta Used | Beta Dpo/mask Keep Frac | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9953 | 0.1512 | 100 | 0.6485 | 0.0613 | 9.1290 | 0.7724 | 1.1312 | -0.4201 | 0.1227 | 11.2317 | 17.3662 | -0.0262 | 0.0613 | 1.0 | -3.3065 | -3.3155 |
| 1.0952 | 0.3023 | 200 | 0.6746 | 0.0160 | 14.8251 | 0.3434 | 0.4801 | -0.4659 | 0.0456 | 21.0985 | 27.0643 | -0.2764 | 0.0160 | 1.0 | -3.2893 | -3.2944 |
| 0.957 | 0.4535 | 300 | 0.6778 | 0.0089 | 18.5486 | 0.2473 | 0.3133 | -0.4818 | 0.0249 | 27.7148 | 34.2356 | -0.4500 | 0.0089 | 1.0 | -3.1462 | -3.1521 |
| 1.3632 | 0.6047 | 400 | 0.7275 | 0.0294 | 21.1309 | 0.9101 | 1.1957 | -0.4574 | 0.0566 | 28.7664 | 38.3785 | -0.3581 | 0.0294 | 1.0 | -3.0744 | -3.0811 |
| 1.2415 | 0.7559 | 500 | 0.8583 | 0.0607 | 35.3740 | 2.9334 | 3.5974 | -0.4353 | 0.0818 | 44.1570 | 58.8353 | -0.4270 | 0.0607 | 1.0 | -2.7623 | -2.7786 |
| 2.3016 | 0.9070 | 600 | 1.2926 | 0.1601 | 39.2063 | 8.1315 | 10.1723 | -0.3823 | 0.1665 | 43.6826 | 65.1157 | -0.1686 | 0.1601 | 1.0 | -2.7393 | -2.7541 |
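
To make the trend in the results above explicit, the short sketch below copies the step and validation-loss columns and checks which evaluation had the lowest loss and whether the loss rose at every subsequent evaluation. The variable names are illustrative; the numbers are taken directly from the table.

```python
# (step, validation loss) pairs copied from the training results.
eval_log = [
    (100, 0.6485),
    (200, 0.6746),
    (300, 0.6778),
    (400, 0.7275),
    (500, 0.8583),
    (600, 1.2926),
]

# Evaluation with the lowest validation loss.
best_step, best_loss = min(eval_log, key=lambda row: row[1])

# True when every evaluation is worse than the one before it.
losses = [loss for _, loss in eval_log]
strictly_rising = all(a < b for a, b in zip(losses, losses[1:]))

print(best_step, best_loss, strictly_rising)
```

By this metric alone the step-100 evaluation is the best and the reported final checkpoint the worst. That comparison should be read with care, since beta changes between evaluations (0.0613 at step 100, 0.1601 at step 600) and the loss values may not be directly comparable across steps.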

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4