ModelHub XC 097e33be47: initial sync; model provided by the ModelHub XC community
Model: W-61/mistral-7b-base-beta-dpo-hh-helpful-4xh200-batch-64
Source: Original Platform
Synced: 2026-04-22 11:00:42 +08:00

library_name: transformers
base_model: mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332
tags: alignment-handbook, beta-dpo, generated_from_trainer
datasets: Anthropic/hh-rlhf
model-index:
  name: mistral-7b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260418-015332
  results:

mistral-7b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260418-015332

This model is a fine-tuned version of mistral-7b-base-sft-hh-helpful-4xh200-batch-64-20260418-015332 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6015
  • Beta Dpo/beta: 0.0010
  • Beta Dpo/loss Margin Mean: 243.4043
  • Beta Dpo/beta Margin Mean: 0.2434
  • Beta Dpo/beta Margin Std: 0.4217
  • Beta Dpo/beta Margin Grad Mean: -0.4422
  • Beta Dpo/beta Margin Grad Std: 0.0983
  • Beta Dpo/gap Mean: 404.4037
  • Beta Dpo/gap Std: 357.4069
  • Beta Dpo/beta Used Raw: -9.5600
  • Beta Dpo/beta Used: 0.0010
  • Beta Dpo/mask Keep Frac: 1.0
  • Logits/chosen: -2.7813
  • Logits/rejected: -2.8108
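
The beta-dpo tag and the Beta Dpo/* metrics indicate a DPO run in which the KL-penalty coefficient beta was adjusted during training (Beta Dpo/beta Used decays from 0.0211 to a 0.0010 floor in the results table below). The adjustment rule itself is not documented in this card; the sketch below shows only the standard DPO objective that such a beta feeds into, with illustrative function and argument names.

```python
# Minimal sketch of the DPO objective with beta as an explicit input, assuming
# precomputed per-sequence log-probs; function and argument names are
# illustrative, not the actual training code behind this card.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float) -> torch.Tensor:
    # Implicit rewards: log-prob ratios of the policy vs. the frozen reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Margin between chosen and rejected implicit rewards; beta scales how
    # sharply the loss pushes the margin apart (it ended at 0.0010 here).
    margin = chosen_rewards - rejected_rewards
    return -F.logsigmoid(beta * margin).mean()
```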

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
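
The batch sizes above are consistent: 8 per device × 4 GPUs × 2 gradient-accumulation steps = 64 effective for training, and 8 × 4 = 32 for evaluation. A hedged reconstruction of this configuration with transformers.TrainingArguments follows; the actual launcher script is not part of this card.

```python
# Sketch of the listed hyperparameters as a transformers.TrainingArguments;
# the output_dir is taken from the model name and is illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mistral-7b-base-beta-dpo-hh-helpful-4xh200-batch-64",
    learning_rate=5e-7,
    per_device_train_batch_size=8,   # x 4 GPUs x 2 accumulation steps = 64 effective
    per_device_eval_batch_size=8,    # x 4 GPUs = 32 effective
    gradient_accumulation_steps=2,
    seed=42,
    optim="adamw_torch",             # betas=(0.9, 0.999), eps=1e-8 are the defaults
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```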

Training results

| Training Loss | Epoch | Step | Validation Loss | Beta Dpo/beta | Beta Dpo/loss Margin Mean | Beta Dpo/beta Margin Mean | Beta Dpo/beta Margin Std | Beta Dpo/beta Margin Grad Mean | Beta Dpo/beta Margin Grad Std | Beta Dpo/gap Mean | Beta Dpo/gap Std | Beta Dpo/beta Used Raw | Beta Dpo/beta Used | Beta Dpo/mask Keep Frac | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3346 | 0.1468 | 100 | 0.7825 | 0.0211 | 38.6966 | 1.4685 | 2.0475 | -0.4727 | 0.0403 | 60.6513 | 63.8526 | -1.2173 | 0.0211 | 1.0 | -2.9129 | -2.9033 |
| 1.265 | 0.2937 | 200 | 1.2116 | 0.0416 | 108.9061 | 8.0746 | 10.4591 | -0.4594 | 0.0608 | 175.9197 | 183.7102 | -3.9208 | 0.0416 | 1.0 | -2.3116 | -2.3059 |
| 0.5857 | 0.4405 | 300 | 0.6708 | 0.0032 | 165.3890 | 0.8039 | 1.0106 | -0.4553 | 0.0715 | 284.4015 | 265.4041 | -7.0408 | 0.0032 | 1.0 | -2.3951 | -2.3756 |
| 3.7878 | 0.5874 | 400 | 0.6122 | 0.0010 | 205.4126 | 0.2054 | 0.3571 | -0.4506 | 0.0845 | 362.1024 | 333.2912 | -9.3014 | 0.0010 | 1.0 | -2.4431 | -2.4332 |
| 6.7444 | 0.7342 | 500 | 0.6026 | 0.0010 | 233.9227 | 0.2339 | 0.3910 | -0.4441 | 0.0919 | 390.5113 | 345.8571 | -9.2953 | 0.0010 | 1.0 | -2.6421 | -2.6564 |
| 0.5388 | 0.8811 | 600 | 0.6015 | 0.0010 | 243.4043 | 0.2434 | 0.4217 | -0.4422 | 0.0983 | 404.4037 | 357.4069 | -9.5600 | 0.0010 | 1.0 | -2.7813 | -2.8108 |

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4
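
A minimal inference sketch with the versions above, assuming the checkpoint is published under the source repo id from this card; the prompt and generation settings are illustrative.

```python
# Load and query the model with plain transformers; the repo id is the synced
# source path shown in this card, and the prompt/decoding choices are examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/mistral-7b-base-beta-dpo-hh-helpful-4xh200-batch-64"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "How do I write a polite follow-up email?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```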