ModelHub XC a3cf714e1e Initialize project; model provided by the ModelHub XC community
Model: W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920
Source: Original Platform
2026-04-25 06:50:01 +08:00

library_name: transformers
base_model: llama-3-8b-base-sft-hh-helpful-4xh200-batch-64
tags: alignment-handbook, epsilon-dpo, generated_from_trainer
datasets: Anthropic/hh-rlhf
model-index: name: llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920; results: none listed

llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920

This model is a fine-tuned version of llama-3-8b-base-sft-hh-helpful-4xh200-batch-64 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set at the final logged evaluation (step 600, epoch 0.8811):

  • Loss: 0.5941
  • Epsilon Dpo/beta: 0.0022
  • Epsilon Dpo/loss Margin Mean: 124.0911
  • Epsilon Dpo/beta Margin Mean: 0.2718
  • Epsilon Dpo/beta Margin Std: 0.4719
  • Epsilon Dpo/beta Margin Grad Mean: -0.4365
  • Epsilon Dpo/beta Margin Grad Std: 0.1083
  • Rewards/chosen: -0.9073
  • Rewards/rejected: -1.1791
  • Rewards/accuracies: 0.7183
  • Rewards/margins: 0.2718
  • Logps/chosen: -487.8939
  • Logps/rejected: -619.7319
  • Logps/ref Chosen: -79.0510
  • Logps/ref Rejected: -86.7979
  • Logits/chosen: 0.9628
  • Logits/rejected: 1.3823
  • Kl/p Epsilon Steps: 0.6956
  • Kl/n Epsilon Steps: 0.3022
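
The reward figures above can be cross-checked against one another. The sketch below assumes the standard DPO convention, in which the implicit reward is beta times the policy-to-reference log-probability ratio, and assumes that "loss Margin Mean" is the chosen-minus-rejected difference of those log-ratios. The card does not define how each metric is computed, and the variable names are illustrative.

```python
# Final-evaluation values copied from the list above.
logps_chosen, logps_rejected = -487.8939, -619.7319
ref_chosen, ref_rejected = -79.0510, -86.7979
rewards_chosen, rewards_rejected = -0.9073, -1.1791
beta = 0.0022  # reported value, rounded to four decimals

# Reward margin: chosen reward minus rejected reward.
reward_margin = rewards_chosen - rewards_rejected

# Log-ratio margin (ASSUMED to be what "loss Margin Mean" reports):
# (policy - reference) for chosen, minus the same quantity for rejected.
log_ratio_margin = (logps_chosen - ref_chosen) - (logps_rejected - ref_rejected)

# Under the assumed DPO convention, beta * log-ratio margin approximates the
# reward margin. Agreement is only approximate because beta is rounded and
# every reported metric is an average over evaluation batches.
approx_reward_margin = beta * log_ratio_margin

print(f"reward margin:    {reward_margin:.4f}")
print(f"log-ratio margin: {log_ratio_margin:.4f}")
print(f"beta * log-ratio: {approx_reward_margin:.4f}")
```

With these inputs the first two values reproduce the reported Rewards/margins (0.2718) and loss Margin Mean (124.0911), while the third comes out near 0.273, which is consistent with the reported margin up to rounding of beta.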

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
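
The batch-size and scheduler settings above can be sketched as follows. The effective batch size is the product of the per-device batch size, the device count, and the gradient accumulation steps. The schedule function is a minimal sketch of linear warmup followed by cosine decay to zero, which is how a cosine scheduler with a warmup ratio is commonly implemented; it is not taken from the training code, and `total_steps` is a hypothetical value because the card does not state the total number of optimizer steps.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 8             # per device
eval_batch_size = 8              # per device
num_devices = 4
gradient_accumulation_steps = 2
learning_rate = 5e-07
warmup_ratio = 0.1

# Effective batch sizes; these match the reported totals of 64 and 32.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices


def cosine_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


# HYPOTHETICAL step count, for illustration only.
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps))
```

Under this sketch the learning rate peaks at 5e-07 once warmup ends (10% of the run) and decays smoothly to zero at the final step.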

Training results

| Training Loss | Epoch | Step | Validation Loss | Epsilon Dpo/beta | Epsilon Dpo/loss Margin Mean | Epsilon Dpo/beta Margin Mean | Epsilon Dpo/beta Margin Std | Epsilon Dpo/beta Margin Grad Mean | Epsilon Dpo/beta Margin Grad Std | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
| 0.9809 | 0.1468 | 100 | 0.5819 | 0.0612 | 10.0016 | 0.6045 | 1.2197 | -0.3930 | 0.2078 | -0.7143 | -1.3188 | 0.6926 | 0.6045 | -90.6611 | -108.4097 | -79.0510 | -86.7979 | -0.9442 | -0.8901 | 0.6370 | 0.3617 |
| 0.7763 | 0.2937 | 200 | 0.5340 | 0.0331 | 25.3103 | 0.8303 | 1.4036 | -0.3643 | 0.2158 | -1.1029 | -1.9332 | 0.7149 | 0.8303 | -112.2424 | -145.2996 | -79.0510 | -86.7979 | -0.9108 | -0.8737 | 0.6759 | 0.3232 |
| 0.7955 | 0.4405 | 300 | 0.4928 | 0.0169 | 54.8226 | 0.9212 | 1.3268 | -0.3414 | 0.2134 | -1.7179 | -2.6391 | 0.7453 | 0.9212 | -180.2318 | -242.8013 | -79.0510 | -86.7979 | -0.4338 | -0.2913 | 0.7282 | 0.2710 |
| 0.6919 | 0.5874 | 400 | 0.5234 | 0.0084 | 77.8526 | 0.6459 | 0.9831 | -0.3725 | 0.1852 | -1.3445 | -1.9904 | 0.7333 | 0.6459 | -239.2477 | -324.8472 | -79.0510 | -86.7979 | 0.0554 | 0.2763 | 0.7183 | 0.2808 |
| 0.9214 | 0.7342 | 500 | 0.5470 | 0.0042 | 108.2775 | 0.4514 | 0.6818 | -0.4010 | 0.1444 | -1.4275 | -1.8789 | 0.7487 | 0.4514 | -417.7782 | -533.8026 | -79.0510 | -86.7979 | 0.5317 | 0.8735 | 0.7235 | 0.2753 |
| 1.0729 | 0.8811 | 600 | 0.5941 | 0.0022 | 124.0911 | 0.2718 | 0.4719 | -0.4365 | 0.1083 | -0.9073 | -1.1791 | 0.7183 | 0.2718 | -487.8939 | -619.7319 | -79.0510 | -86.7979 | 0.9628 | 1.3823 | 0.6956 | 0.3022 |
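
To see where in training the evaluation metrics peak, the rows above can be scanned programmatically. The sketch below copies three columns from the table (step, validation loss, and Rewards/accuracies) and reports the step with the lowest validation loss and the step with the highest reward accuracy. The tuple layout and variable names are illustrative only.

```python
# (step, validation loss, Rewards/accuracies) copied from the table above.
evals = [
    (100, 0.5819, 0.6926),
    (200, 0.5340, 0.7149),
    (300, 0.4928, 0.7453),
    (400, 0.5234, 0.7333),
    (500, 0.5470, 0.7487),
    (600, 0.5941, 0.7183),
]

# Row with the lowest validation loss and row with the highest accuracy.
best_loss = min(evals, key=lambda row: row[1])
best_acc = max(evals, key=lambda row: row[2])

print(f"lowest validation loss: {best_loss[1]} at step {best_loss[0]}")
print(f"highest reward accuracy: {best_acc[2]} at step {best_acc[0]}")
```

Validation loss is lowest at step 300 (0.4928) and reward accuracy is highest at step 500 (0.7487); by step 600 the validation loss has risen to 0.5941 and the reward margin has shrunk to 0.2718.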

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4
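
When trying to reproduce this environment, the installed package versions can be compared with the ones listed above. This is a minimal sketch using only the standard library; it assumes that "Pytorch" corresponds to the `torch` distribution, and the helper name `compare_versions` is hypothetical.

```python
from importlib import metadata

# Versions listed above, keyed by distribution name.
# ASSUMPTION: "Pytorch" refers to the `torch` distribution.
EXPECTED = {
    "transformers": "4.51.0",
    "torch": "2.3.1+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.21.4",
}


def compare_versions(expected: dict) -> dict:
    """Return a status string per package: match, differs, or missing."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "match"
        else:
            report[name] = f"differs (installed {installed})"
    return report


if __name__ == "__main__":
    for name, status in compare_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

A version mismatch does not necessarily prevent the model from loading; the check only flags where the local environment departs from the one recorded in this card.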