ModelHub XC 121c6f2962 Initialize project; model provided by the ModelHub XC community
Model: W-61/llama-3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-003215
Source: Original Platform
2026-06-13 14:08:31 +08:00

library_name: transformers
base_model: llama-3-8b-base-sft-hh-harmless-4xh200-batch-64
tags: alignment-handbook, epsilon-dpo, generated_from_trainer
datasets: Anthropic/hh-rlhf
model-index name: llama-3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-003215
model-index results: (empty)

llama-3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-003215

This model is a fine-tuned version of llama-3-8b-base-sft-hh-harmless-4xh200-batch-64 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5778
  • Epsilon Dpo/beta: 0.0075
  • Epsilon Dpo/loss Margin Mean: 44.6202
  • Epsilon Dpo/beta Margin Mean: 0.3297
  • Epsilon Dpo/beta Margin Std: 0.5411
  • Epsilon Dpo/beta Margin Grad Mean: -0.4237
  • Epsilon Dpo/beta Margin Grad Std: 0.1239
  • Rewards/chosen: -0.6980
  • Rewards/rejected: -1.0277
  • Rewards/accuracies: 0.7192
  • Rewards/margins: 0.3297
  • Logps/chosen: -168.1115
  • Logps/rejected: -217.4212
  • Logps/ref Chosen: -74.8595
  • Logps/ref Rejected: -79.5490
  • Logits/chosen: 0.0396
  • Logits/rejected: -0.0641
  • Kl/p Epsilon Steps: 0.7196
  • Kl/n Epsilon Steps: 0.2799
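
The reward figures above can be cross-checked against each other. The sketch below recomputes the margin as chosen minus rejected, and then, assuming the rewards follow the standard DPO implicit-reward definition `beta * (logp - ref_logp)` (an assumption; this card does not state the formula), approximates both rewards from the reported log-probabilities and the final beta. The dictionary keys and function names are illustrative only.

```python
# Final evaluation metrics copied from the list above.
metrics = {
    "rewards/chosen": -0.6980,
    "rewards/rejected": -1.0277,
    "rewards/margins": 0.3297,
    "beta": 0.0075,
    "logps/chosen": -168.1115,
    "logps/rejected": -217.4212,
    "logps/ref_chosen": -74.8595,
    "logps/ref_rejected": -79.5490,
}


def reward_margin(m):
    """Margin between the chosen and rejected rewards."""
    return m["rewards/chosen"] - m["rewards/rejected"]


def implicit_reward(beta, logp, ref_logp):
    """Standard DPO implicit reward (assumed here, not confirmed by the card)."""
    return beta * (logp - ref_logp)


margin = reward_margin(metrics)
approx_chosen = implicit_reward(
    metrics["beta"], metrics["logps/chosen"], metrics["logps/ref_chosen"]
)
approx_rejected = implicit_reward(
    metrics["beta"], metrics["logps/rejected"], metrics["logps/ref_rejected"]
)

print(f"margin from rewards:   {margin:.4f}")
print(f"approx chosen reward:  {approx_chosen:.4f}")
print(f"approx rejected reward: {approx_rejected:.4f}")
```

The recomputed margin matches the reported Rewards/margins value. The approximated rewards land within about 0.01 of the reported ones; the gap may come from averaging order or from beta changing during training, but the card does not say.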

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
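
The total batch sizes above follow from the per-device values, and the learning-rate settings describe a warmup followed by cosine decay. The sketch below reproduces that arithmetic. The schedule function is a generic linear-warmup plus cosine-decay formula written for illustration, not code taken from the training run, and `estimated_total_steps = 661` is a hypothetical value inferred from the step-to-epoch ratio in the training results (100 / 0.1512), not a number stated in this card.

```python
import math

# Values copied from the hyperparameter list above.
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2
peak_learning_rate = 5e-07
warmup_ratio = 0.1

# Effective batch sizes: accumulation applies to training only.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = per_device_eval_batch_size * num_devices


def cosine_lr(step, total_steps, peak_lr=peak_learning_rate, ratio=warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative)."""
    warmup_steps = math.ceil(total_steps * ratio)  # rounding rule is an assumption
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical total step count, estimated rather than reported.
estimated_total_steps = 661

print(total_train_batch_size, total_eval_batch_size)
for step in (0, 100, 300, 600):
    print(step, cosine_lr(step, estimated_total_steps))
```

Under these assumptions the learning rate peaks at 5e-07 once warmup ends and decays toward zero by the final step.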

Training results

| Training Loss | Epoch | Step | Validation Loss | Epsilon Dpo/beta | Epsilon Dpo/loss Margin Mean | Epsilon Dpo/beta Margin Mean | Epsilon Dpo/beta Margin Std | Epsilon Dpo/beta Margin Grad Mean | Epsilon Dpo/beta Margin Grad Std | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3305 | 0.1512 | 100 | 0.6591 | 0.0821 | 1.4524 | 0.1163 | 0.4195 | -0.4725 | 0.0969 | -0.4073 | -0.5236 | 0.6298 | 0.1163 | -79.8024 | -85.9443 | -74.8595 | -79.5490 | -0.2133 | -0.2955 | 0.6109 | 0.3882 |
| 0.9619 | 0.3023 | 200 | 0.5464 | 0.0537 | 12.5090 | 0.6642 | 1.1185 | -0.3726 | 0.2095 | -1.0244 | -1.6886 | 0.7183 | 0.6642 | -93.8833 | -111.0817 | -74.8595 | -79.5490 | -0.1899 | -0.3034 | 0.7240 | 0.2751 |
| 1.0235 | 0.4535 | 300 | 0.5323 | 0.0324 | 22.9568 | 0.7358 | 1.1722 | -0.3618 | 0.2142 | -1.6265 | -2.3623 | 0.7284 | 0.7358 | -124.9662 | -152.6126 | -74.8595 | -79.5490 | -0.0481 | -0.1742 | 0.7293 | 0.2698 |
| 1.132 | 0.6047 | 400 | 0.5402 | 0.0198 | 30.5708 | 0.5994 | 0.9731 | -0.3799 | 0.1895 | -1.2023 | -1.8017 | 0.7302 | 0.5994 | -135.4181 | -170.6785 | -74.8595 | -79.5490 | -0.0432 | -0.1614 | 0.7293 | 0.2698 |
| 1.0834 | 0.7559 | 500 | 0.5507 | 0.0121 | 40.9401 | 0.4894 | 0.7940 | -0.3952 | 0.1670 | -1.0111 | -1.5005 | 0.7210 | 0.4894 | -158.3591 | -203.9887 | -74.8595 | -79.5490 | 0.0215 | -0.0872 | 0.7227 | 0.2764 |
| 1.1666 | 0.9070 | 600 | 0.5778 | 0.0075 | 44.6202 | 0.3297 | 0.5411 | -0.4237 | 0.1239 | -0.6980 | -1.0277 | 0.7192 | 0.3297 | -168.1115 | -217.4212 | -74.8595 | -79.5490 | 0.0396 | -0.0641 | 0.7196 | 0.2799 |
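
The evaluation rows above can be compared programmatically to see where each metric peaks. The sketch below copies four columns from the table (step, validation loss, reward accuracy, reward margin) and picks the best step for each. The `EvalRow` type and variable names are illustrative; the card itself does not say which criterion, if any, was used to select a checkpoint.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    step: int
    validation_loss: float
    reward_accuracy: float
    reward_margin: float


# Values copied from the training results table above.
rows = [
    EvalRow(100, 0.6591, 0.6298, 0.1163),
    EvalRow(200, 0.5464, 0.7183, 0.6642),
    EvalRow(300, 0.5323, 0.7284, 0.7358),
    EvalRow(400, 0.5402, 0.7302, 0.5994),
    EvalRow(500, 0.5507, 0.7210, 0.4894),
    EvalRow(600, 0.5778, 0.7192, 0.3297),
]

best_loss = min(rows, key=lambda r: r.validation_loss)
best_accuracy = max(rows, key=lambda r: r.reward_accuracy)
best_margin = max(rows, key=lambda r: r.reward_margin)

print(f"lowest validation loss: step {best_loss.step} ({best_loss.validation_loss})")
print(f"highest accuracy:       step {best_accuracy.step} ({best_accuracy.reward_accuracy})")
print(f"largest reward margin:  step {best_margin.step} ({best_margin.reward_margin})")
```

In this table the validation loss and reward margin are best at step 300 and accuracy is best at step 400, while the headline results at the top of the card correspond to the final evaluation at step 600.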

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4
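
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: the distribution names (in particular `torch` for Pytorch) are assumptions about how the packages are installed, and the helper names are illustrative.

```python
from importlib import metadata

# Versions copied from the list above; distribution names are assumed.
PINNED = {
    "transformers": "4.51.0",
    "torch": "2.3.1+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.21.4",
}


def installed_versions(packages):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # package is not installed
    return found


def version_mismatches(pinned, installed):
    """Return {name: (expected, found)} for every package that differs."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


if __name__ == "__main__":
    for name, (expected, found) in version_mismatches(
        PINNED, installed_versions(PINNED)
    ).items():
        print(f"{name}: expected {expected}, found {found}")
```

A mismatch does not necessarily mean the model will fail to load; it only flags where the environment differs from the one recorded in this card.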