ModelHub XC 8162a32e06 Initialize project; model provided by the ModelHub XC community
Model: W-61/llama-3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64-20260417-233539
Source: Original Platform
2026-05-30 22:45:37 +08:00

  • library_name: transformers
  • base_model: llama-3-8b-base-sft-hh-harmless-4xh200-batch-64
  • tags: alignment-handbook, beta-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index: name: llama-3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64-20260417-233539; results: (none listed)

llama-3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64-20260417-233539

This model is a fine-tuned version of llama-3-8b-base-sft-hh-harmless-4xh200-batch-64 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 0.8203
  • Beta Dpo/beta: 0.1705
  • Beta Dpo/loss Margin Mean: 16.6192
  • Beta Dpo/beta Margin Mean: 3.5151
  • Beta Dpo/beta Margin Std: 4.7567
  • Beta Dpo/beta Margin Grad Mean: -0.3392
  • Beta Dpo/beta Margin Grad Std: 0.2229
  • Beta Dpo/gap Mean: 16.4768
  • Beta Dpo/gap Std: 28.4131
  • Beta Dpo/beta Used Raw: 0.1085
  • Beta Dpo/beta Used: 0.1705
  • Beta Dpo/mask Keep Frac: 1.0
  • Logits/chosen: 0.5021
  • Logits/rejected: 0.4487
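
The card does not include usage instructions. As a minimal sketch, the checkpoint could be loaded with the standard `transformers` causal-LM classes. This assumes the repository ID shown in the page header resolves on the hub you are using; the helper names `load_model` and `generate` are hypothetical, and the prompt format the model expects is not stated in this card.

```python
# Repository ID copied from the page header; ASSUMPTION: it resolves on your hub.
MODEL_ID = "W-61/llama-3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64-20260417-233539"


def load_model(model_id=MODEL_ID):
    """Load the tokenizer and model (hypothetical helper, not from the card)."""
    # Imports are local so this module can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # bfloat16 is an assumption to reduce memory; the card does not state a dtype.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return tokenizer, model


def generate(tokenizer, model, prompt, max_new_tokens=64):
    """Generate a continuation for a plain-text prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the 8B checkpoint, so it is left commented out):
# tokenizer, model = load_model()
# print(generate(tokenizer, model, "How do I bake bread?"))
```

Because the prompt format is not documented here, it may be worth checking the tokenizer's chat template, if one is present, before relying on plain-text prompts.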

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
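
The hyperparameters above imply two derived quantities: the effective train batch size and the learning-rate curve. The sketch below recomputes both. The schedule follows the usual linear-warmup-then-cosine-decay shape; the total step count is not stated in the card and is an assumption, estimated from the results table, where step 100 corresponds to epoch 0.1512.

```python
import math

# Values copied from the hyperparameter list.
LEARNING_RATE = 5e-07
TRAIN_BATCH_SIZE = 8   # per device
NUM_DEVICES = 4
GRAD_ACCUM_STEPS = 2
WARMUP_RATIO = 0.1

# ASSUMPTION: not stated in the card. Estimated as 100 / 0.1512, about 661.
TOTAL_STEPS = 661


def effective_batch_size(per_device, devices, accum_steps):
    """Samples consumed per optimizer step across all devices."""
    return per_device * devices * accum_steps


def cosine_lr(step, total_steps=TOTAL_STEPS, peak=LEARNING_RATE,
              warmup_ratio=WARMUP_RATIO):
    """Linear warmup to `peak`, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(TRAIN_BATCH_SIZE, NUM_DEVICES, GRAD_ACCUM_STEPS))
for step in (0, 100, 300, 600):
    print(step, cosine_lr(step))
```

The first printed value matches the reported total_train_batch_size of 64. The same product without gradient accumulation (8 × 4) gives the reported total_eval_batch_size of 32.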

Training results

| Training Loss | Epoch | Step | Validation Loss | Beta Dpo/beta | Beta Dpo/loss Margin Mean | Beta Dpo/beta Margin Mean | Beta Dpo/beta Margin Std | Beta Dpo/beta Margin Grad Mean | Beta Dpo/beta Margin Grad Std | Beta Dpo/gap Mean | Beta Dpo/gap Std | Beta Dpo/beta Used Raw | Beta Dpo/beta Used | Beta Dpo/mask Keep Frac | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3014 | 0.1512 | 100 | 0.6391 | 0.1183 | 1.3224 | 0.1789 | 0.4749 | -0.4595 | 0.1057 | 1.0180 | 3.3360 | 0.1183 | 0.1183 | 1.0 | 0.2572 | 0.2207 |
| 0.9318 | 0.3023 | 200 | 0.5939 | 0.0752 | 9.0802 | 0.8670 | 1.1566 | -0.3975 | 0.1333 | 10.1709 | 15.2637 | 0.0346 | 0.0752 | 1.0 | 0.4250 | 0.3786 |
| 1.1289 | 0.4535 | 300 | 0.6938 | 0.1151 | 14.7143 | 2.1435 | 2.8181 | -0.3684 | 0.1767 | 15.7264 | 23.2621 | 0.0393 | 0.1151 | 1.0 | 0.5036 | 0.4522 |
| 1.3777 | 0.6047 | 400 | 0.6486 | 0.0698 | 13.2713 | 1.2326 | 1.6702 | -0.4066 | 0.1228 | 15.5137 | 23.8374 | -0.0345 | 0.0698 | 1.0 | 0.4392 | 0.3876 |
| 1.1911 | 0.7559 | 500 | 0.6888 | 0.0936 | 16.0572 | 1.9620 | 2.5727 | -0.3866 | 0.1471 | 17.9087 | 28.7161 | -0.0111 | 0.0936 | 1.0 | 0.5027 | 0.4490 |
| 1.0347 | 0.9070 | 600 | 0.8203 | 0.1705 | 16.6192 | 3.5151 | 4.7567 | -0.3392 | 0.2229 | 16.4768 | 28.4131 | 0.1085 | 0.1705 | 1.0 | 0.5021 | 0.4487 |
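
Validation loss in the table above is not monotonic: it is lowest at step 200 and highest at the final evaluation at step 600, which is the one reported at the top of the card. A small sketch for picking out the lowest-loss evaluation step follows. The rows are copied from the table; `best_checkpoint` is a hypothetical helper, and the card does not say whether checkpoints were saved at these steps.

```python
# (step, epoch, training loss, validation loss), copied from the results table.
ROWS = [
    (100, 0.1512, 1.3014, 0.6391),
    (200, 0.3023, 0.9318, 0.5939),
    (300, 0.4535, 1.1289, 0.6938),
    (400, 0.6047, 1.3777, 0.6486),
    (500, 0.7559, 1.1911, 0.6888),
    (600, 0.9070, 1.0347, 0.8203),
]


def best_checkpoint(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[3])


best = best_checkpoint(ROWS)
final = ROWS[-1]
print(f"lowest validation loss: {best[3]} at step {best[0]}")
print(f"final minus lowest: {final[3] - best[3]:.4f}")
```

Validation loss alone may not be the right selection criterion for a preference-tuned model, since the margin and gap metrics keep growing over the same steps; the sketch only makes the loss comparison explicit.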

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4