ModelHub XC 7ea33623c3 Initialize project; model provided by the ModelHub XC community
Model: W-61/llama-3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260417-230753
Source: Original Platform
2026-05-26 09:02:17 +08:00

library_name: transformers
base_model: llama-3-8b-base-sft-hh-helpful-4xh200-batch-64
tags: alignment-handbook, beta-dpo, generated_from_trainer
datasets: Anthropic/hh-rlhf
model-index name: llama-3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260417-230753
model-index results: (empty)

llama-3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260417-230753

This model is a fine-tuned version of llama-3-8b-base-sft-hh-helpful-4xh200-batch-64 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set (the values match the last logged evaluation, at step 600):

  • Loss: 1.7101
  • Beta Dpo/beta: 0.0691
  • Beta Dpo/loss Margin Mean: 86.8606
  • Beta Dpo/beta Margin Mean: 10.0274
  • Beta Dpo/beta Margin Std: 12.8117
  • Beta Dpo/beta Margin Grad Mean: -0.4550
  • Beta Dpo/beta Margin Grad Std: 0.0744
  • Beta Dpo/gap Mean: 130.0152
  • Beta Dpo/gap Std: 165.0541
  • Beta Dpo/beta Used Raw: -2.4893
  • Beta Dpo/beta Used: 0.0691
  • Beta Dpo/mask Keep Frac: 1.0
  • Logits/chosen: -0.2789
  • Logits/rejected: -0.2575
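The card gives no usage instructions, so the following is only a minimal loading sketch. It assumes the identifier from the "Model:" line resolves through `from_pretrained` on whichever hub hosts the checkpoint, and that plain-text prompts are acceptable; neither is confirmed by the card. The helper name `generate` is hypothetical.

```python
# Identifier copied from the card; whether it resolves on a given hub is an assumption.
MODEL_ID = "W-61/llama-3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260417-230753"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the checkpoint and return the decoded continuation of `prompt`.

    Hypothetical helper: the card does not document a prompt format,
    so the prompt is passed to the tokenizer as plain text.
    """
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("How do I bake bread?")` would download the weights on first use, so it is best run on a machine with enough memory for an 8B-parameter model.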

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
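The total batch sizes in the list above follow from the per-device values. As a quick check, this sketch assumes the usual convention that the training total multiplies per-device batch size, device count, and gradient accumulation steps, while evaluation uses no accumulation. The function name is illustrative only.

```python
def effective_batch_sizes(per_device_train, per_device_eval, num_devices, grad_accum_steps):
    """Return (total_train_batch_size, total_eval_batch_size)."""
    # Training: each optimizer step sees grad_accum_steps micro-batches per device.
    total_train = per_device_train * num_devices * grad_accum_steps
    # Evaluation: no gradient accumulation, so only devices multiply the batch.
    total_eval = per_device_eval * num_devices
    return total_train, total_eval


# Values taken from the hyperparameter list.
total_train, total_eval = effective_batch_sizes(
    per_device_train=8, per_device_eval=8, num_devices=4, grad_accum_steps=2
)
print(total_train, total_eval)  # 64 32
```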

Training results

| Training Loss | Epoch | Step | Validation Loss | Beta Dpo/beta | Beta Dpo/loss Margin Mean | Beta Dpo/beta Margin Mean | Beta Dpo/beta Margin Std | Beta Dpo/beta Margin Grad Mean | Beta Dpo/beta Margin Grad Std | Beta Dpo/gap Mean | Beta Dpo/gap Std | Beta Dpo/beta Used Raw | Beta Dpo/beta Used | Beta Dpo/mask Keep Frac | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9552 | 0.1468 | 100 | 0.6786 | 0.0046 | 9.8098 | 0.0725 | 0.1096 | -0.4895 | 0.0167 | 17.6954 | 22.1843 | -0.3731 | 0.0046 | 1.0 | -0.6698 | -0.6418 |
| 0.8706 | 0.2937 | 200 | 0.6904 | 0.0046 | 27.7458 | 0.2199 | 0.3260 | -0.4903 | 0.0228 | 50.6913 | 68.2433 | -1.2767 | 0.0046 | 1.0 | -0.6064 | -0.5873 |
| 2.8698 | 0.4405 | 300 | 0.8542 | 0.0215 | 46.8593 | 1.7761 | 2.5216 | -0.4710 | 0.0500 | 79.1242 | 110.1004 | -1.8359 | 0.0215 | 1.0 | -0.4178 | -0.4010 |
| 18.5063 | 0.5874 | 400 | 0.7607 | 0.0093 | 66.8920 | 1.0762 | 1.4305 | -0.4753 | 0.0447 | 116.2162 | 143.8824 | -2.8595 | 0.0093 | 1.0 | -0.4157 | -0.3938 |
| 1.1587 | 0.7342 | 500 | 1.3024 | 0.0541 | 78.1021 | 7.2488 | 9.0766 | -0.4557 | 0.0679 | 118.3478 | 142.3098 | -2.3147 | 0.0541 | 1.0 | -0.3590 | -0.3353 |
| 4.9835 | 0.8811 | 600 | 1.7101 | 0.0691 | 86.8606 | 10.0274 | 12.8117 | -0.4550 | 0.0744 | 130.0152 | 165.0541 | -2.4893 | 0.0691 | 1.0 | -0.2789 | -0.2575 |
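The Epoch and Step columns, together with the cosine schedule and 0.1 warmup ratio listed under the hyperparameters, allow a rough estimate of the learning rate at each logged step. The sketch below rests on two assumptions not stated in the card: that the single epoch spans about `step / epoch` optimizer steps, and that warmup is linear over `ceil(warmup_ratio * total_steps)` steps before a half-cosine decay to zero. Function names are illustrative.

```python
import math


def estimate_total_steps(step, epoch_fraction):
    """Estimate steps in one epoch from a logged (step, epoch) pair."""
    return round(step / epoch_fraction)


def cosine_lr(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup followed by half-cosine decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Last logged row: step 600 at epoch 0.8811.
total_steps = estimate_total_steps(600, 0.8811)
for step in (100, 200, 300, 400, 500, 600):
    print(step, f"{cosine_lr(step, total_steps):.3e}")
```

Under these assumptions the learning rate peaks at the end of warmup and has already decayed to a small fraction of 5e-07 by the last logged step.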

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4