ModelHub XC 6dfe4c8f03 Initialize project; model provided by the ModelHub XC community
Model: jackf857/qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64
Source: Original Platform
2026-05-26 01:10:34 +08:00

  • library_name: transformers
  • tags: alignment-handbook, epsilon-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index name: qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-172841

qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260418-172841

This model is a fine-tuned version of the SFT checkpoint stored at the local path /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452, trained on the Anthropic/hh-rlhf dataset. At the final logged evaluation (step 600) it achieves the following results on the evaluation set:

  • Loss: 0.5753
  • Epsilon Dpo/beta: 0.0120
  • Epsilon Dpo/loss Margin Mean: 30.1703
  • Epsilon Dpo/beta Margin Mean: 0.3587
  • Epsilon Dpo/beta Margin Std: 0.6142
  • Epsilon Dpo/beta Margin Grad Mean: -0.4187
  • Epsilon Dpo/beta Margin Grad Std: 0.1376
  • Rewards/chosen: -0.5348
  • Rewards/rejected: -0.8935
  • Rewards/accuracies: 0.7236
  • Rewards/margins: 0.3587
  • Logps/chosen: -131.7627
  • Logps/rejected: -178.7413
  • Logps/ref Chosen: -87.4272
  • Logps/ref Rejected: -104.2355
  • Logits/chosen: -0.2089
  • Logits/rejected: -0.3424
  • Kl/p Epsilon Steps: 0.7170
  • Kl/n Epsilon Steps: 0.2817
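The reward metrics above look consistent with the usual DPO-style implicit reward, beta × (policy log-prob − reference log-prob), with Rewards/margins equal to Rewards/chosen minus Rewards/rejected. The sketch below rechecks that arithmetic from the reported means. Treating the reported beta as a single scalar applied to mean log-probs is an assumption (the trainer may use per-step or per-sample values), so only approximate agreement is expected; the helper name `implicit_reward` is illustrative and not taken from the training code.

```python
# Values copied from the evaluation metrics listed above.
beta = 0.0120
logps_chosen, logps_rejected = -131.7627, -178.7413
ref_chosen, ref_rejected = -87.4272, -104.2355
reported_chosen, reported_rejected = -0.5348, -0.8935
reported_margin = 0.3587


def implicit_reward(beta: float, logp: float, ref_logp: float) -> float:
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Assumption: one scalar beta applied to the mean log-probs.
    """
    return beta * (logp - ref_logp)


approx_chosen = implicit_reward(beta, logps_chosen, ref_chosen)
approx_rejected = implicit_reward(beta, logps_rejected, ref_rejected)

# The margin is the chosen reward minus the rejected reward.
margin_from_rewards = reported_chosen - reported_rejected

print(f"chosen:   approx {approx_chosen:.4f} vs reported {reported_chosen:.4f}")
print(f"rejected: approx {approx_rejected:.4f} vs reported {reported_rejected:.4f}")
print(f"margin:   {margin_from_rewards:.4f} vs reported {reported_margin:.4f}")
```

The small gaps between the approximate and reported rewards are what one would expect from rounding beta to four decimals and from averaging products rather than multiplying averages.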

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
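The batch-size entries above are related by simple arithmetic: the total train batch size is the per-device size times the number of devices times the gradient accumulation steps, and the total eval batch size omits accumulation. The sketch below checks this and also illustrates a cosine schedule with linear warmup using the listed learning rate and warmup ratio. The total step count of 661 is an assumption (estimated from step 600 falling at epoch 0.9070 in the training results), and `cosine_lr` is an illustrative helper that mirrors the common warmup-then-cosine shape, not the trainer's own code.

```python
import math

# Hyperparameters copied from the list above.
train_batch_size = 8
eval_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

PEAK_LR = 5e-07
WARMUP_RATIO = 0.1
TOTAL_STEPS = 661  # assumption: roughly 600 / 0.9070 optimizer steps in one epoch


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS,
              peak_lr: float = PEAK_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(total_train_batch_size, total_eval_batch_size)
for step in (0, 100, 300, 600):
    print(step, f"{cosine_lr(step):.3e}")
```

Under these assumptions warmup lasts 67 steps, so the first logged evaluation at step 100 would already fall in the decay phase.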

Training results

| Training Loss | Epoch | Step | Validation Loss | Epsilon Dpo/beta | Epsilon Dpo/loss Margin Mean | Epsilon Dpo/beta Margin Mean | Epsilon Dpo/beta Margin Std | Epsilon Dpo/beta Margin Grad Mean | Epsilon Dpo/beta Margin Grad Std | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3499 | 0.1512 | 100 | 0.6692 | 0.0877 | 0.6693 | 0.0574 | 0.1823 | -0.4858 | 0.0449 | -0.0275 | -0.0849 | 0.6312 | 0.0574 | -87.7332 | -105.2108 | -87.4272 | -104.2355 | 0.1122 | 0.1240 | 0.6241 | 0.3759 |
| 1.1283 | 0.3023 | 200 | 0.6028 | 0.0636 | 4.8746 | 0.3057 | 0.6498 | -0.4313 | 0.1445 | -0.1604 | -0.4661 | 0.6831 | 0.3057 | -89.9090 | -111.5920 | -87.4272 | -104.2355 | -0.3390 | -0.3863 | 0.6734 | 0.3261 |
| 1.2376 | 0.4535 | 300 | 0.5699 | 0.0434 | 10.8082 | 0.4637 | 0.8475 | -0.4021 | 0.1759 | -0.4198 | -0.8835 | 0.7086 | 0.4637 | -97.0338 | -124.6503 | -87.4272 | -104.2355 | -0.3510 | -0.4271 | 0.6941 | 0.3050 |
| 1.2438 | 0.6047 | 400 | 0.5624 | 0.0290 | 16.4413 | 0.4712 | 0.8253 | -0.4003 | 0.1706 | -0.4065 | -0.8777 | 0.7179 | 0.4712 | -101.3740 | -134.6236 | -87.4272 | -104.2355 | -0.3535 | -0.4572 | 0.7064 | 0.2923 |
| 1.1147 | 0.7559 | 500 | 0.5588 | 0.0187 | 26.6232 | 0.4919 | 0.8457 | -0.3960 | 0.1755 | -0.8059 | -1.2978 | 0.7205 | 0.4919 | -130.4120 | -173.8436 | -87.4272 | -104.2355 | -0.1937 | -0.3318 | 0.7157 | 0.2843 |
| 1.1925 | 0.9070 | 600 | 0.5753 | 0.0120 | 30.1703 | 0.3587 | 0.6142 | -0.4187 | 0.1376 | -0.5348 | -0.8935 | 0.7236 | 0.3587 | -131.7627 | -178.7413 | -87.4272 | -104.2355 | -0.2089 | -0.3424 | 0.7170 | 0.2817 |
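A quick way to read the results above is to pull out a few columns and compare checkpoints. The sketch below copies the step, validation loss, reward, accuracy, and margin values from the six logged evaluations, then finds the step with the lowest validation loss and the step with the highest reward accuracy. It also confirms that each reported margin equals chosen minus rejected. The tuple layout and variable names are illustrative choices, not part of the training code.

```python
# Columns: step, validation loss, Rewards/chosen, Rewards/rejected,
# Rewards/accuracies, Rewards/margins (values copied from the results above).
rows = [
    (100, 0.6692, -0.0275, -0.0849, 0.6312, 0.0574),
    (200, 0.6028, -0.1604, -0.4661, 0.6831, 0.3057),
    (300, 0.5699, -0.4198, -0.8835, 0.7086, 0.4637),
    (400, 0.5624, -0.4065, -0.8777, 0.7179, 0.4712),
    (500, 0.5588, -0.8059, -1.2978, 0.7205, 0.4919),
    (600, 0.5753, -0.5348, -0.8935, 0.7236, 0.3587),
]

# Step with the lowest validation loss and step with the highest accuracy.
best_loss_step = min(rows, key=lambda r: r[1])[0]
best_acc_step = max(rows, key=lambda r: r[4])[0]

# Each reported margin should equal chosen minus rejected (up to rounding).
margins_consistent = all(
    abs((chosen - rejected) - margin) < 1e-3
    for _, _, chosen, rejected, _, margin in rows
)

print("lowest validation loss at step", best_loss_step)
print("highest reward accuracy at step", best_acc_step)
print("margins consistent:", margins_consistent)
```

In these logs the lowest validation loss occurs at step 500, while reward accuracy is highest at the final step 600, so the two criteria would select different checkpoints.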

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4
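
To compare a local environment against the versions listed above, a small check along the following lines may help. This is a minimal sketch: the helper names `installed_versions` and `mismatches` are hypothetical, and it assumes the packages are installed under their usual PyPI distribution names (PyTorch is distributed as `torch`). An exact string match is used, so a different CUDA build suffix would be reported as a mismatch.

```python
from importlib import metadata

# Versions copied from the list above, keyed by PyPI distribution name.
PINNED = {
    "transformers": "4.51.0",
    "torch": "2.3.1+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.21.4",
}


def installed_versions(names):
    """Return {name: installed version, or None if the package is missing}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, pinned=PINNED):
    """Return {name: (installed, pinned)} for every package that differs."""
    return {
        name: (installed.get(name), want)
        for name, want in pinned.items()
        if installed.get(name) != want
    }


# Compare the current environment against the listed versions.
report = mismatches(installed_versions(PINNED))
for name, (have, want) in report.items():
    print(f"{name}: installed {have}, listed {want}")
```

An empty report means the environment matches the listed versions exactly.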