ModelHub XC 42e9e465a7 Initialize project; model provided by the ModelHub XC community
Model: jackf857/qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415
Source: Original Platform
2026-05-14 00:59:58 +08:00

  • library_name: transformers
  • base_model: jackf857/qwen3-8b-base-sft-hh-harmless-4xh200-batch-64-20260417-214452
  • tags: alignment-handbook, epsilon-dpo, generated_from_trainer
  • datasets: Anthropic/hh-rlhf
  • model-index: name qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415; results (none listed)

qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415

This model is a fine-tuned version of jackf857/qwen3-8b-base-sft-hh-harmless-4xh200-batch-64-20260417-214452 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5633
  • Rewards/chosen: -0.6512
  • Rewards/rejected: -1.0488
  • Rewards/accuracies: 0.7328
  • Rewards/margins: 0.3976
  • Logps/chosen: -149.9445
  • Logps/rejected: -198.7271
  • Logps/ref Chosen: -86.9018
  • Logps/ref Rejected: -96.6964
  • Logits/chosen: -1.3052
  • Logits/rejected: -1.4319
  • Kl/p Epsilon Steps: 0.7289
  • Kl/n Epsilon Steps: 0.2689
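
The reward and log-probability figures above can be cross-checked with simple arithmetic. The sketch below assumes the convention common to DPO-style trainers, where the margin is the mean chosen reward minus the mean rejected reward; the card itself does not state how the trainer computes it. The variable names are illustrative only.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -0.6512
rewards_rejected = -1.0488
logps_chosen = -149.9445
logps_rejected = -198.7271
ref_logps_chosen = -86.9018
ref_logps_rejected = -96.6964

# ASSUMPTION: margin = mean chosen reward - mean rejected reward.
margin = rewards_chosen - rewards_rejected

# How far the policy's log-probabilities moved away from the reference model.
chosen_shift = logps_chosen - ref_logps_chosen
rejected_shift = logps_rejected - ref_logps_rejected

print(f"margin                  = {margin:.4f}")
print(f"chosen log-prob shift   = {chosen_shift:.4f}")
print(f"rejected log-prob shift = {rejected_shift:.4f}")
```

Under that assumption the computed margin matches the reported Rewards/margins value of 0.3976. Both log-probability shifts are negative, and the rejected responses moved further from the reference than the chosen ones.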

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
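
The batch-size totals and the learning-rate schedule implied by these hyperparameters can be reproduced with a short calculation. This is a plain-Python sketch, not the trainer's own code: `cosine_lr` is a hypothetical helper that implements linear warmup followed by cosine decay to zero, which is the shape that `lr_scheduler_type: cosine` with a warmup ratio usually denotes. The total step count is not given in the list above, so it is estimated here from the training results table (epoch 0.1512 at step 100) and should be read as an assumption.

```python
import math

# Hyperparameters copied from the list above.
learning_rate = 5e-07
train_batch_size = 8            # per device
eval_batch_size = 8             # per device
num_devices = 4
gradient_accumulation_steps = 2
warmup_ratio = 0.1

# Effective batch sizes across devices (and accumulation steps for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# ASSUMPTION: total optimizer steps estimated from "epoch 0.1512 at step 100".
estimated_total_steps = round(100 / 0.1512)


def cosine_lr(step, total_steps, base_lr=learning_rate, warmup=warmup_ratio):
    """Hypothetical helper: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)
print("estimated_total_steps:", estimated_total_steps)
for step in (0, 100, 300, 600):
    print(step, cosine_lr(step, estimated_total_steps))
```

The two totals agree with the reported `total_train_batch_size` of 64 and `total_eval_batch_size` of 32. The printed learning rates are only as accurate as the estimated step count.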

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logps/ref Chosen | Logps/ref Rejected | Logits/chosen | Logits/rejected | Kl/p Epsilon Steps | Kl/n Epsilon Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3249 | 0.1512 | 100 | 0.6600 | 0.0570 | -0.0181 | 0.6822 | 0.0751 | -86.2224 | -96.9165 | -86.9018 | -96.6964 | -0.7700 | -0.7965 | 0.6761 | 0.3226 |
| 1.1555 | 0.3023 | 200 | 0.5709 | -0.2620 | -0.7050 | 0.7091 | 0.4431 | -91.2424 | -108.5375 | -86.9018 | -96.6964 | -1.2620 | -1.3218 | 0.6923 | 0.3050 |
| 1.1837 | 0.4535 | 300 | 0.5444 | -0.9429 | -1.5861 | 0.7377 | 0.6432 | -110.7582 | -137.0542 | -86.9018 | -96.6964 | -1.3035 | -1.4158 | 0.7306 | 0.2689 |
| 1.2239 | 0.6047 | 400 | 0.5372 | -0.9501 | -1.5951 | 0.7430 | 0.6451 | -124.1300 | -159.5326 | -86.9018 | -96.6964 | -1.2494 | -1.3749 | 0.7372 | 0.2614 |
| 1.0454 | 0.7559 | 500 | 0.5415 | -0.9468 | -1.5190 | 0.7350 | 0.5722 | -145.2611 | -190.7861 | -86.9018 | -96.6964 | -1.2671 | -1.3971 | 0.7293 | 0.2680 |
| 1.1764 | 0.9070 | 600 | 0.5633 | -0.6512 | -1.0488 | 0.7328 | 0.3976 | -149.9445 | -198.7271 | -86.9018 | -96.6964 | -1.3052 | -1.4319 | 0.7289 | 0.2689 |
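
To see how the evaluation checkpoints compare, the sketch below picks out the best logged step by validation loss, reward accuracy, and reward margin. It uses only values copied from the table above; the tuple layout and variable names are illustrative, and the card does not say which checkpoint was actually published.

```python
# (step, validation_loss, rewards_accuracy, rewards_margin), copied from the table above.
eval_log = [
    (100, 0.6600, 0.6822, 0.0751),
    (200, 0.5709, 0.7091, 0.4431),
    (300, 0.5444, 0.7377, 0.6432),
    (400, 0.5372, 0.7430, 0.6451),
    (500, 0.5415, 0.7350, 0.5722),
    (600, 0.5633, 0.7328, 0.3976),
]

best_by_loss = min(eval_log, key=lambda row: row[1])
best_by_accuracy = max(eval_log, key=lambda row: row[2])
best_by_margin = max(eval_log, key=lambda row: row[3])
final = eval_log[-1]

print("lowest validation loss at step", best_by_loss[0], "->", best_by_loss[1])
print("highest reward accuracy at step", best_by_accuracy[0], "->", best_by_accuracy[2])
print("largest reward margin at step", best_by_margin[0], "->", best_by_margin[3])
print("final logged step", final[0], "has validation loss", final[1])
```

On these logged values, step 400 is best on all three criteria. The final evaluation at step 600, which is the one summarized at the top of the card, shows a higher validation loss and a smaller margin than step 400.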

Framework versions

  • Transformers 4.51.0
  • Pytorch 2.3.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.21.4