[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
2026-04-10 23:31:29 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2026-04-10 23:31:29 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/feng.yulu/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None)
2026-04-10
23:31:29 - INFO - __main__ - Training/evaluation parameters EpsilonDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.01, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_dropout=True, disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, epsilon=0.01, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=100, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=FDivergenceType.REVERSE_KL, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=W-61/llama-3-8b-base-epsilon-dpo-hh-helpful, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, 
include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/llama-3-8b-base-epsilon-dpo-hh-helpful/runs/Apr10_23-31-28_d4054, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=16, per_device_train_batch_size=16, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, remove_unused_columns=False, report_to=['wandb'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108, save_on_each_node=False, save_only_model=False, save_safetensors=True, 
save_steps=200, save_strategy=SaveStrategy.STEPS, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/scratch/feng.yulu/dynamic-dpo-v4/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=epsilon_dpo, truncation_mode=keep_end, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, )
2026-04-10 23:31:29 - INFO - __main__ - Epsilon-DPO parameters: beta=0.01, epsilon=0.01, gradient_accumulation_steps=1
2026-04-10 23:31:29 - INFO - __main__ - Using persistent HF datasets cache at /scratch/feng.yulu/dynamic-dpo-v4/hf/datasets
2026-04-10 23:31:33 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.).
Normalizing raw HH preferences (train):   0%|          | 0/43598 [00:00<?, ? examples/s]
[INFO|tokenization_utils_base.py:2058] 2026-04-10 23:31:38,617 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 23:31:38,617 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2058] 2026-04-10 23:31:38,617 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 23:31:38,617 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 23:31:38,617 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2026-04-10 23:31:38,617 >> loading file chat_template.jinja
Normalizing raw HH preferences (test):   0%|          | 0/2339 [00:00<?, ? examples/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Normalizing raw HH preferences (test): 100%|██████████| 2339/2339 [00:00<00:00, 8941.64 examples/s]
2026-04-10 23:31:39 - INFO - __main__ - Processed train sample 41905:
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|> What is the best way to clean my refrigerator?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|eot_id|><|start_header_id|>user<|end_header_id|> What are your ideas?<|eot_id|>
Chosen: <|start_header_id|>assistant<|end_header_id|> Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|eot_id|>
Rejected: <|start_header_id|>assistant<|end_header_id|> I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. I was also thinking that you could try using different tools, like a toothbrush, to clean the inside.
Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|eot_id|>
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
[INFO|configuration_utils.py:691] 2026-04-10 23:31:39,350 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/config.json
[INFO|configuration_utils.py:765] 2026-04-10 23:31:39,351 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": false,
  "vocab_size": 128256
}
[INFO|modeling_utils.py:1121] 2026-04-10 23:31:39,359 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2026-04-10 23:31:39,360 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[WARNING|logging.py:328] 2026-04-10 23:31:39,362 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1142] 2026-04-10 23:31:39,363 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "use_cache": false
}
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 36.93it/s]
[WARNING|trainer.py:821] 2026-04-10 23:31:39,831 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.
Loading checkpoint shards: 100%|██████████| 7/7 [00:08<00:00,  1.16s/it]
[INFO|modeling_utils.py:4926] 2026-04-10 23:31:47,490 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4934] 2026-04-10 23:31:47,490 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-10 23:31:47,492 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-10 23:31:47,492 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
[INFO|configuration_utils.py:691] 2026-04-10 23:31:47,493 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/config.json
[INFO|configuration_utils.py:765] 2026-04-10 23:31:47,494 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": false,
  "vocab_size": 128256
}
[INFO|modeling_utils.py:1121] 2026-04-10 23:31:47,495 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2026-04-10 23:31:47,495 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1142] 2026-04-10 23:31:47,498 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "use_cache": false
}
Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4934] 2026-04-10 23:31:55,706 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-10 23:31:55,708 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-8xh200-20260410-133758/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-10 23:31:55,708 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
[WARNING|trainer.py:816] 2026-04-10 23:31:55,710 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Tokenizing train (num_proc=12):   0%|          | 0/43598 [00:00<?, ? examples/s]
Saving the dataset (0/2 shards):   0%|          | 0/43598 [00:00<?, ? examples/s]
Tokenizing test (num_proc=12):   0%|          | 0/2339 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards):   0%|          | 0/2339 [00:00<?, ? examples/s]
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:518: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `EpsilonDPOTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
[INFO|trainer.py:748] 2026-04-10 23:43:42,585 >> Using auto half precision backend
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.
  warnings.warn(
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaDecoderLayer because mixed precision turned on in FSDP.
Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.
  warnings.warn(
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
  warnings.warn(
[INFO|trainer.py:2414] 2026-04-10 23:43:47,713 >> ***** Running training *****
[INFO|trainer.py:2415] 2026-04-10 23:43:47,713 >>   Num examples = 43,598
[INFO|trainer.py:2416] 2026-04-10 23:43:47,713 >>   Num Epochs = 1
[INFO|trainer.py:2417] 2026-04-10 23:43:47,713 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:2420] 2026-04-10 23:43:47,713 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2421] 2026-04-10 23:43:47,713 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:2422] 2026-04-10 23:43:47,713 >>   Total optimization steps = 340
[INFO|trainer.py:2423] 2026-04-10 23:43:47,714 >>   Number of trainable parameters = 1,003,782,656
[INFO|integration_utils.py:831] 2026-04-10 23:43:47,714 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: can-not-fand (can-not-fand-northeastern-university). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.25.1 is available! To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260410_234350-4j5nnm1b
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108
wandb: ⭐️ View project at https://wandb.ai/can-not-fand-northeastern-university/huggingface
wandb: 🚀 View run at https://wandb.ai/can-not-fand-northeastern-university/huggingface/runs/4j5nnm1b
  0%|          | 0/340 [00:00<?, ?it/s]
[WARNING|modeling_utils.py:1713] 2026-04-10 23:43:56,949 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
  0%|          | 1/340 [00:03<19:38,  3.48s/it]
{'loss': 0.6932, 'grad_norm': 2.3687267303466797, 'learning_rate': 0.0, 'rewards/chosen': 9.683193638920784e-06, 'rewards/rejected': 0.00013133684115018696, 'rewards/accuracies': 0.515625, 'rewards/margins': -0.0001216536620631814, 'logps/chosen': -69.28079223632812, 'logps/rejected': -69.7318344116211, 'logps/ref_chosen': -69.2831802368164, 'logps/ref_rejected': -69.74366760253906, 'logits/chosen': -0.5232092142105103,
'logits/rejected': -0.36964714527130127, 'kl/p_epsilon_steps': 0.5, 'kl/n_epsilon_steps': 0.5, 'kl/beta': 0.009999999776482582, 'kl/avg_steps': 0.0, 'epoch': 0.0}
  1%|▏         | 5/340 [00:14<15:34,  2.79s/it]
{'loss': 0.6932, 'grad_norm': 2.401517868041992, 'learning_rate': 5.88235294117647e-08, 'rewards/chosen': -0.00011636512499535456, 'rewards/rejected': -3.8100268284324557e-05, 'rewards/accuracies': 0.505859375, 'rewards/margins': -7.826486398698762e-05, 'logps/chosen': -75.71084594726562, 'logps/rejected': -81.47822570800781, 'logps/ref_chosen': -75.70054626464844, 'logps/ref_rejected': -81.47293090820312, 'logits/chosen': -0.5336302518844604, 'logits/rejected': -0.41014784574508667, 'kl/p_epsilon_steps': 0.5, 'kl/n_epsilon_steps': 0.498046875, 'kl/beta': 0.009997854940593243, 'kl/avg_steps': 0.001953125, 'epoch': 0.01}
  3%|▎         | 10/340 [00:28<15:01,  2.73s/it]
{'loss': 0.6932, 'grad_norm': 2.312957525253296, 'learning_rate': 1.3235294117647057e-07, 'rewards/chosen': -7.36937508918345e-05, 'rewards/rejected': -6.373519863700494e-05, 'rewards/accuracies': 0.4765625, 'rewards/margins': -9.958527698472608e-06, 'logps/chosen': -77.008544921875, 'logps/rejected': -82.64922332763672, 'logps/ref_chosen': -77.0025405883789, 'logps/ref_rejected': -82.64138793945312, 'logits/chosen': -0.5401719808578491, 'logits/rejected': -0.4321846067905426, 'kl/p_epsilon_steps': 0.4703125059604645, 'kl/n_epsilon_steps': 0.5234375, 'kl/beta': 0.010005339980125427, 'kl/avg_steps': -0.05312500149011612, 'epoch': 0.03}
  4%|▍         | 15/340 [00:41<14:45,  2.73s/it]
{'loss': 0.6928, 'grad_norm': 2.890460968017578, 'learning_rate': 2.0588235294117645e-07, 'rewards/chosen': 8.707816596142948e-05, 'rewards/rejected': -0.0005263587227091193, 'rewards/accuracies': 0.5718749761581421, 'rewards/margins': 0.0006134368595667183, 'logps/chosen': -70.82783508300781, 'logps/rejected': -87.48735809326172, 'logps/ref_chosen': -70.83788299560547, 'logps/ref_rejected': -87.43305206298828, 'logits/chosen': -0.5125764608383179, 'logits/rejected': -0.4432317316532135, 'kl/p_epsilon_steps': 0.5640624761581421, 'kl/n_epsilon_steps': 0.43437498807907104, 'kl/beta': 0.010008977726101875, 'kl/avg_steps': 0.12968750298023224, 'epoch': 0.04}
  6%|▌         | 20/340 [00:55<14:22,  2.70s/it]
{'loss': 0.6922, 'grad_norm': 2.124864101409912, 'learning_rate': 2.7941176470588235e-07, 'rewards/chosen': 0.00024056310940068215, 'rewards/rejected': -0.0016350041842088103, 'rewards/accuracies': 0.6499999761581421, 'rewards/margins': 0.0018755672499537468, 'logps/chosen': -70.1437759399414, 'logps/rejected': -82.44139099121094, 'logps/ref_chosen': -70.1697006225586, 'logps/ref_rejected': -82.27420806884766, 'logits/chosen': -0.5464522242546082, 'logits/rejected': -0.4405369162559509, 'kl/p_epsilon_steps': 0.6312500238418579, 'kl/n_epsilon_steps': 0.3656249940395355, 'kl/beta': 0.009920386597514153, 'kl/avg_steps': 0.265625, 'epoch': 0.06}
  7%|▋         | 25/340 [01:09<14:29,  2.76s/it]
{'loss': 0.6904, 'grad_norm': 2.521970510482788, 'learning_rate': 3.529411764705882e-07, 'rewards/chosen': 0.0008119211415760219, 'rewards/rejected': -0.00473719323053956, 'rewards/accuracies': 0.7796875238418579, 'rewards/margins': 0.005549114663153887, 'logps/chosen': -74.4179458618164, 'logps/rejected': -90.02223205566406, 'logps/ref_chosen': -74.5040283203125, 'logps/ref_rejected': -89.5297622680664, 'logits/chosen': -0.568504273891449, 'logits/rejected': -0.4329379200935364, 'kl/p_epsilon_steps': 0.7718750238418579, 'kl/n_epsilon_steps': 0.22812500596046448, 'kl/beta': 0.009726567193865776, 'kl/avg_steps': 0.543749988079071, 'epoch': 0.07}
  9%|▉         | 30/340 [01:22<14:02,  2.72s/it]
{'loss': 0.6867, 'grad_norm': 2.3963418006896973, 'learning_rate': 4.264705882352941e-07, 'rewards/chosen': 0.0004493276646826416, 'rewards/rejected': -0.012645403854548931, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.01309473067522049, 'logps/chosen': -76.55107879638672, 'logps/rejected': -83.71476745605469, 'logps/ref_chosen': -76.60227966308594, 'logps/ref_rejected': -82.36322784423828, 'logits/chosen': -0.6653466820716858, 'logits/rejected': -0.49282917380332947, 'kl/p_epsilon_steps': 0.7828124761581421, 'kl/n_epsilon_steps': 0.21562500298023224, 'kl/beta': 0.00945484172552824, 'kl/avg_steps': 0.567187488079071, 'epoch': 0.09}
 10%|█         | 35/340 [01:36<13:46,  2.71s/it]
{'loss': 0.6835, 'grad_norm': 2.311098337173462, 'learning_rate': 5e-07, 'rewards/chosen': -0.003281622426584363, 'rewards/rejected': -0.02297242358326912, 'rewards/accuracies': 0.809374988079071, 'rewards/margins': 0.019690800458192825, 'logps/chosen': -76.14710998535156, 'logps/rejected': -86.21320343017578, 'logps/ref_chosen': -75.79379272460938, 'logps/ref_rejected': -83.69039154052734, 'logits/chosen': -0.6610927581787109, 'logits/rejected': -0.5268033146858215, 'kl/p_epsilon_steps': 0.776562511920929, 'kl/n_epsilon_steps': 0.22187499701976776, 'kl/beta': 0.009198471903800964, 'kl/avg_steps': 0.5546875, 'epoch': 0.1}
 12%|█▏        | 40/340 [01:49<13:39,  2.73s/it]
{'loss': 0.6732, 'grad_norm': 2.681466817855835, 'learning_rate': 4.996706849759452e-07, 'rewards/chosen': -0.021187324076890945, 'rewards/rejected': -0.06287702172994614, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.04168969392776489, 'logps/chosen': -77.57659149169922, 'logps/rejected': -93.75047302246094, 'logps/ref_chosen': -75.21812438964844, 'logps/ref_rejected': -86.6792984008789, 'logits/chosen': -0.8570469617843628, 'logits/rejected': -0.7218376398086548, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.2953124940395355, 'kl/beta': 0.008961381390690804, 'kl/avg_steps': 0.4078125059604645, 'epoch': 0.12}
 13%|█▎        | 45/340 [02:03<13:13,  2.69s/it]
{'loss': 0.6715, 'grad_norm': 3.2158050537109375, 'learning_rate': 4.986836074908615e-07, 'rewards/chosen': -0.049075882881879807, 'rewards/rejected': -0.09539631009101868, 'rewards/accuracies': 0.7515624761581421, 'rewards/margins': 0.04632042720913887, 'logps/chosen': -82.83192443847656, 'logps/rejected': -102.5731201171875, 'logps/ref_chosen': -77.2712173461914, 'logps/ref_rejected': -91.67030334472656, 'logits/chosen': -0.9623914957046509, 'logits/rejected': -0.8303581476211548, 'kl/p_epsilon_steps': 0.659375011920929, 'kl/n_epsilon_steps': 0.34062498807907104, 'kl/beta': 0.008803511038422585, 'kl/avg_steps':
0.3187499940395355, 'epoch': 0.13} 13%|█▎ | 45/340 [02:03<13:13, 2.69s/it] 14%|█▎ | 46/340 [02:06<13:21, 2.73s/it] 14%|█▍ | 47/340 [02:08<13:23, 2.74s/it] 14%|█▍ | 48/340 [02:11<13:20, 2.74s/it] 14%|█▍ | 49/340 [02:14<13:12, 2.72s/it] 15%|█▍ | 50/340 [02:17<13:11, 2.73s/it] {'loss': 0.6712, 'grad_norm': 3.3687705993652344, 'learning_rate': 4.970413680203148e-07, 'rewards/chosen': -0.07562652230262756, 'rewards/rejected': -0.12378251552581787, 'rewards/accuracies': 0.698437511920929, 'rewards/margins': 0.048155996948480606, 'logps/chosen': -82.58769226074219, 'logps/rejected': -94.2342758178711, 'logps/ref_chosen': -73.91633605957031, 'logps/ref_rejected': -79.92402648925781, 'logits/chosen': -1.0613696575164795, 'logits/rejected': -0.921216607093811, 'kl/p_epsilon_steps': 0.6031249761581421, 'kl/n_epsilon_steps': 0.3968749940395355, 'kl/beta': 0.008690183982253075, 'kl/avg_steps': 0.20624999701976776, 'epoch': 0.15} 15%|█▍ | 50/340 [02:17<13:11, 2.73s/it] 15%|█▌ | 51/340 [02:19<13:09, 2.73s/it] 15%|█▌ | 52/340 [02:22<12:51, 2.68s/it] 16%|█▌ | 53/340 [02:25<12:48, 2.68s/it] 16%|█▌ | 54/340 [02:27<12:54, 2.71s/it] 16%|█▌ | 55/340 [02:30<12:51, 2.71s/it] {'loss': 0.6639, 'grad_norm': 4.448400020599365, 'learning_rate': 4.947482930773511e-07, 'rewards/chosen': -0.10501708835363388, 'rewards/rejected': -0.17072856426239014, 'rewards/accuracies': 0.6859375238418579, 'rewards/margins': 0.06571148335933685, 'logps/chosen': -91.91180419921875, 'logps/rejected': -103.12516021728516, 'logps/ref_chosen': -79.74378204345703, 'logps/ref_rejected': -83.18132019042969, 'logits/chosen': -1.1757243871688843, 'logits/rejected': -1.0121644735336304, 'kl/p_epsilon_steps': 0.604687511920929, 'kl/n_epsilon_steps': 0.39375001192092896, 'kl/beta': 0.00860314816236496, 'kl/avg_steps': 0.2109375, 'epoch': 0.16} 16%|█▌ | 55/340 [02:30<12:51, 2.71s/it] 16%|█▋ | 56/340 [02:33<12:54, 2.73s/it] 17%|█▋ | 57/340 [02:36<12:57, 2.75s/it] 17%|█▋ | 58/340 [02:38<12:53, 2.74s/it] 17%|█▋ | 59/340 
[02:41<12:47, 2.73s/it] 18%|█▊ | 60/340 [02:44<12:50, 2.75s/it] {'loss': 0.6663, 'grad_norm': 3.918736219406128, 'learning_rate': 4.918104238142103e-07, 'rewards/chosen': -0.14256855845451355, 'rewards/rejected': -0.20627903938293457, 'rewards/accuracies': 0.660937488079071, 'rewards/margins': 0.06371048837900162, 'logps/chosen': -98.29524993896484, 'logps/rejected': -105.27479553222656, 'logps/ref_chosen': -81.61141967773438, 'logps/ref_rejected': -80.947998046875, 'logits/chosen': -1.2249476909637451, 'logits/rejected': -1.1036134958267212, 'kl/p_epsilon_steps': 0.5859375, 'kl/n_epsilon_steps': 0.4140625, 'kl/beta': 0.008520014584064484, 'kl/avg_steps': 0.171875, 'epoch': 0.18} 18%|█▊ | 60/340 [02:44<12:50, 2.75s/it] 18%|█▊ | 61/340 [02:46<12:32, 2.70s/it] 18%|█▊ | 62/340 [02:49<12:36, 2.72s/it] 19%|█▊ | 63/340 [02:52<12:35, 2.73s/it] 19%|█▉ | 64/340 [02:55<12:34, 2.73s/it] 19%|█▉ | 65/340 [02:57<12:24, 2.71s/it] {'loss': 0.6479, 'grad_norm': 3.5865535736083984, 'learning_rate': 4.882355001067891e-07, 'rewards/chosen': -0.14002391695976257, 'rewards/rejected': -0.24343439936637878, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.10341048240661621, 'logps/chosen': -91.71420288085938, 'logps/rejected': -117.06733703613281, 'logps/ref_chosen': -75.09439849853516, 'logps/ref_rejected': -87.96830749511719, 'logits/chosen': -1.2322965860366821, 'logits/rejected': -1.151759147644043, 'kl/p_epsilon_steps': 0.637499988079071, 'kl/n_epsilon_steps': 0.3609375059604645, 'kl/beta': 0.00841777864843607, 'kl/avg_steps': 0.27656251192092896, 'epoch': 0.19} 19%|█▉ | 65/340 [02:57<12:24, 2.71s/it] 19%|█▉ | 66/340 [03:00<12:12, 2.67s/it] 20%|█▉ | 67/340 [03:03<12:09, 2.67s/it] 20%|██ | 68/340 [03:05<12:08, 2.68s/it] 20%|██ | 69/340 [03:08<11:56, 2.64s/it] 21%|██ | 70/340 [03:11<11:54, 2.65s/it] {'loss': 0.6462, 'grad_norm': 3.871297836303711, 'learning_rate': 4.840329401637809e-07, 'rewards/chosen': -0.16316808760166168, 'rewards/rejected': -0.27303391695022583, 
'rewards/accuracies': 0.7265625, 'rewards/margins': 0.10986582934856415, 'logps/chosen': -89.69293975830078, 'logps/rejected': -122.0387954711914, 'logps/ref_chosen': -70.07804870605469, 'logps/ref_rejected': -88.98612976074219, 'logits/chosen': -1.2796670198440552, 'logits/rejected': -1.1985622644424438, 'kl/p_epsilon_steps': 0.6421874761581421, 'kl/n_epsilon_steps': 0.3578124940395355, 'kl/beta': 0.008305966854095459, 'kl/avg_steps': 0.28437501192092896, 'epoch': 0.21} 21%|██ | 70/340 [03:11<11:54, 2.65s/it] 21%|██ | 71/340 [03:13<11:59, 2.67s/it] 21%|██ | 72/340 [03:16<12:04, 2.70s/it] 21%|██▏ | 73/340 [03:19<12:02, 2.71s/it] 22%|██▏ | 74/340 [03:22<12:01, 2.71s/it] 22%|██▏ | 75/340 [03:24<12:01, 2.72s/it] {'loss': 0.6538, 'grad_norm': 3.9685800075531006, 'learning_rate': 4.792138157142157e-07, 'rewards/chosen': -0.191951185464859, 'rewards/rejected': -0.2879168391227722, 'rewards/accuracies': 0.6796875, 'rewards/margins': 0.09596569836139679, 'logps/chosen': -101.08387756347656, 'logps/rejected': -117.42021179199219, 'logps/ref_chosen': -77.74958801269531, 'logps/ref_rejected': -82.17206573486328, 'logits/chosen': -1.2629064321517944, 'logits/rejected': -1.1684788465499878, 'kl/p_epsilon_steps': 0.59375, 'kl/n_epsilon_steps': 0.40625, 'kl/beta': 0.008209030143916607, 'kl/avg_steps': 0.1875, 'epoch': 0.22} 22%|██▏ | 75/340 [03:24<12:01, 2.72s/it] 22%|██▏ | 76/340 [03:27<12:00, 2.73s/it] 23%|██▎ | 77/340 [03:30<12:00, 2.74s/it] 23%|██▎ | 78/340 [03:32<11:55, 2.73s/it] 23%|██▎ | 79/340 [03:35<11:51, 2.72s/it] 24%|██▎ | 80/340 [03:38<11:44, 2.71s/it] {'loss': 0.6438, 'grad_norm': 4.582348823547363, 'learning_rate': 4.737908228387656e-07, 'rewards/chosen': -0.20838662981987, 'rewards/rejected': -0.32801881432533264, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.11963216215372086, 'logps/chosen': -107.53079986572266, 'logps/rejected': -131.16079711914062, 'logps/ref_chosen': -81.88478088378906, 'logps/ref_rejected': -90.519775390625, 'logits/chosen': 
-1.2720203399658203, 'logits/rejected': -1.218477725982666, 'kl/p_epsilon_steps': 0.621874988079071, 'kl/n_epsilon_steps': 0.37812501192092896, 'kl/beta': 0.008118118159472942, 'kl/avg_steps': 0.24375000596046448, 'epoch': 0.24} 24%|██▎ | 80/340 [03:38<11:44, 2.71s/it] 24%|██▍ | 81/340 [03:41<11:46, 2.73s/it] 24%|██▍ | 82/340 [03:43<11:32, 2.68s/it] 24%|██▍ | 83/340 [03:46<11:18, 2.64s/it] 25%|██▍ | 84/340 [03:48<11:22, 2.67s/it] 25%|██▌ | 85/340 [03:51<11:30, 2.71s/it] {'loss': 0.6418, 'grad_norm': 3.6524829864501953, 'learning_rate': 4.6777824852166437e-07, 'rewards/chosen': -0.20263484120368958, 'rewards/rejected': -0.32719942927360535, 'rewards/accuracies': 0.684374988079071, 'rewards/margins': 0.12456460297107697, 'logps/chosen': -95.5977554321289, 'logps/rejected': -118.98405456542969, 'logps/ref_chosen': -70.41683197021484, 'logps/ref_rejected': -78.02936553955078, 'logits/chosen': -1.2834303379058838, 'logits/rejected': -1.198880672454834, 'kl/p_epsilon_steps': 0.6234375238418579, 'kl/n_epsilon_steps': 0.37187498807907104, 'kl/beta': 0.0080325398594141, 'kl/avg_steps': 0.2515625059604645, 'epoch': 0.25} 25%|██▌ | 85/340 [03:51<11:30, 2.71s/it] 25%|██▌ | 86/340 [03:54<11:28, 2.71s/it] 26%|██▌ | 87/340 [03:57<11:20, 2.69s/it] 26%|██▌ | 88/340 [03:59<11:15, 2.68s/it] 26%|██▌ | 89/340 [04:02<11:16, 2.70s/it] 26%|██▋ | 90/340 [04:05<11:15, 2.70s/it] {'loss': 0.6361, 'grad_norm': 4.22735071182251, 'learning_rate': 4.611919330113591e-07, 'rewards/chosen': -0.23172405362129211, 'rewards/rejected': -0.37008827924728394, 'rewards/accuracies': 0.706250011920929, 'rewards/margins': 0.13836422562599182, 'logps/chosen': -105.8456039428711, 'logps/rejected': -136.4986572265625, 'logps/ref_chosen': -76.6160888671875, 'logps/ref_rejected': -89.49937438964844, 'logits/chosen': -1.2632228136062622, 'logits/rejected': -1.2163931131362915, 'kl/p_epsilon_steps': 0.6421874761581421, 'kl/n_epsilon_steps': 0.35468751192092896, 'kl/beta': 0.007919726893305779, 'kl/avg_steps': 
0.2874999940395355, 'epoch': 0.26} 26%|██▋ | 90/340 [04:05<11:15, 2.70s/it] 27%|██▋ | 91/340 [04:07<11:14, 2.71s/it] 27%|██▋ | 92/340 [04:10<11:22, 2.75s/it] 27%|██▋ | 93/340 [04:13<11:17, 2.74s/it] 28%|██▊ | 94/340 [04:16<11:09, 2.72s/it] 28%|██▊ | 95/340 [04:18<10:57, 2.68s/it] {'loss': 0.6411, 'grad_norm': 4.236695766448975, 'learning_rate': 4.5404922808905543e-07, 'rewards/chosen': -0.24040882289409637, 'rewards/rejected': -0.36987045407295227, 'rewards/accuracies': 0.7015625238418579, 'rewards/margins': 0.1294616460800171, 'logps/chosen': -104.29510498046875, 'logps/rejected': -124.16410827636719, 'logps/ref_chosen': -73.50260162353516, 'logps/ref_rejected': -76.48811340332031, 'logits/chosen': -1.2625572681427002, 'logits/rejected': -1.2011988162994385, 'kl/p_epsilon_steps': 0.637499988079071, 'kl/n_epsilon_steps': 0.36250001192092896, 'kl/beta': 0.0078009068965911865, 'kl/avg_steps': 0.2750000059604645, 'epoch': 0.28} 28%|██▊ | 95/340 [04:18<10:57, 2.68s/it] 28%|██▊ | 96/340 [04:21<11:00, 2.71s/it] 29%|██▊ | 97/340 [04:24<10:48, 2.67s/it] 29%|██▉ | 98/340 [04:26<10:52, 2.70s/it] 29%|██▉ | 99/340 [04:29<11:10, 2.78s/it] 29%|██▉ | 100/340 [04:32<11:05, 2.77s/it] {'loss': 0.6236, 'grad_norm': 4.193100452423096, 'learning_rate': 4.4636895135509966e-07, 'rewards/chosen': -0.24038386344909668, 'rewards/rejected': -0.4081670641899109, 'rewards/accuracies': 0.746874988079071, 'rewards/margins': 0.16778317093849182, 'logps/chosen': -103.88249206542969, 'logps/rejected': -134.57403564453125, 'logps/ref_chosen': -72.6116714477539, 'logps/ref_rejected': -81.16241455078125, 'logits/chosen': -1.2317556142807007, 'logits/rejected': -1.1946831941604614, 'kl/p_epsilon_steps': 0.682812511920929, 'kl/n_epsilon_steps': 0.31718748807907104, 'kl/beta': 0.0076876478269696236, 'kl/avg_steps': 0.3656249940395355, 'epoch': 0.29} 29%|██▉ | 100/340 [04:32<11:05, 2.77s/it][INFO|trainer.py:4307] 2026-04-10 23:48:26,201 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-10 
23:48:26,201 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-10 23:48:26,201 >> Batch size = 16 0%| | 0/18 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-10 23:53:15,687 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-10 23:53:15,688 >> Batch size = 16 0%| | 0/18 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-200 [INFO|configuration_utils.py:419] 2026-04-10 23:53:52,889 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-200/config.json [INFO|configuration_utils.py:911] 2026-04-10 23:53:52,895 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-200/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-10 23:54:34,634 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-200/model.safetensors.index.json. 
[INFO|tokenization_utils_base.py:2510] 2026-04-10 23:54:34,644 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-10 23:54:34,649 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-200/special_tokens_map.json
step 205/340: {'loss': 0.5976, 'grad_norm': 5.395305156707764, 'learning_rate': 2.065879555832674e-07, 'rewards/chosen': -0.42064207792282104, 'rewards/rejected': -0.6905413866043091, 'rewards/accuracies': 0.7015625238418579, 'rewards/margins': 0.26989927887916565, 'logps/chosen': -150.00833129882812, 'logps/rejected': -207.3239288330078, 'logps/ref_chosen': -71.83072662353516, 'logps/ref_rejected': -78.26126861572266, 'logits/chosen': -0.9339988827705383, 'logits/rejected': -0.8349924087524414, 'kl/p_epsilon_steps': 0.6578124761581421, 'kl/n_epsilon_steps': 0.3421874940395355, 'kl/beta': 0.005382629111409187, 'kl/avg_steps': 0.31562501192092896, 'epoch': 0.6}
step 210/340: {'loss': 0.5961, 'grad_norm': 8.636336326599121, 'learning_rate': 1.9401235374032425e-07, 'rewards/chosen': -0.4700423777103424, 'rewards/rejected': -0.7500611543655396, 'rewards/accuracies': 0.7124999761581421, 'rewards/margins': 0.28001874685287476, 'logps/chosen': -169.9760284423828, 'logps/rejected': -226.44479370117188, 'logps/ref_chosen': -81.13362121582031, 'logps/ref_rejected': -83.91246032714844, 'logits/chosen': -0.940881073474884, 'logits/rejected': -0.835827648639679, 'kl/p_epsilon_steps': 0.667187511920929, 'kl/n_epsilon_steps': 0.33281248807907104, 'kl/beta': 0.005294554866850376, 'kl/avg_steps': 0.3343749940395355, 'epoch': 0.62}
step 215/340: {'loss': 0.5994, 'grad_norm': 5.697958946228027, 'learning_rate': 1.8158425248197928e-07, 'rewards/chosen': -0.4653104245662689, 'rewards/rejected': -0.7343412637710571, 'rewards/accuracies': 0.737500011920929, 'rewards/margins': 0.2690308690071106, 'logps/chosen': -168.97909545898438, 'logps/rejected': -225.5334014892578, 'logps/ref_chosen': -79.5214614868164, 'logps/ref_rejected': -83.58778381347656, 'logits/chosen': -0.9595499038696289, 'logits/rejected': -0.8254610300064087, 'kl/p_epsilon_steps': 0.6890624761581421, 'kl/n_epsilon_steps': 0.30937498807907104, 'kl/beta': 0.005207170732319355, 'kl/avg_steps': 0.37968748807907104, 'epoch': 0.63}
step 220/340: {'loss': 0.6056, 'grad_norm': 5.304469108581543, 'learning_rate': 1.6933639389195134e-07, 'rewards/chosen': -0.43559327721595764, 'rewards/rejected': -0.6726005673408508, 'rewards/accuracies': 0.723437488079071, 'rewards/margins': 0.2370072603225708, 'logps/chosen': -166.537353515625, 'logps/rejected': -215.3668670654297, 'logps/ref_chosen': -81.25938415527344, 'logps/ref_rejected': -83.04185485839844, 'logits/chosen': -0.9539089202880859, 'logits/rejected': -0.8665965795516968, 'kl/p_epsilon_steps': 0.667187511920929, 'kl/n_epsilon_steps': 0.33281248807907104, 'kl/beta': 0.005111886188387871, 'kl/avg_steps': 0.3343749940395355, 'epoch': 0.65}
step 225/340: {'loss': 0.5839, 'grad_norm': 5.622444152832031, 'learning_rate': 1.573010452010098e-07, 'rewards/chosen': -0.4237908720970154, 'rewards/rejected': -0.7200239896774292, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2962331175804138, 'logps/chosen': -162.01535034179688, 'logps/rejected': -233.6844024658203, 'logps/ref_chosen': -77.427001953125, 'logps/ref_rejected': -89.23592376708984, 'logits/chosen': -0.9484726190567017, 'logits/rejected': -0.8518384695053101, 'kl/p_epsilon_steps': 0.723437488079071, 'kl/n_epsilon_steps': 0.27656251192092896, 'kl/beta': 0.005018714815378189, 'kl/avg_steps': 0.4468750059604645, 'epoch': 0.66}
step 230/340: {'loss': 0.5866, 'grad_norm': 5.60673189163208, 'learning_rate': 1.4550991377830423e-07, 'rewards/chosen': -0.42099839448928833, 'rewards/rejected': -0.7057094573974609, 'rewards/accuracies': 0.7671874761581421, 'rewards/margins': 0.2847110629081726, 'logps/chosen': -156.29066467285156, 'logps/rejected': -232.82858276367188, 'logps/ref_chosen': -70.1819839477539, 'logps/ref_rejected': -87.79248046875, 'logits/chosen': -0.9383388757705688, 'logits/rejected': -0.856258749961853, 'kl/p_epsilon_steps': 0.7406250238418579, 'kl/n_epsilon_steps': 0.2593750059604645, 'kl/beta': 0.004900630097836256, 'kl/avg_steps': 0.48124998807907104, 'epoch': 0.68}
step 235/340: {'loss': 0.583, 'grad_norm': 5.7163004875183105, 'learning_rate': 1.339940635976592e-07, 'rewards/chosen': -0.4634285569190979, 'rewards/rejected': -0.7673560976982117, 'rewards/accuracies': 0.7734375, 'rewards/margins': 0.30392760038375854, 'logps/chosen': -174.5547637939453, 'logps/rejected': -251.2958526611328, 'logps/ref_chosen': -77.51251220703125, 'logps/ref_rejected': -89.81958770751953, 'logits/chosen': -0.8863986134529114, 'logits/rejected': -0.8059118390083313, 'kl/p_epsilon_steps': 0.7406250238418579, 'kl/n_epsilon_steps': 0.2578125, 'kl/beta': 0.004785512108355761, 'kl/avg_steps': 0.4828124940395355, 'epoch': 0.69}
step 240/340: {'loss': 0.5968, 'grad_norm': 6.860780715942383, 'learning_rate': 1.227838333989088e-07, 'rewards/chosen': -0.4815599322319031, 'rewards/rejected': -0.7540072202682495, 'rewards/accuracies': 0.715624988079071, 'rewards/margins': 0.27244722843170166, 'logps/chosen': -177.47744750976562, 'logps/rejected': -243.76736450195312, 'logps/ref_chosen': -74.5803451538086, 'logps/ref_rejected': -81.81297302246094, 'logits/chosen': -0.8450605273246765, 'logits/rejected': -0.7272099256515503, 'kl/p_epsilon_steps': 0.6812499761581421, 'kl/n_epsilon_steps': 0.3187499940395355, 'kl/beta': 0.004683743230998516, 'kl/avg_steps': 0.36250001192092896, 'epoch': 0.71}
step 245/340: {'loss': 0.5827, 'grad_norm': 5.382650852203369, 'learning_rate': 1.1190875675987355e-07, 'rewards/chosen': -0.46838441491127014, 'rewards/rejected': -0.7710477113723755, 'rewards/accuracies': 0.7281249761581421, 'rewards/margins': 0.30266332626342773, 'logps/chosen': -178.53826904296875, 'logps/rejected': -255.5751495361328, 'logps/ref_chosen': -76.56635284423828, 'logps/ref_rejected': -86.859130859375, 'logits/chosen': -0.8378638029098511, 'logits/rejected': -0.7307332158088684, 'kl/p_epsilon_steps': 0.699999988079071, 'kl/n_epsilon_steps': 0.30000001192092896, 'kl/beta': 0.004598929081112146, 'kl/avg_steps': 0.4000000059604645, 'epoch': 0.72}
step 250/340: {'loss': 0.6155, 'grad_norm': 5.63203763961792, 'learning_rate': 1.0139748428955333e-07, 'rewards/chosen': -0.4800783693790436, 'rewards/rejected': -0.7040629982948303, 'rewards/accuracies': 0.706250011920929, 'rewards/margins': 0.22398455440998077, 'logps/chosen': -183.86294555664062, 'logps/rejected': -237.01980590820312, 'logps/ref_chosen': -77.37183380126953, 'logps/ref_rejected': -79.96475219726562, 'logits/chosen': -0.8333392143249512, 'logits/rejected': -0.7355720400810242, 'kl/p_epsilon_steps': 0.675000011920929, 'kl/n_epsilon_steps': 0.32499998807907104, 'kl/beta': 0.00451111001893878, 'kl/avg_steps': 0.3499999940395355, 'epoch': 0.74}
step 255/340: {'loss': 0.6013, 'grad_norm': 5.822533130645752, 'learning_rate': 9.127770814751932e-08, 'rewards/chosen': -0.46239757537841797, 'rewards/rejected': -0.7161475419998169, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 0.25374993681907654, 'logps/chosen': -184.06822204589844, 'logps/rejected': -246.44998168945312, 'logps/ref_chosen': -79.62632751464844, 'logps/ref_rejected': -83.8196792602539, 'logits/chosen': -0.8416454195976257, 'logits/rejected': -0.7227948904037476, 'kl/p_epsilon_steps': 0.6781250238418579, 'kl/n_epsilon_steps': 0.3218750059604645, 'kl/beta': 0.004430105909705162, 'kl/avg_steps': 0.35624998807907104, 'epoch': 0.75}
step 260/340: {'loss': 0.6056, 'grad_norm': 5.885540008544922, 'learning_rate': 8.15760890883607e-08, 'rewards/chosen': -0.4556017816066742, 'rewards/rejected': -0.6967115998268127, 'rewards/accuracies': 0.7124999761581421, 'rewards/margins': 0.2411097288131714, 'logps/chosen': -184.8510284423828, 'logps/rejected': -246.5211639404297, 'logps/ref_chosen': -80.03411865234375, 'logps/ref_rejected': -85.39453125, 'logits/chosen': -0.8616160154342651, 'logits/rejected': -0.7643041610717773, 'kl/p_epsilon_steps': 0.684374988079071, 'kl/n_epsilon_steps': 0.31562501192092896, 'kl/beta': 0.004350547678768635, 'kl/avg_steps': 0.3687500059604645, 'epoch': 0.76}
step 265/340: {'loss': 0.603, 'grad_norm': 5.461711883544922, 'learning_rate': 7.231818622338822e-08, 'rewards/chosen': -0.432711660861969, 'rewards/rejected': -0.6719975471496582, 'rewards/accuracies': 0.7109375, 'rewards/margins': 0.2392859160900116, 'logps/chosen': -178.0113067626953, 'logps/rejected': -238.1660614013672, 'logps/ref_chosen': -76.63539123535156, 'logps/ref_rejected': -79.94613647460938, 'logits/chosen': -0.8387966156005859, 'logits/rejected': -0.7239198088645935, 'kl/p_epsilon_steps': 0.676562488079071, 'kl/n_epsilon_steps': 0.32343751192092896, 'kl/beta': 0.0042734695598483086, 'kl/avg_steps': 0.3531250059604645, 'epoch': 0.78}
step 270/340: {'loss': 0.6021, 'grad_norm': 5.6931915283203125, 'learning_rate': 6.352838968463919e-08, 'rewards/chosen': -0.4092663824558258, 'rewards/rejected': -0.6513173580169678, 'rewards/accuracies': 0.731249988079071, 'rewards/margins': 0.24205096065998077, 'logps/chosen': -173.6400604248047, 'logps/rejected': -236.97607421875, 'logps/ref_chosen': -76.02762603759766, 'logps/ref_rejected': -80.83404541015625, 'logits/chosen': -0.8596851229667664, 'logits/rejected': -0.7193423509597778, 'kl/p_epsilon_steps': 0.698437511920929, 'kl/n_epsilon_steps': 0.30156248807907104, 'kl/beta': 0.004198429174721241, 'kl/avg_steps': 0.3968749940395355, 'epoch': 0.79}
step 275/340: {'loss': 0.5997, 'grad_norm': 5.091865062713623, 'learning_rate': 5.5229856368582376e-08, 'rewards/chosen': -0.4245019555091858, 'rewards/rejected': -0.6765104532241821, 'rewards/accuracies': 0.739062488079071, 'rewards/margins': 0.25200843811035156, 'logps/chosen': -180.9755859375, 'logps/rejected': -254.04690551757812, 'logps/ref_chosen': -77.58733367919922, 'logps/ref_rejected': -88.50263214111328, 'logits/chosen': -0.8502656817436218, 'logits/rejected': -0.7681766748428345, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.004111775197088718, 'kl/avg_steps': 0.40625, 'epoch': 0.81}
step 280/340: {'loss': 0.5958, 'grad_norm': 5.737886905670166, 'learning_rate': 4.7444448928806615e-08, 'rewards/chosen': -0.4230107367038727, 'rewards/rejected': -0.6823617219924927, 'rewards/accuracies': 0.729687511920929, 'rewards/margins': 0.2593509256839752, 'logps/chosen': -186.74009704589844, 'logps/rejected': -265.301025390625, 'logps/ref_chosen': -81.46415710449219, 'logps/ref_rejected': -94.69911193847656, 'logits/chosen': -0.876409649848938, 'logits/rejected': -0.7702105641365051, 'kl/p_epsilon_steps': 0.7124999761581421, 'kl/n_epsilon_steps': 0.2874999940395355, 'kl/beta': 0.004024769179522991, 'kl/avg_steps': 0.42500001192092896, 'epoch': 0.82}
step 285/340: {'loss': 0.6036, 'grad_norm': 5.05964469909668, 'learning_rate': 4.019267817841834e-08, 'rewards/chosen': -0.41665568947792053, 'rewards/rejected': -0.6515552997589111, 'rewards/accuracies': 0.731249988079071, 'rewards/margins': 0.23489956557750702, 'logps/chosen': -183.66696166992188, 'logps/rejected': -251.9569854736328, 'logps/ref_chosen': -77.9266128540039, 'logps/ref_rejected': -85.77226257324219, 'logits/chosen': -0.8164280652999878, 'logits/rejected': -0.7374383211135864, 'kl/p_epsilon_steps': 0.699999988079071, 'kl/n_epsilon_steps': 0.30000001192092896, 'kl/beta': 0.003945710603147745, 'kl/avg_steps': 0.4000000059604645, 'epoch': 0.84}
step 290/340: {'loss': 0.6008, 'grad_norm': 4.995400428771973, 'learning_rate': 3.349364905389032e-08, 'rewards/chosen': -0.41073670983314514, 'rewards/rejected': -0.6532593965530396, 'rewards/accuracies': 0.7250000238418579, 'rewards/margins': 0.24252267181873322, 'logps/chosen': -178.77645874023438, 'logps/rejected': -253.64126586914062, 'logps/ref_chosen': -72.49942016601562, 'logps/ref_rejected': -83.77849578857422, 'logits/chosen': -0.788993775844574, 'logits/rejected': -0.7104808688163757, 'kl/p_epsilon_steps': 0.6859375238418579, 'kl/n_epsilon_steps': 0.3140625059604645, 'kl/beta': 0.003868584055453539, 'kl/avg_steps': 0.37187498807907104, 'epoch': 0.85}
step 295/340: {'loss': 0.6044, 'grad_norm': 5.24601411819458, 'learning_rate': 2.736501028272095e-08, 'rewards/chosen': -0.4162219166755676, 'rewards/rejected': -0.6554089784622192, 'rewards/accuracies': 0.739062488079071, 'rewards/margins': 0.23918703198432922, 'logps/chosen': -182.55836486816406, 'logps/rejected': -265.3271789550781, 'logps/ref_chosen': -72.81735229492188, 'logps/ref_rejected': -91.62478637695312, 'logits/chosen': -0.7796621918678284, 'logits/rejected': -0.7315692901611328, 'kl/p_epsilon_steps': 0.699999988079071, 'kl/n_epsilon_steps': 0.30000001192092896, 'kl/beta': 0.0037986349780112505, 'kl/avg_steps': 0.4000000059604645, 'epoch': 0.87}
step 300/340: {'loss': 0.6156, 'grad_norm': 5.059004306793213, 'learning_rate': 2.1822907887504932e-08, 'rewards/chosen': -0.40281882882118225, 'rewards/rejected': -0.6089481115341187, 'rewards/accuracies': 0.7265625, 'rewards/margins': 0.20612934231758118, 'logps/chosen': -178.6937255859375, 'logps/rejected': -241.739013671875, 'logps/ref_chosen': -70.4697265625, 'logps/ref_rejected': -77.26274108886719, 'logits/chosen': -0.7761374711990356, 'logits/rejected': -0.6450864672660828, 'kl/p_epsilon_steps': 0.6968749761581421, 'kl/n_epsilon_steps': 0.3031249940395355, 'kl/beta': 0.003725191578269005, 'kl/avg_steps': 0.39375001192092896, 'epoch': 0.88}
[INFO|trainer.py:4307] 2026-04-11 00:02:14,896 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-11 00:02:14,896 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-11 00:02:14,896 >> Batch size = 16
Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-340
[INFO|configuration_utils.py:419] 2026-04-11 00:04:39,876 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-340/config.json
[INFO|configuration_utils.py:911] 2026-04-11 00:04:39,879 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-340/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-11 00:05:19,147 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-340/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-11 00:05:19,153 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-340/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-11 00:05:19,156 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/checkpoint-340/special_tokens_map.json
[INFO|trainer.py:2681] 2026-04-11 00:08:37,503 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1489.7896, 'train_samples_per_second': 29.265, 'train_steps_per_second': 0.228, 'train_loss': 0.6232832217917723, 'epoch': 1.0}
***** train metrics *****
  epoch                    =        1.0
  total_flos               =        0GF
  train_loss               =     0.6233
  train_runtime            = 0:24:49.78
  train_samples            =      43598
  train_samples_per_second =     29.265
  train_steps_per_second   =      0.228
2026-04-11 00:08:37 - INFO - __main__ - *** Training complete ***
2026-04-11 00:08:37 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-11 00:08:55,316 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/config.json
[INFO|configuration_utils.py:911] 2026-04-11 00:08:55,320 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/generation_config.json
[INFO|modeling_utils.py:3580]
2026-04-11 00:09:43,818 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-11 00:09:43,823 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-11 00:09:43,826 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/special_tokens_map.json 2026-04-11 00:09:43 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108 [INFO|modelcard.py:450] 2026-04-11 00:09:44,248 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}} [INFO|configuration_utils.py:419] 2026-04-11 00:09:44,261 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200-20260410-233108/config.json 2026-04-11 00:09:44 - INFO - __main__ - *** Evaluate *** [INFO|trainer.py:4307] 2026-04-11 00:09:44,262 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-11 00:09:44,262 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-11 00:09:44,262 >> Batch size = 16 0%| | 0/18 [00:00
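A quick sanity check on the metrics above: if (assumption, this is the standard DPO convention, not confirmed by this log) the trainer computes rewards as beta * (log pi_theta - log pi_ref), then the dynamic 'kl/beta' value logged at each step reproduces the batch-average rewards. Plugging in the step-280 values from the log recovers 'rewards/chosen', 'rewards/rejected', and 'rewards/margins' to within about 1e-2 (exact agreement is not expected, since the logged numbers are running batch averages while beta varies per step):

```python
# Values copied verbatim from the step-280 log entry above.
beta = 0.004024769179522991                    # 'kl/beta'
logp_chosen, ref_chosen = -186.74009704589844, -81.46415710449219
logp_rejected, ref_rejected = -265.301025390625, -94.69911193847656

# Assumed DPO reward definition: beta * (policy logprob - reference logprob).
reward_chosen = beta * (logp_chosen - ref_chosen)        # ~ -0.4237 vs logged -0.4230
reward_rejected = beta * (logp_rejected - ref_rejected)  # ~ -0.6866 vs logged -0.6824
margin = reward_chosen - reward_rejected                 # ~  0.2629 vs logged  0.2594

print(round(reward_chosen, 4), round(reward_rejected, 4), round(margin, 4))
```

The close match suggests 'kl/beta' is acting as the effective per-step beta (rather than the static beta=0.01 in the config), which is consistent with the "epsilon DPO" naming of this run.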