2026-04-17 21:24:20 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8') 2026-04-17 21:24:20 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/feng.yulu/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None) 2026-04-17 21:24:20 - INFO - __main__ - Training/evaluation parameters MarginDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.1, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_dropout=True, 
disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=100, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=reverse_kl, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=2, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_margin_dataset_id=W-61/llama-3-8b-base-margin-dpo-hh-helpful-margin-log, hub_model_id=W-61/llama-3-8b-base-margin-dpo-hh-helpful, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/llama-3-8b-base-margin-dpo-hh-helpful/runs/Apr17_21-24-20_d4052, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=1, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, margin_dataset_private=None, margin_dataset_split=train, 
margin_log_path=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/margin_logs, margin_log_steps=1, margin_save_full=True, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_margin_dataset=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, remove_unused_columns=False, report_to=['wandb'], require_explicit_ref_model=True, restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=SaveStrategy.STEPS, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/scratch/feng.yulu/dynamic-dpo-v4/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, 
torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=margin_dpo, truncation_mode=keep_end, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, wandb_project=ood-run-4xh200, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, ) 2026-04-17 21:24:20 - INFO - __main__ - W&B project: ood-run-4xh200 2026-04-17 21:24:20 - INFO - __main__ - Margin-DPO parameters: beta=0.1, f_divergence_type=reverse_kl, margin_log_steps=1 2026-04-17 21:24:20 - INFO - __main__ - Using persistent HF datasets cache at /scratch/feng.yulu/dynamic-dpo-v4/hf/datasets 2026-04-17 21:24:23 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.). Normalizing raw HH preferences (train): 0%| | 0/43598 [00:00> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-04-17 21:24:29,059 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2058] 2026-04-17 21:24:29,059 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2026-04-17 21:24:29,059 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-04-17 21:24:29,059 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-04-17 21:24:29,059 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2323] 2026-04-17 21:24:29,510 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
2026-04-17 21:24:30 - INFO - __main__ - Processed train sample 41905: Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|> What is the best way to clean my refrigerator?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|eot_id|><|start_header_id|>user<|end_header_id|> What are your ideas?<|eot_id|> Chosen: <|start_header_id|>assistant<|end_header_id|> Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|eot_id|> Rejected: <|start_header_id|>assistant<|end_header_id|> I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. I was also thinking that you could try using different tools, like a toothbrush, to clean the inside. Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|eot_id|> /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. 
warnings.warn( /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( [INFO|configuration_utils.py:691] 2026-04-17 21:24:30,161 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/config.json [INFO|configuration_utils.py:765] 2026-04-17 21:24:30,180 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 } [INFO|modeling_utils.py:1121] 2026-04-17 21:24:31,292 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-17 21:24:31,297 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [WARNING|logging.py:328] 2026-04-17 21:24:31,299 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. 
Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [INFO|configuration_utils.py:1142] 2026-04-17 21:24:31,300 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false } [WARNING|logging.py:328] 2026-04-17 21:24:31,300 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:821] 2026-04-17 21:24:31,560 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. 
Loading checkpoint shards: 14%|███████▊ | 1/7 [00:17<01:42, 17.09s/it] Loading checkpoint shards: 29%|███████████████▋ | 2/7 [00:32<01:21, 16.32s/it] Loading checkpoint shards: 43%|███████████████████████▌ | 3/7 [00:44<00:56, 14.15s/it] Loading checkpoint shards: 57%|███████████████████████████████▍ | 4/7 [00:59<00:43, 14.35s/it] Loading checkpoint shards: 71%|███████████████████████████████████████▎ | 5/7 [01:11<00:26, 13.48s/it] Loading checkpoint shards: 86%|███████████████████████████████████████████████▏ | 6/7 [01:25<00:13, 13.67s/it] Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 7/7 [01:32<00:00, 11.56s/it] Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 7/7 [01:32<00:00, 13.19s/it] [INFO|modeling_utils.py:4926] 2026-04-17 21:26:03,755 >> All model checkpoint weights were used when initializing LlamaForCausalLM. [INFO|modeling_utils.py:4934] 2026-04-17 21:26:03,755 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. 
[INFO|configuration_utils.py:1095] 2026-04-17 21:26:03,759 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-17 21:26:03,759 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 } [INFO|configuration_utils.py:691] 2026-04-17 21:26:03,761 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/config.json [INFO|configuration_utils.py:765] 2026-04-17 21:26:03,761 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 } [INFO|modeling_utils.py:1121] 2026-04-17 21:26:03,765 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-17 21:26:03,765 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1142] 2026-04-17 21:26:03,771 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false } Loading checkpoint shards: 0%| | 0/7 [00:00> All model checkpoint weights were used when initializing LlamaForCausalLM. 
[INFO|modeling_utils.py:4934] 2026-04-17 21:26:15,141 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:1095] 2026-04-17 21:26:15,145 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-17 21:26:15,145 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 } [WARNING|trainer.py:821] 2026-04-17 21:26:15,146 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:816] 2026-04-17 21:26:15,147 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:15,408 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:15,481 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:15,553 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:521: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MarginDPOTrainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|trainer.py:816] 2026-04-17 21:26:17,411 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. 
[WARNING|trainer.py:816] 2026-04-17 21:26:17,412 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:17,415 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:17,421 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:17,423 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:17,426 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-17 21:26:17,428 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:521: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MarginDPOTrainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|trainer.py:816] 2026-04-17 21:26:17,429 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. /home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:521: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MarginDPOTrainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|trainer.py:816] 2026-04-17 21:26:17,431 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. 
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:521: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MarginDPOTrainer.__init__`. Use `processing_class` instead. super().__init__( [INFO|trainer.py:748] 2026-04-17 21:26:18,110 >> Using auto half precision backend /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight. warnings.warn( /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaDecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight. warnings.warn( /home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints. warnings.warn( [INFO|trainer.py:2414] 2026-04-17 21:26:22,444 >> ***** Running training ***** [INFO|trainer.py:2415] 2026-04-17 21:26:22,444 >> Num examples = 43,598 [INFO|trainer.py:2416] 2026-04-17 21:26:22,444 >> Num Epochs = 1 [INFO|trainer.py:2417] 2026-04-17 21:26:22,444 >> Instantaneous batch size per device = 8 [INFO|trainer.py:2420] 2026-04-17 21:26:22,444 >> Total train batch size (w. 
parallel, distributed & accumulation) = 64 [INFO|trainer.py:2421] 2026-04-17 21:26:22,444 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2422] 2026-04-17 21:26:22,444 >> Total optimization steps = 681 [INFO|trainer.py:2423] 2026-04-17 21:26:22,445 >> Number of trainable parameters = 2,007,565,312 [INFO|integration_utils.py:831] 2026-04-17 21:26:22,445 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: can-not-fand (can-not-fand-northeastern-university). Use `wandb login --relogin` to force relogin wandb: wandb version 0.26.0 is available! To upgrade, please run: wandb: $ pip install wandb --upgrade wandb: Tracking run with wandb version 0.17.5 wandb: Run data is saved locally in /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260417_212625-f4hzpnwr wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312 wandb: ⭐️ View project at https://wandb.ai/can-not-fand-northeastern-university/ood-run-4xh200 wandb: 🚀 View run at https://wandb.ai/can-not-fand-northeastern-university/ood-run-4xh200/runs/f4hzpnwr 0%| | 0/681 [00:00> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-17 21:26:33,470 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-17 21:26:33,470 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-17 21:26:33,471 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%| | 1/681 [00:04<51:18, 4.53s/it] {'loss': 1.389, 'grad_norm': 83.525146484375, 'learning_rate': 0.0, 'margin_dpo/margin_mean': -0.02287048101425171, 'margin_dpo/margin_std': 0.41920793056488037, 
'logps/chosen': -50.1435661315918, 'logps/rejected': -74.09991455078125, 'logps/ref_chosen': -50.14883804321289, 'logps/ref_rejected': -74.1280517578125, 'logits/chosen': -0.4974287748336792, 'logits/rejected': -0.43299180269241333, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': -0.02287006378173828, 'margin_dpo/beta_margin_mean': -0.0022870064713060856, 'margin_dpo/beta_margin_std': 0.0420234240591526, 'margin_dpo/beta_margin_grad_mean': -0.5005706548690796, 'margin_dpo/beta_margin_grad_std': 0.010499694384634495, 'epoch': 0.0} 0%| | 1/681 [00:04<51:18, 4.53s/it] 0%|▏ | 2/681 [00:07<42:19, 3.74s/it] {'loss': 1.3932, 'grad_norm': 72.20420837402344, 'learning_rate': 7.246376811594203e-09, 'margin_dpo/margin_mean': -0.06572240591049194, 'margin_dpo/margin_std': 0.35048407316207886, 'logps/chosen': -52.65569305419922, 'logps/rejected': -75.27340698242188, 'logps/ref_chosen': -52.620704650878906, 'logps/ref_rejected': -75.30413818359375, 'logits/chosen': -0.4953641891479492, 'logits/rejected': -0.4594460129737854, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': -0.06572261452674866, 'margin_dpo/beta_margin_mean': -0.006572261452674866, 'margin_dpo/beta_margin_std': 0.03523966670036316, 'margin_dpo/beta_margin_grad_mean': -0.5016425848007202, 'margin_dpo/beta_margin_grad_std': 0.008806563913822174, 'epoch': 0.0} 0%|▏ | 2/681 [00:07<42:19, 3.74s/it] 0%|▎ | 3/681 [00:10<39:41, 3.51s/it] {'loss': 1.3882, 'grad_norm': 70.93851470947266, 'learning_rate': 1.4492753623188406e-08, 'margin_dpo/margin_mean': -0.01640373468399048, 'margin_dpo/margin_std': 0.33020099997520447, 'logps/chosen': -60.9985466003418, 'logps/rejected': -68.67314147949219, 'logps/ref_chosen': -60.98159408569336, 'logps/ref_rejected': -68.67259216308594, 'logits/chosen': -0.4816606044769287, 'logits/rejected': -0.44218793511390686, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': -0.01640462875366211, 'margin_dpo/beta_margin_mean': 
-0.001640463131479919, 'margin_dpo/beta_margin_std': 0.03315068036317825, 'margin_dpo/beta_margin_grad_mean': -0.5004101395606995, 'margin_dpo/beta_margin_grad_std': 0.008285283111035824, 'epoch': 0.0} 0%|▎ | 3/681 [00:11<39:41, 3.51s/it] 1%|▍ | 4/681 [00:14<39:12, 3.47s/it] {'loss': 1.3857, 'grad_norm': 71.9634780883789, 'learning_rate': 2.1739130434782606e-08, 'margin_dpo/margin_mean': 0.0101853609085083, 'margin_dpo/margin_std': 0.40629148483276367, 'logps/chosen': -56.74000930786133, 'logps/rejected': -86.62959289550781, 'logps/ref_chosen': -56.76771545410156, 'logps/ref_rejected': -86.64710998535156, 'logits/chosen': -0.4688633680343628, 'logits/rejected': -0.4411826729774475, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.01018574833869934, 'margin_dpo/beta_margin_mean': 0.0010185746941715479, 'margin_dpo/beta_margin_std': 0.04087061062455177, 'margin_dpo/beta_margin_grad_mean': -0.49974533915519714, 'margin_dpo/beta_margin_grad_std': 0.010213336907327175, 'epoch': 0.01} 1%|▍ | 4/681 [00:14<39:12, 3.47s/it] 1%|▌ | 5/681 [00:17<37:45, 3.35s/it] {'loss': 1.3838, 'grad_norm': 89.44969940185547, 'learning_rate': 2.898550724637681e-08, 'margin_dpo/margin_mean': 0.02979910373687744, 'margin_dpo/margin_std': 0.4284527897834778, 'logps/chosen': -53.81106185913086, 'logps/rejected': -84.13066864013672, 'logps/ref_chosen': -53.859375, 'logps/ref_rejected': -84.14918518066406, 'logits/chosen': -0.5144953727722168, 'logits/rejected': -0.4707370400428772, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.029798835515975952, 'margin_dpo/beta_margin_mean': 0.0029798836912959814, 'margin_dpo/beta_margin_std': 0.043392810970544815, 'margin_dpo/beta_margin_grad_mean': -0.49925631284713745, 'margin_dpo/beta_margin_grad_std': 0.010840461589396, 'epoch': 0.01} 1%|▌ | 5/681 [00:17<37:45, 3.35s/it] 1%|▋ | 6/681 [00:19<34:05, 3.03s/it] {'loss': 1.3862, 'grad_norm': 91.85087585449219, 'learning_rate': 3.6231884057971014e-08, 
'margin_dpo/margin_mean': 0.0043981969356536865, 'margin_dpo/margin_std': 0.37970417737960815, 'logps/chosen': -63.01681137084961, 'logps/rejected': -92.65907287597656, 'logps/ref_chosen': -63.007484436035156, 'logps/ref_rejected': -92.64534759521484, 'logits/chosen': -0.5226503610610962, 'logits/rejected': -0.48189258575439453, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.004398524761199951, 'margin_dpo/beta_margin_mean': 0.00043985259253531694, 'margin_dpo/beta_margin_std': 0.03865039348602295, 'margin_dpo/beta_margin_grad_mean': -0.499889999628067, 'margin_dpo/beta_margin_grad_std': 0.009657730348408222, 'epoch': 0.01} 1%|▋ | 6/681 [00:19<34:05, 3.03s/it] 1%|▊ | 7/681 [00:22<31:50, 2.83s/it] {'loss': 1.3851, 'grad_norm': 82.43697357177734, 'learning_rate': 4.347826086956521e-08, 'margin_dpo/margin_mean': 0.01658591628074646, 'margin_dpo/margin_std': 0.4064858555793762, 'logps/chosen': -57.743560791015625, 'logps/rejected': -103.90592193603516, 'logps/ref_chosen': -57.774818420410156, 'logps/ref_rejected': -103.92059326171875, 'logits/chosen': -0.5088996887207031, 'logits/rejected': -0.4749848246574402, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.016585499048233032, 'margin_dpo/beta_margin_mean': 0.0016585501143708825, 'margin_dpo/beta_margin_std': 0.04097241163253784, 'margin_dpo/beta_margin_grad_mean': -0.4995860159397125, 'margin_dpo/beta_margin_grad_std': 0.01023741252720356, 'epoch': 0.01} 1%|▊ | 7/681 [00:22<31:50, 2.83s/it] 1%|▉ | 8/681 [00:24<30:29, 2.72s/it] {'loss': 1.3896, 'grad_norm': 79.04316711425781, 'learning_rate': 5.0724637681159424e-08, 'margin_dpo/margin_mean': -0.028907448053359985, 'margin_dpo/margin_std': 0.37828418612480164, 'logps/chosen': -58.70497512817383, 'logps/rejected': -79.27145385742188, 'logps/ref_chosen': -58.716033935546875, 'logps/ref_rejected': -79.3114242553711, 'logits/chosen': -0.5012874007225037, 'logits/rejected': -0.4746849238872528, 'margin_dpo/beta': 
0.10000000149011612, 'margin_dpo/loss_margin_mean': -0.028907686471939087, 'margin_dpo/beta_margin_mean': -0.0028907686937600374, 'margin_dpo/beta_margin_std': 0.038289591670036316, 'margin_dpo/beta_margin_grad_mean': -0.5007215142250061, 'margin_dpo/beta_margin_grad_std': 0.009568445384502411, 'epoch': 0.01} 1%|▉ | 8/681 [00:24<30:29, 2.72s/it] 1%|█ | 9/681 [00:27<30:35, 2.73s/it] {'loss': 1.3856, 'grad_norm': 85.21879577636719, 'learning_rate': 5.797101449275362e-08, 'margin_dpo/margin_mean': 0.011951416730880737, 'margin_dpo/margin_std': 0.4246274530887604, 'logps/chosen': -69.87384033203125, 'logps/rejected': -99.62161254882812, 'logps/ref_chosen': -69.8668441772461, 'logps/ref_rejected': -99.6026611328125, 'logits/chosen': -0.4914604127407074, 'logits/rejected': -0.44458478689193726, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.011951535940170288, 'margin_dpo/beta_margin_mean': 0.0011951536871492863, 'margin_dpo/beta_margin_std': 0.04292509704828262, 'margin_dpo/beta_margin_grad_mean': -0.49970191717147827, 'margin_dpo/beta_margin_grad_std': 0.010726687498390675, 'epoch': 0.01} 1%|█ | 9/681 [00:27<30:35, 2.73s/it] 1%|█▏ | 10/681 [00:30<30:39, 2.74s/it] {'loss': 1.3808, 'grad_norm': 70.79057312011719, 'learning_rate': 6.521739130434782e-08, 'margin_dpo/margin_mean': 0.05922728776931763, 'margin_dpo/margin_std': 0.425285279750824, 'logps/chosen': -48.30955505371094, 'logps/rejected': -80.38316345214844, 'logps/ref_chosen': -48.35768508911133, 'logps/ref_rejected': -80.37206268310547, 'logits/chosen': -0.5021112561225891, 'logits/rejected': -0.45928800106048584, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.05922754108905792, 'margin_dpo/beta_margin_mean': 0.005922754295170307, 'margin_dpo/beta_margin_std': 0.04276762157678604, 'margin_dpo/beta_margin_grad_mean': -0.4985186755657196, 'margin_dpo/beta_margin_grad_std': 0.010679498314857483, 'epoch': 0.01} 1%|█▏ | 10/681 [00:30<30:39, 2.74s/it] 2%|█▎ | 11/681 
[00:33<30:48, 2.76s/it] {'loss': 1.382, 'grad_norm': 68.34065246582031, 'learning_rate': 7.246376811594203e-08, 'margin_dpo/margin_mean': 0.04697957634925842, 'margin_dpo/margin_std': 0.3766877055168152, 'logps/chosen': -52.98234558105469, 'logps/rejected': -87.7928466796875, 'logps/ref_chosen': -53.01685333251953, 'logps/ref_rejected': -87.78038024902344, 'logits/chosen': -0.46157172322273254, 'logits/rejected': -0.4366176128387451, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.04697933793067932, 'margin_dpo/beta_margin_mean': 0.004697933793067932, 'margin_dpo/beta_margin_std': 0.03841574117541313, 'margin_dpo/beta_margin_grad_mean': -0.4988263249397278, 'margin_dpo/beta_margin_grad_std': 0.009599917568266392, 'epoch': 0.02} 2%|█▎ | 11/681 [00:33<30:48, 2.76s/it] 2%|█▍ | 12/681 [00:35<30:31, 2.74s/it] {'loss': 1.383, 'grad_norm': 90.25657653808594, 'learning_rate': 7.971014492753623e-08, 'margin_dpo/margin_mean': 0.03697209060192108, 'margin_dpo/margin_std': 0.3801400065422058, 'logps/chosen': -61.82605743408203, 'logps/rejected': -104.91586303710938, 'logps/ref_chosen': -61.80543518066406, 'logps/ref_rejected': -104.85826873779297, 'logits/chosen': -0.5372684001922607, 'logits/rejected': -0.5010780096054077, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.036972567439079285, 'margin_dpo/beta_margin_mean': 0.003697256790474057, 'margin_dpo/beta_margin_std': 0.03862835466861725, 'margin_dpo/beta_margin_grad_mean': -0.49907633662223816, 'margin_dpo/beta_margin_grad_std': 0.009649958461523056, 'epoch': 0.02} 2%|█▍ | 12/681 [00:35<30:31, 2.74s/it] 2%|█▌ | 13/681 [00:38<30:59, 2.78s/it] {'loss': 1.3865, 'grad_norm': 79.32652282714844, 'learning_rate': 8.695652173913042e-08, 'margin_dpo/margin_mean': 0.0019735991954803467, 'margin_dpo/margin_std': 0.4049326777458191, 'logps/chosen': -64.28887176513672, 'logps/rejected': -87.23356628417969, 'logps/ref_chosen': -64.26036071777344, 'logps/ref_rejected': 
-87.20307922363281, 'logits/chosen': -0.4902585744857788, 'logits/rejected': -0.46292757987976074, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.0019729435443878174, 'margin_dpo/beta_margin_mean': 0.00019729437190108, 'margin_dpo/beta_margin_std': 0.04214153066277504, 'margin_dpo/beta_margin_grad_mean': -0.49995219707489014, 'margin_dpo/beta_margin_grad_std': 0.010526234284043312, 'epoch': 0.02} 2%|█▌ | 13/681 [00:38<30:59, 2.78s/it] 2%|█▌ | 14/681 [00:41<30:21, 2.73s/it] {'loss': 1.3863, 'grad_norm': 85.4604263305664, 'learning_rate': 9.420289855072464e-08, 'margin_dpo/margin_mean': 0.005887240171432495, 'margin_dpo/margin_std': 0.47125041484832764, 'logps/chosen': -58.152305603027344, 'logps/rejected': -104.09505462646484, 'logps/ref_chosen': -58.11021423339844, 'logps/ref_rejected': -104.04708099365234, 'logits/chosen': -0.489965558052063, 'logits/rejected': -0.4511108696460724, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.005887240171432495, 'margin_dpo/beta_margin_mean': 0.000588723982218653, 'margin_dpo/beta_margin_std': 0.047432418912649155, 'margin_dpo/beta_margin_grad_mean': -0.4998512864112854, 'margin_dpo/beta_margin_grad_std': 0.011847623623907566, 'epoch': 0.02} 2%|█▌ | 14/681 [00:41<30:21, 2.73s/it] 2%|█▋ | 15/681 [00:43<30:04, 2.71s/it] {'loss': 1.3824, 'grad_norm': 64.13221740722656, 'learning_rate': 1.0144927536231885e-07, 'margin_dpo/margin_mean': 0.042571812868118286, 'margin_dpo/margin_std': 0.39672398567199707, 'logps/chosen': -56.97354507446289, 'logps/rejected': -80.85784912109375, 'logps/ref_chosen': -56.96691131591797, 'logps/ref_rejected': -80.80863952636719, 'logits/chosen': -0.46068376302719116, 'logits/rejected': -0.44027313590049744, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.042571574449539185, 'margin_dpo/beta_margin_mean': 0.0042571574449539185, 'margin_dpo/beta_margin_std': 0.03996788337826729, 'margin_dpo/beta_margin_grad_mean': 
-0.49893510341644287, 'margin_dpo/beta_margin_grad_std': 0.009985481388866901, 'epoch': 0.02} 2%|█▋ | 15/681 [00:44<30:04, 2.71s/it] 2%|█▊ | 16/681 [00:46<29:30, 2.66s/it] {'loss': 1.3848, 'grad_norm': 84.14559173583984, 'learning_rate': 1.0869565217391303e-07, 'margin_dpo/margin_mean': 0.01766011118888855, 'margin_dpo/margin_std': 0.3431432843208313, 'logps/chosen': -61.73296356201172, 'logps/rejected': -84.38020324707031, 'logps/ref_chosen': -61.739891052246094, 'logps/ref_rejected': -84.36947631835938, 'logits/chosen': -0.52532559633255, 'logits/rejected': -0.4843023419380188, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.017660528421401978, 'margin_dpo/beta_margin_mean': 0.001766052795574069, 'margin_dpo/beta_margin_std': 0.03466500714421272, 'margin_dpo/beta_margin_grad_mean': -0.49955853819847107, 'margin_dpo/beta_margin_grad_std': 0.008663349784910679, 'epoch': 0.02} 2%|█▊ | 16/681 [00:46<29:30, 2.66s/it] 2%|█▉ | 17/681 [00:49<29:03, 2.63s/it] {'loss': 1.3816, 'grad_norm': 78.68696594238281, 'learning_rate': 1.1594202898550725e-07, 'margin_dpo/margin_mean': 0.04995712637901306, 'margin_dpo/margin_std': 0.3325832486152649, 'logps/chosen': -67.70388793945312, 'logps/rejected': -85.42217254638672, 'logps/ref_chosen': -67.71033477783203, 'logps/ref_rejected': -85.37865447998047, 'logits/chosen': -0.5094451308250427, 'logits/rejected': -0.4733882546424866, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.0499575138092041, 'margin_dpo/beta_margin_mean': 0.0049957516603171825, 'margin_dpo/beta_margin_std': 0.034035272896289825, 'margin_dpo/beta_margin_grad_mean': -0.49875107407569885, 'margin_dpo/beta_margin_grad_std': 0.008506165817379951, 'epoch': 0.02} 2%|█▉ | 17/681 [00:49<29:03, 2.63s/it] 3%|██ | 18/681 [00:51<29:00, 2.62s/it] {'loss': 1.3794, 'grad_norm': 81.91975402832031, 'learning_rate': 1.2318840579710146e-07, 'margin_dpo/margin_mean': 0.0720413327217102, 'margin_dpo/margin_std': 0.3442285656929016, 
'logps/chosen': -47.723114013671875, 'logps/rejected': -75.5279541015625, 'logps/ref_chosen': -47.7394905090332, 'logps/ref_rejected': -75.4722900390625, 'logits/chosen': -0.4996645152568817, 'logits/rejected': -0.4448869228363037, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.07204136252403259, 'margin_dpo/beta_margin_mean': 0.007204136345535517, 'margin_dpo/beta_margin_std': 0.03471643477678299, 'margin_dpo/beta_margin_grad_mean': -0.49819934368133545, 'margin_dpo/beta_margin_grad_std': 0.008676947094500065, 'epoch': 0.03} 3%|██ | 18/681 [00:51<29:00, 2.62s/it] 3%|██▏ | 19/681 [00:54<28:57, 2.62s/it] {'loss': 1.3833, 'grad_norm': 73.45258331298828, 'learning_rate': 1.3043478260869563e-07, 'margin_dpo/margin_mean': 0.03309273719787598, 'margin_dpo/margin_std': 0.3704480528831482, 'logps/chosen': -70.22134399414062, 'logps/rejected': -89.80667114257812, 'logps/ref_chosen': -70.20535278320312, 'logps/ref_rejected': -89.75758361816406, 'logits/chosen': -0.5062457323074341, 'logits/rejected': -0.45754408836364746, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.033092111349105835, 'margin_dpo/beta_margin_mean': 0.003309211228042841, 'margin_dpo/beta_margin_std': 0.03776707127690315, 'margin_dpo/beta_margin_grad_mean': -0.49917319416999817, 'margin_dpo/beta_margin_grad_std': 0.009437629953026772, 'epoch': 0.03} 3%|██▏ | 19/681 [00:54<28:57, 2.62s/it] 3%|██▎ | 20/681 [00:57<29:07, 2.64s/it] {'loss': 1.3825, 'grad_norm': 73.92622375488281, 'learning_rate': 1.3768115942028986e-07, 'margin_dpo/margin_mean': 0.0407865047454834, 'margin_dpo/margin_std': 0.29486507177352905, 'logps/chosen': -50.828826904296875, 'logps/rejected': -78.88971710205078, 'logps/ref_chosen': -50.80324172973633, 'logps/ref_rejected': -78.8233413696289, 'logits/chosen': -0.5687921643257141, 'logits/rejected': -0.5141441226005554, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.040786296129226685, 
'margin_dpo/beta_margin_mean': 0.004078629892319441, 'margin_dpo/beta_margin_std': 0.03044736012816429, 'margin_dpo/beta_margin_grad_mean': -0.4989803433418274, 'margin_dpo/beta_margin_grad_std': 0.007609857711941004, 'epoch': 0.03} 3%|██▎ | 20/681 [00:57<29:07, 2.64s/it] 3%|██▍ | 21/681 [00:59<28:46, 2.62s/it] {'loss': 1.375, 'grad_norm': 77.78363037109375, 'learning_rate': 1.4492753623188405e-07, 'margin_dpo/margin_mean': 0.11629366874694824, 'margin_dpo/margin_std': 0.34371477365493774, 'logps/chosen': -50.0500373840332, 'logps/rejected': -77.97210693359375, 'logps/ref_chosen': -50.063018798828125, 'logps/ref_rejected': -77.86878967285156, 'logits/chosen': -0.49086394906044006, 'logits/rejected': -0.4666551351547241, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.1162932813167572, 'margin_dpo/beta_margin_mean': 0.01162932813167572, 'margin_dpo/beta_margin_std': 0.03486839681863785, 'margin_dpo/beta_margin_grad_mean': -0.4970940947532654, 'margin_dpo/beta_margin_grad_std': 0.008713486604392529, 'epoch': 0.03} 3%|██▍ | 21/681 [00:59<28:46, 2.62s/it] 3%|██▌ | 22/681 [01:02<30:20, 2.76s/it] {'loss': 1.3615, 'grad_norm': 84.3017349243164, 'learning_rate': 1.5217391304347825e-07, 'margin_dpo/margin_mean': 0.2547217905521393, 'margin_dpo/margin_std': 0.4430729150772095, 'logps/chosen': -58.9935417175293, 'logps/rejected': -97.69529724121094, 'logps/ref_chosen': -59.05763626098633, 'logps/ref_rejected': -97.50466918945312, 'logits/chosen': -0.4743150472640991, 'logits/rejected': -0.4301157593727112, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.2547225058078766, 'margin_dpo/beta_margin_mean': 0.025472251698374748, 'margin_dpo/beta_margin_std': 0.044476091861724854, 'margin_dpo/beta_margin_grad_mean': -0.4936363697052002, 'margin_dpo/beta_margin_grad_std': 0.01110980473458767, 'epoch': 0.03} 3%|██▌ | 22/681 [01:02<30:20, 2.76s/it] 3%|██▋ | 23/681 [01:05<31:37, 2.88s/it] {'loss': 1.364, 'grad_norm': 80.28763580322266, 
'learning_rate': 1.5942028985507245e-07, 'margin_dpo/margin_mean': 0.22987452149391174, 'margin_dpo/margin_std': 0.4392421543598175, 'logps/chosen': -60.04255676269531, 'logps/rejected': -81.33428955078125, 'logps/ref_chosen': -60.07769775390625, 'logps/ref_rejected': -81.1395492553711, 'logits/chosen': -0.4873223900794983, 'logits/rejected': -0.4646031856536865, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.22987452149391174, 'margin_dpo/beta_margin_mean': 0.022987453266978264, 'margin_dpo/beta_margin_std': 0.04579947143793106, 'margin_dpo/beta_margin_grad_mean': -0.49425840377807617, 'margin_dpo/beta_margin_grad_std': 0.011434967629611492, 'epoch': 0.03} 3%|██▋ | 23/681 [01:05<31:37, 2.88s/it] 4%|██▊ | 24/681 [01:08<30:58, 2.83s/it] {'loss': 1.3629, 'grad_norm': 80.72453308105469, 'learning_rate': 1.6666666666666665e-07, 'margin_dpo/margin_mean': 0.24034002423286438, 'margin_dpo/margin_std': 0.42840874195098877, 'logps/chosen': -44.27165985107422, 'logps/rejected': -99.34617614746094, 'logps/ref_chosen': -44.29103469848633, 'logps/ref_rejected': -99.12521362304688, 'logits/chosen': -0.479617714881897, 'logits/rejected': -0.46357664465904236, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.2403390109539032, 'margin_dpo/beta_margin_mean': 0.02403390221297741, 'margin_dpo/beta_margin_std': 0.04411429166793823, 'margin_dpo/beta_margin_grad_mean': -0.49399420619010925, 'margin_dpo/beta_margin_grad_std': 0.01102045550942421, 'epoch': 0.04} 4%|██▊ | 24/681 [01:08<30:58, 2.83s/it] 4%|██▉ | 25/681 [01:11<30:33, 2.79s/it] {'loss': 1.3645, 'grad_norm': 73.97421264648438, 'learning_rate': 1.7391304347826085e-07, 'margin_dpo/margin_mean': 0.22478067874908447, 'margin_dpo/margin_std': 0.4543741047382355, 'logps/chosen': -52.51414489746094, 'logps/rejected': -89.54405975341797, 'logps/ref_chosen': -52.537052154541016, 'logps/ref_rejected': -89.34219360351562, 'logits/chosen': -0.49460622668266296, 'logits/rejected': 
-0.4645787179470062, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.22478055953979492, 'margin_dpo/beta_margin_mean': 0.02247805707156658, 'margin_dpo/beta_margin_std': 0.045543402433395386, 'margin_dpo/beta_margin_grad_mean': -0.4943839907646179, 'margin_dpo/beta_margin_grad_std': 0.011376350186765194, 'epoch': 0.04} 4%|██▉ | 25/681 [01:11<30:33, 2.79s/it] 4%|███ | 26/681 [01:13<28:58, 2.65s/it] {'loss': 1.3457, 'grad_norm': 87.36368560791016, 'learning_rate': 1.8115942028985507e-07, 'margin_dpo/margin_mean': 0.41762077808380127, 'margin_dpo/margin_std': 0.5226191282272339, 'logps/chosen': -53.813804626464844, 'logps/rejected': -103.66832733154297, 'logps/ref_chosen': -53.92280578613281, 'logps/ref_rejected': -103.35971069335938, 'logits/chosen': -0.5448323488235474, 'logits/rejected': -0.5133931636810303, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.4176199734210968, 'margin_dpo/beta_margin_mean': 0.04176199808716774, 'margin_dpo/beta_margin_std': 0.05279136076569557, 'margin_dpo/beta_margin_grad_mean': -0.48957008123397827, 'margin_dpo/beta_margin_grad_std': 0.013178465887904167, 'epoch': 0.04} 4%|███ | 26/681 [01:13<28:58, 2.65s/it] 4%|███▏ | 27/681 [01:16<28:27, 2.61s/it] {'loss': 1.3374, 'grad_norm': 94.08861541748047, 'learning_rate': 1.8840579710144927e-07, 'margin_dpo/margin_mean': 0.5043210983276367, 'margin_dpo/margin_std': 0.5811291933059692, 'logps/chosen': -42.766082763671875, 'logps/rejected': -99.09607696533203, 'logps/ref_chosen': -42.898529052734375, 'logps/ref_rejected': -98.72420501708984, 'logits/chosen': -0.5202087163925171, 'logits/rejected': -0.4837333858013153, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.504321813583374, 'margin_dpo/beta_margin_mean': 0.05043218284845352, 'margin_dpo/beta_margin_std': 0.05854206159710884, 'margin_dpo/beta_margin_grad_mean': -0.4874098598957062, 'margin_dpo/beta_margin_grad_std': 0.014595179818570614, 'epoch': 0.04} 4%|███▏ | 
27/681 [01:16<28:27, 2.61s/it] 4%|███▏ | 28/681 [01:18<28:24, 2.61s/it] {'loss': 1.3547, 'grad_norm': 75.05455780029297, 'learning_rate': 1.9565217391304347e-07, 'margin_dpo/margin_mean': 0.3272559344768524, 'margin_dpo/margin_std': 0.5973866581916809, 'logps/chosen': -60.553565979003906, 'logps/rejected': -91.7254409790039, 'logps/ref_chosen': -60.55650329589844, 'logps/ref_rejected': -91.40111541748047, 'logits/chosen': -0.5194311141967773, 'logits/rejected': -0.46526244282722473, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.3272556662559509, 'margin_dpo/beta_margin_mean': 0.03272556886076927, 'margin_dpo/beta_margin_std': 0.06033402308821678, 'margin_dpo/beta_margin_grad_mean': -0.4918249249458313, 'margin_dpo/beta_margin_grad_std': 0.015058773569762707, 'epoch': 0.04} 4%|███▏ | 28/681 [01:18<28:24, 2.61s/it] 4%|███▎ | 29/681 [01:21<27:28, 2.53s/it] {'loss': 1.3289, 'grad_norm': 90.46174621582031, 'learning_rate': 2.028985507246377e-07, 'margin_dpo/margin_mean': 0.5928229689598083, 'margin_dpo/margin_std': 0.6189556121826172, 'logps/chosen': -57.68913269042969, 'logps/rejected': -97.86851501464844, 'logps/ref_chosen': -57.80778503417969, 'logps/ref_rejected': -97.39434814453125, 'logits/chosen': -0.5414900779724121, 'logits/rejected': -0.49426716566085815, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.5928221344947815, 'margin_dpo/beta_margin_mean': 0.05928221344947815, 'margin_dpo/beta_margin_std': 0.062019772827625275, 'margin_dpo/beta_margin_grad_mean': -0.4852002263069153, 'margin_dpo/beta_margin_grad_std': 0.015466224402189255, 'epoch': 0.04} 4%|███▎ | 29/681 [01:21<27:28, 2.53s/it] 4%|███▍ | 30/681 [01:23<28:01, 2.58s/it] {'loss': 1.3197, 'grad_norm': 87.33443450927734, 'learning_rate': 2.1014492753623187e-07, 'margin_dpo/margin_mean': 0.6878979206085205, 'margin_dpo/margin_std': 0.62163245677948, 'logps/chosen': -52.40911102294922, 'logps/rejected': -99.00884246826172, 'logps/ref_chosen': 
-52.57737350463867, 'logps/ref_rejected': -98.48921203613281, 'logits/chosen': -0.4894167184829712, 'logits/rejected': -0.45850175619125366, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.6878980398178101, 'margin_dpo/beta_margin_mean': 0.06878980994224548, 'margin_dpo/beta_margin_std': 0.06341779977083206, 'margin_dpo/beta_margin_grad_mean': -0.48282885551452637, 'margin_dpo/beta_margin_grad_std': 0.01581035926938057, 'epoch': 0.04} 4%|███▍ | 30/681 [01:23<28:01, 2.58s/it] 5%|███▌ | 31/681 [01:26<28:28, 2.63s/it] {'loss': 1.3429, 'grad_norm': 67.94820404052734, 'learning_rate': 2.1739130434782607e-07, 'margin_dpo/margin_mean': 0.4500678479671478, 'margin_dpo/margin_std': 0.6665528416633606, 'logps/chosen': -63.70445251464844, 'logps/rejected': -73.24160766601562, 'logps/ref_chosen': -63.806922912597656, 'logps/ref_rejected': -72.89400482177734, 'logits/chosen': -0.5108931064605713, 'logits/rejected': -0.4666990637779236, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.450067937374115, 'margin_dpo/beta_margin_mean': 0.04500679671764374, 'margin_dpo/beta_margin_std': 0.06682661920785904, 'margin_dpo/beta_margin_grad_mean': -0.48876938223838806, 'margin_dpo/beta_margin_grad_std': 0.0166544821113348, 'epoch': 0.05} 5%|███▌ | 31/681 [01:26<28:28, 2.63s/it] 5%|███▋ | 32/681 [01:29<29:03, 2.69s/it] {'loss': 1.3154, 'grad_norm': 82.90047454833984, 'learning_rate': 2.2463768115942027e-07, 'margin_dpo/margin_mean': 0.7446720600128174, 'margin_dpo/margin_std': 0.9450139999389648, 'logps/chosen': -62.53711700439453, 'logps/rejected': -89.8597640991211, 'logps/ref_chosen': -62.739524841308594, 'logps/ref_rejected': -89.3175048828125, 'logits/chosen': -0.49858012795448303, 'logits/rejected': -0.45628952980041504, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.7446719408035278, 'margin_dpo/beta_margin_mean': 0.07446718961000443, 'margin_dpo/beta_margin_std': 0.09461291879415512, 
'margin_dpo/beta_margin_grad_mean': -0.48145589232444763, 'margin_dpo/beta_margin_grad_std': 0.023477083072066307, 'epoch': 0.05} 5%|███▋ | 32/681 [01:29<29:03, 2.69s/it] 5%|███▊ | 33/681 [01:32<29:14, 2.71s/it] {'loss': 1.3243, 'grad_norm': 72.11341857910156, 'learning_rate': 2.318840579710145e-07, 'margin_dpo/margin_mean': 0.6417955160140991, 'margin_dpo/margin_std': 0.6490182876586914, 'logps/chosen': -53.105873107910156, 'logps/rejected': -88.37184143066406, 'logps/ref_chosen': -53.26097106933594, 'logps/ref_rejected': -87.8851318359375, 'logits/chosen': -0.47633564472198486, 'logits/rejected': -0.4497436285018921, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.6417950391769409, 'margin_dpo/beta_margin_mean': 0.06417950242757797, 'margin_dpo/beta_margin_std': 0.06679090112447739, 'margin_dpo/beta_margin_grad_mean': -0.48398149013519287, 'margin_dpo/beta_margin_grad_std': 0.016650153324007988, 'epoch': 0.05} 5%|███▊ | 33/681 [01:32<29:14, 2.71s/it] 5%|███▉ | 34/681 [01:34<29:25, 2.73s/it] {'loss': 1.3068, 'grad_norm': 77.38883209228516, 'learning_rate': 2.391304347826087e-07, 'margin_dpo/margin_mean': 0.8307995796203613, 'margin_dpo/margin_std': 0.8540636301040649, 'logps/chosen': -50.72978210449219, 'logps/rejected': -102.66510009765625, 'logps/ref_chosen': -50.81732940673828, 'logps/ref_rejected': -101.92184448242188, 'logits/chosen': -0.5127777457237244, 'logits/rejected': -0.49532148241996765, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 0.8307997584342957, 'margin_dpo/beta_margin_mean': 0.0830799788236618, 'margin_dpo/beta_margin_std': 0.08640186488628387, 'margin_dpo/beta_margin_grad_mean': -0.4792894721031189, 'margin_dpo/beta_margin_grad_std': 0.021500185132026672, 'epoch': 0.05} 5%|███▉ | 34/681 [01:34<29:25, 2.73s/it] 5%|████ | 35/681 [01:37<29:31, 2.74s/it] {'loss': 1.2708, 'grad_norm': 82.41116333007812, 'learning_rate': 2.463768115942029e-07, 'margin_dpo/margin_mean': 1.2235569953918457, 
'margin_dpo/margin_std': 1.111976146697998, 'logps/chosen': -50.88545227050781, 'logps/rejected': -107.90895080566406, 'logps/ref_chosen': -51.02449035644531, 'logps/ref_rejected': -106.82443237304688, 'logits/chosen': -0.5374979972839355, 'logits/rejected': -0.5004309415817261, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 1.2235571146011353, 'margin_dpo/beta_margin_mean': 0.12235570698976517, 'margin_dpo/beta_margin_std': 0.11256185173988342, 'margin_dpo/beta_margin_grad_mean': -0.4696078896522522, 'margin_dpo/beta_margin_grad_std': 0.02748698741197586, 'epoch': 0.05} 5%|████ | 35/681 [01:37<29:31, 2.74s/it] 5%|████▏ | 36/681 [01:40<29:10, 2.71s/it] {'loss': 1.2813, 'grad_norm': 72.79762268066406, 'learning_rate': 2.536231884057971e-07, 'margin_dpo/margin_mean': 1.122597098350525, 'margin_dpo/margin_std': 1.2439404726028442, 'logps/chosen': -51.94648742675781, 'logps/rejected': -87.11822509765625, 'logps/ref_chosen': -51.991493225097656, 'logps/ref_rejected': -86.04061889648438, 'logits/chosen': -0.5538948774337769, 'logits/rejected': -0.517404317855835, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 1.1225981712341309, 'margin_dpo/beta_margin_mean': 0.11225982010364532, 'margin_dpo/beta_margin_std': 0.12831299006938934, 'margin_dpo/beta_margin_grad_mean': -0.47209563851356506, 'margin_dpo/beta_margin_grad_std': 0.03178109601140022, 'epoch': 0.05} 5%|████▏ | 36/681 [01:40<29:10, 2.71s/it] 5%|████▎ | 37/681 [01:42<28:56, 2.70s/it] {'loss': 1.2911, 'grad_norm': 61.13553237915039, 'learning_rate': 2.6086956521739126e-07, 'margin_dpo/margin_mean': 1.0293034315109253, 'margin_dpo/margin_std': 1.3807631731033325, 'logps/chosen': -62.78415298461914, 'logps/rejected': -78.90142059326172, 'logps/ref_chosen': -62.807106018066406, 'logps/ref_rejected': -77.89507293701172, 'logits/chosen': -0.5280976295471191, 'logits/rejected': -0.4858455955982208, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 
1.0293034315109253, 'margin_dpo/beta_margin_mean': 0.10293034464120865, 'margin_dpo/beta_margin_std': 0.14328184723854065, 'margin_dpo/beta_margin_grad_mean': -0.47450384497642517, 'margin_dpo/beta_margin_grad_std': 0.03514566645026207, 'epoch': 0.05} 5%|████▎ | 37/681 [01:42<28:56, 2.70s/it] 6%|████▍ | 38/681 [01:45<27:34, 2.57s/it] {'loss': 1.262, 'grad_norm': 70.00904083251953, 'learning_rate': 2.681159420289855e-07, 'margin_dpo/margin_mean': 1.3506265878677368, 'margin_dpo/margin_std': 1.575331449508667, 'logps/chosen': -48.24530792236328, 'logps/rejected': -99.11785888671875, 'logps/ref_chosen': -48.39051818847656, 'logps/ref_rejected': -97.91244506835938, 'logits/chosen': -0.5190426111221313, 'logits/rejected': -0.4862367510795593, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 1.3506262302398682, 'margin_dpo/beta_margin_mean': 0.13506263494491577, 'margin_dpo/beta_margin_std': 0.15932665765285492, 'margin_dpo/beta_margin_grad_mean': -0.4666314721107483, 'margin_dpo/beta_margin_grad_std': 0.03878392279148102, 'epoch': 0.06} 6%|████▍ | 38/681 [01:45<27:34, 2.57s/it] 6%|████▌ | 39/681 [01:47<27:19, 2.55s/it] {'loss': 1.2298, 'grad_norm': 74.47781372070312, 'learning_rate': 2.753623188405797e-07, 'margin_dpo/margin_mean': 1.6912682056427002, 'margin_dpo/margin_std': 1.4713746309280396, 'logps/chosen': -50.65707015991211, 'logps/rejected': -80.16737365722656, 'logps/ref_chosen': -50.75046920776367, 'logps/ref_rejected': -78.56951141357422, 'logits/chosen': -0.5537021160125732, 'logits/rejected': -0.5135682821273804, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 1.6912682056427002, 'margin_dpo/beta_margin_mean': 0.16912682354450226, 'margin_dpo/beta_margin_std': 0.14913904666900635, 'margin_dpo/beta_margin_grad_mean': -0.45806559920310974, 'margin_dpo/beta_margin_grad_std': 0.036758922040462494, 'epoch': 0.06} 6%|████▌ | 39/681 [01:47<27:19, 2.55s/it] 6%|████▋ | 40/681 [01:50<27:58, 2.62s/it] {'loss': 1.243, 
'grad_norm': 59.9489631652832, 'learning_rate': 2.8260869565217386e-07, 'margin_dpo/margin_mean': 1.5692870616912842, 'margin_dpo/margin_std': 1.697884202003479, 'logps/chosen': -57.77392578125, 'logps/rejected': -75.65821075439453, 'logps/ref_chosen': -57.985069274902344, 'logps/ref_rejected': -74.30007934570312, 'logits/chosen': -0.5245569348335266, 'logits/rejected': -0.4949991703033447, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 1.5692862272262573, 'margin_dpo/beta_margin_mean': 0.1569286286830902, 'margin_dpo/beta_margin_std': 0.1742551028728485, 'margin_dpo/beta_margin_grad_mean': -0.46128079295158386, 'margin_dpo/beta_margin_grad_std': 0.04237818345427513, 'epoch': 0.06} 6%|████▋ | 40/681 [01:50<27:58, 2.62s/it] 6%|████▊ | 41/681 [01:53<27:52, 2.61s/it] {'loss': 1.2195, 'grad_norm': 67.88613891601562, 'learning_rate': 2.898550724637681e-07, 'margin_dpo/margin_mean': 1.867814540863037, 'margin_dpo/margin_std': 2.0870983600616455, 'logps/chosen': -62.67747497558594, 'logps/rejected': -98.87300109863281, 'logps/ref_chosen': -62.69581604003906, 'logps/ref_rejected': -97.02352905273438, 'logits/chosen': -0.5592871308326721, 'logits/rejected': -0.5240367650985718, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 1.8678141832351685, 'margin_dpo/beta_margin_mean': 0.18678142130374908, 'margin_dpo/beta_margin_std': 0.21468721330165863, 'margin_dpo/beta_margin_grad_mean': -0.4541543424129486, 'margin_dpo/beta_margin_grad_std': 0.05179302766919136, 'epoch': 0.06} 6%|████▊ | 41/681 [01:53<27:52, 2.61s/it] 6%|████▊ | 42/681 [01:56<29:12, 2.74s/it] {'loss': 1.1578, 'grad_norm': 78.81612396240234, 'learning_rate': 2.971014492753623e-07, 'margin_dpo/margin_mean': 2.601499319076538, 'margin_dpo/margin_std': 2.445554733276367, 'logps/chosen': -58.707366943359375, 'logps/rejected': -112.25081634521484, 'logps/ref_chosen': -58.96642303466797, 'logps/ref_rejected': -109.90837097167969, 'logits/chosen': -0.5433309674263, 
'logits/rejected': -0.49680295586586, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 2.601499557495117, 'margin_dpo/beta_margin_mean': 0.2601499557495117, 'margin_dpo/beta_margin_std': 0.24821382761001587, 'margin_dpo/beta_margin_grad_mean': -0.4366336166858673, 'margin_dpo/beta_margin_grad_std': 0.058427974581718445, 'epoch': 0.06} 6%|████▊ | 42/681 [01:56<29:12, 2.74s/it] 6%|████▉ | 43/681 [01:58<28:59, 2.73s/it] {'loss': 1.1675, 'grad_norm': 72.23222351074219, 'learning_rate': 3.043478260869565e-07, 'margin_dpo/margin_mean': 2.4315857887268066, 'margin_dpo/margin_std': 1.964142918586731, 'logps/chosen': -53.65935516357422, 'logps/rejected': -98.41513061523438, 'logps/ref_chosen': -54.15599822998047, 'logps/ref_rejected': -96.48019409179688, 'logits/chosen': -0.5568352341651917, 'logits/rejected': -0.532639741897583, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 2.4315857887268066, 'margin_dpo/beta_margin_mean': 0.24315857887268066, 'margin_dpo/beta_margin_std': 0.19878432154655457, 'margin_dpo/beta_margin_grad_mean': -0.44025009870529175, 'margin_dpo/beta_margin_grad_std': 0.04758695140480995, 'epoch': 0.06} 6%|████▉ | 43/681 [01:58<28:59, 2.73s/it] 6%|█████ | 44/681 [02:01<29:58, 2.82s/it] {'loss': 1.1338, 'grad_norm': 78.49581909179688, 'learning_rate': 3.115942028985507e-07, 'margin_dpo/margin_mean': 2.852534532546997, 'margin_dpo/margin_std': 2.270460605621338, 'logps/chosen': -49.86518859863281, 'logps/rejected': -111.42298889160156, 'logps/ref_chosen': -50.07849884033203, 'logps/ref_rejected': -108.78376007080078, 'logits/chosen': -0.458575576543808, 'logits/rejected': -0.43896228075027466, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 2.8525354862213135, 'margin_dpo/beta_margin_mean': 0.2852535545825958, 'margin_dpo/beta_margin_std': 0.2277490794658661, 'margin_dpo/beta_margin_grad_mean': -0.43024110794067383, 'margin_dpo/beta_margin_grad_std': 0.0542747788131237, 'epoch': 0.06} 
6%|█████ | 44/681 [02:01<29:58, 2.82s/it] 7%|█████▏ | 45/681 [02:04<29:28, 2.78s/it] {'loss': 1.1805, 'grad_norm': 62.053192138671875, 'learning_rate': 3.188405797101449e-07, 'margin_dpo/margin_mean': 2.3724491596221924, 'margin_dpo/margin_std': 2.6500847339630127, 'logps/chosen': -48.24645233154297, 'logps/rejected': -80.1404037475586, 'logps/ref_chosen': -48.41493225097656, 'logps/ref_rejected': -77.93643188476562, 'logits/chosen': -0.4600446820259094, 'logits/rejected': -0.4469829797744751, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 2.3724491596221924, 'margin_dpo/beta_margin_mean': 0.23724493384361267, 'margin_dpo/beta_margin_std': 0.2693977653980255, 'margin_dpo/beta_margin_grad_mean': -0.4424096643924713, 'margin_dpo/beta_margin_grad_std': 0.06356598436832428, 'epoch': 0.07} 7%|█████▏ | 45/681 [02:04<29:28, 2.78s/it] 7%|█████▎ | 46/681 [02:07<29:52, 2.82s/it] {'loss': 1.1354, 'grad_norm': 69.27433013916016, 'learning_rate': 3.260869565217391e-07, 'margin_dpo/margin_mean': 2.9789419174194336, 'margin_dpo/margin_std': 3.244965076446533, 'logps/chosen': -55.80693435668945, 'logps/rejected': -98.43904113769531, 'logps/ref_chosen': -55.999427795410156, 'logps/ref_rejected': -95.652587890625, 'logits/chosen': -0.5094949007034302, 'logits/rejected': -0.45755523443222046, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 2.9789421558380127, 'margin_dpo/beta_margin_mean': 0.2978942394256592, 'margin_dpo/beta_margin_std': 0.3255438506603241, 'margin_dpo/beta_margin_grad_mean': -0.42856818437576294, 'margin_dpo/beta_margin_grad_std': 0.07470017671585083, 'epoch': 0.07} 7%|█████▎ | 46/681 [02:07<29:52, 2.82s/it] 7%|█████▍ | 47/681 [02:10<29:17, 2.77s/it] {'loss': 1.1271, 'grad_norm': 65.2599868774414, 'learning_rate': 3.333333333333333e-07, 'margin_dpo/margin_mean': 2.989128351211548, 'margin_dpo/margin_std': 2.6342062950134277, 'logps/chosen': -57.496604919433594, 'logps/rejected': -97.23886108398438, 'logps/ref_chosen': 
-57.92607879638672, 'logps/ref_rejected': -94.67920684814453, 'logits/chosen': -0.5813416242599487, 'logits/rejected': -0.5291002988815308, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 2.9891281127929688, 'margin_dpo/beta_margin_mean': 0.29891282320022583, 'margin_dpo/beta_margin_std': 0.26972696185112, 'margin_dpo/beta_margin_grad_mean': -0.42738690972328186, 'margin_dpo/beta_margin_grad_std': 0.0637550950050354, 'epoch': 0.07} 7%|█████▍ | 47/681 [02:10<29:17, 2.77s/it] 7%|█████▌ | 48/681 [02:12<29:22, 2.79s/it] {'loss': 1.1227, 'grad_norm': 73.67699432373047, 'learning_rate': 3.4057971014492755e-07, 'margin_dpo/margin_mean': 3.1348774433135986, 'margin_dpo/margin_std': 3.0109379291534424, 'logps/chosen': -57.117156982421875, 'logps/rejected': -91.08055877685547, 'logps/ref_chosen': -57.188072204589844, 'logps/ref_rejected': -88.0166015625, 'logits/chosen': -0.5998705625534058, 'logits/rejected': -0.5423353910446167, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 3.1348772048950195, 'margin_dpo/beta_margin_mean': 0.3134877383708954, 'margin_dpo/beta_margin_std': 0.32677435874938965, 'margin_dpo/beta_margin_grad_mean': -0.4244069755077362, 'margin_dpo/beta_margin_grad_std': 0.07627448439598083, 'epoch': 0.07} 7%|█████▌ | 48/681 [02:12<29:22, 2.79s/it] 7%|█████▋ | 49/681 [02:15<28:50, 2.74s/it] {'loss': 1.0774, 'grad_norm': 61.355953216552734, 'learning_rate': 3.478260869565217e-07, 'margin_dpo/margin_mean': 3.8097658157348633, 'margin_dpo/margin_std': 3.869323253631592, 'logps/chosen': -61.36932373046875, 'logps/rejected': -87.26129913330078, 'logps/ref_chosen': -61.685264587402344, 'logps/ref_rejected': -83.76747131347656, 'logits/chosen': -0.5448025465011597, 'logits/rejected': -0.4857603907585144, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 3.8097660541534424, 'margin_dpo/beta_margin_mean': 0.3809766173362732, 'margin_dpo/beta_margin_std': 0.3965732753276825, 
'margin_dpo/beta_margin_grad_mean': -0.41020649671554565, 'margin_dpo/beta_margin_grad_std': 0.08793335407972336, 'epoch': 0.07} 7%|█████▋ | 49/681 [02:15<28:50, 2.74s/it] 7%|█████▊ | 50/681 [02:18<28:38, 2.72s/it] {'loss': 1.0518, 'grad_norm': 62.80997085571289, 'learning_rate': 3.5507246376811595e-07, 'margin_dpo/margin_mean': 4.163365364074707, 'margin_dpo/margin_std': 4.094795227050781, 'logps/chosen': -58.89775848388672, 'logps/rejected': -100.69513702392578, 'logps/ref_chosen': -58.72413635253906, 'logps/ref_rejected': -96.35814666748047, 'logits/chosen': -0.5425491333007812, 'logits/rejected': -0.5065620541572571, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 4.163364887237549, 'margin_dpo/beta_margin_mean': 0.4163365066051483, 'margin_dpo/beta_margin_std': 0.4100196361541748, 'margin_dpo/beta_margin_grad_mean': -0.40193936228752136, 'margin_dpo/beta_margin_grad_std': 0.09261800348758698, 'epoch': 0.07} 7%|█████▊ | 50/681 [02:18<28:38, 2.72s/it] 7%|█████▉ | 51/681 [02:21<29:04, 2.77s/it] {'loss': 1.085, 'grad_norm': 52.91781234741211, 'learning_rate': 3.6231884057971015e-07, 'margin_dpo/margin_mean': 4.017845153808594, 'margin_dpo/margin_std': 5.1221513748168945, 'logps/chosen': -61.69359588623047, 'logps/rejected': -80.33977508544922, 'logps/ref_chosen': -61.3736686706543, 'logps/ref_rejected': -76.00199890136719, 'logits/chosen': -0.5184497833251953, 'logits/rejected': -0.4852331280708313, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 4.017845153808594, 'margin_dpo/beta_margin_mean': 0.4017845094203949, 'margin_dpo/beta_margin_std': 0.5204705595970154, 'margin_dpo/beta_margin_grad_mean': -0.40868327021598816, 'margin_dpo/beta_margin_grad_std': 0.1110108494758606, 'epoch': 0.07} 7%|█████▉ | 51/681 [02:21<29:04, 2.77s/it] 8%|██████ | 52/681 [02:24<29:32, 2.82s/it] {'loss': 0.9189, 'grad_norm': 58.923404693603516, 'learning_rate': 3.695652173913043e-07, 'margin_dpo/margin_mean': 6.196599006652832, 
'margin_dpo/margin_std': 5.190753936767578, 'logps/chosen': -51.979454040527344, 'logps/rejected': -85.81260681152344, 'logps/ref_chosen': -52.33735656738281, 'logps/ref_rejected': -79.97391510009766, 'logits/chosen': -0.5524120330810547, 'logits/rejected': -0.496574342250824, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.196599006652832, 'margin_dpo/beta_margin_mean': 0.6196599006652832, 'margin_dpo/beta_margin_std': 0.5214123129844666, 'margin_dpo/beta_margin_grad_mean': -0.3595433533191681, 'margin_dpo/beta_margin_grad_std': 0.10714302211999893, 'epoch': 0.08} 8%|██████ | 52/681 [02:24<29:32, 2.82s/it] 8%|██████▏ | 53/681 [02:26<28:48, 2.75s/it] {'loss': 0.9446, 'grad_norm': 58.20880889892578, 'learning_rate': 3.7681159420289855e-07, 'margin_dpo/margin_mean': 6.325778484344482, 'margin_dpo/margin_std': 6.248142242431641, 'logps/chosen': -53.506500244140625, 'logps/rejected': -98.30122375488281, 'logps/ref_chosen': -53.31465530395508, 'logps/ref_rejected': -91.7835922241211, 'logits/chosen': -0.6073682904243469, 'logits/rejected': -0.5856744050979614, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.325778961181641, 'margin_dpo/beta_margin_mean': 0.6325778961181641, 'margin_dpo/beta_margin_std': 0.6903671622276306, 'margin_dpo/beta_margin_grad_mean': -0.36480841040611267, 'margin_dpo/beta_margin_grad_std': 0.12540318071842194, 'epoch': 0.08} 8%|██████▏ | 53/681 [02:26<28:48, 2.75s/it] 8%|██████▎ | 54/681 [02:29<27:39, 2.65s/it] {'loss': 0.9783, 'grad_norm': 59.29412841796875, 'learning_rate': 3.8405797101449274e-07, 'margin_dpo/margin_mean': 5.348155498504639, 'margin_dpo/margin_std': 5.086174488067627, 'logps/chosen': -51.13933563232422, 'logps/rejected': -97.51422119140625, 'logps/ref_chosen': -50.68865966796875, 'logps/ref_rejected': -91.71539306640625, 'logits/chosen': -0.633226752281189, 'logits/rejected': -0.5815136432647705, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 
5.348156452178955, 'margin_dpo/beta_margin_mean': 0.5348156690597534, 'margin_dpo/beta_margin_std': 0.5101956725120544, 'margin_dpo/beta_margin_grad_mean': -0.3783862590789795, 'margin_dpo/beta_margin_grad_std': 0.10563214868307114, 'epoch': 0.08} 8%|██████▎ | 54/681 [02:29<27:39, 2.65s/it] 8%|██████▍ | 55/681 [02:31<26:17, 2.52s/it] {'loss': 0.9548, 'grad_norm': 53.738956451416016, 'learning_rate': 3.9130434782608694e-07, 'margin_dpo/margin_mean': 6.541542053222656, 'margin_dpo/margin_std': 7.533283233642578, 'logps/chosen': -63.57060241699219, 'logps/rejected': -96.49041748046875, 'logps/ref_chosen': -62.615234375, 'logps/ref_rejected': -88.99349975585938, 'logits/chosen': -0.6361401081085205, 'logits/rejected': -0.5729630589485168, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.541542053222656, 'margin_dpo/beta_margin_mean': 0.6541542410850525, 'margin_dpo/beta_margin_std': 0.7597689032554626, 'margin_dpo/beta_margin_grad_mean': -0.361075222492218, 'margin_dpo/beta_margin_grad_std': 0.14729972183704376, 'epoch': 0.08} 8%|██████▍ | 55/681 [02:31<26:17, 2.52s/it] 8%|██████▍ | 56/681 [02:34<27:07, 2.60s/it] {'loss': 0.9775, 'grad_norm': 48.09397506713867, 'learning_rate': 3.9855072463768114e-07, 'margin_dpo/margin_mean': 6.195199012756348, 'margin_dpo/margin_std': 7.399816989898682, 'logps/chosen': -58.66962432861328, 'logps/rejected': -101.10653686523438, 'logps/ref_chosen': -57.93273162841797, 'logps/ref_rejected': -94.1744384765625, 'logits/chosen': -0.5945051908493042, 'logits/rejected': -0.5514425039291382, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.195199966430664, 'margin_dpo/beta_margin_mean': 0.6195200085639954, 'margin_dpo/beta_margin_std': 0.7477858066558838, 'margin_dpo/beta_margin_grad_mean': -0.36780592799186707, 'margin_dpo/beta_margin_grad_std': 0.14850637316703796, 'epoch': 0.08} 8%|██████▍ | 56/681 [02:34<27:07, 2.60s/it] 8%|██████▌ | 57/681 [02:36<26:48, 2.58s/it] {'loss': 0.9078, 
'grad_norm': 54.234169006347656, 'learning_rate': 4.057971014492754e-07, 'margin_dpo/margin_mean': 6.902284145355225, 'margin_dpo/margin_std': 6.639451026916504, 'logps/chosen': -71.26276397705078, 'logps/rejected': -103.23522186279297, 'logps/ref_chosen': -70.49528503417969, 'logps/ref_rejected': -95.56546020507812, 'logits/chosen': -0.5641357898712158, 'logits/rejected': -0.5353480577468872, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.902284145355225, 'margin_dpo/beta_margin_mean': 0.6902284026145935, 'margin_dpo/beta_margin_std': 0.6726579070091248, 'margin_dpo/beta_margin_grad_mean': -0.34864187240600586, 'margin_dpo/beta_margin_grad_std': 0.13589681684970856, 'epoch': 0.08} 8%|██████▌ | 57/681 [02:36<26:48, 2.58s/it] 9%|██████▋ | 58/681 [02:39<26:59, 2.60s/it] {'loss': 0.8977, 'grad_norm': 59.243927001953125, 'learning_rate': 4.1304347826086954e-07, 'margin_dpo/margin_mean': 7.606607437133789, 'margin_dpo/margin_std': 8.09335708618164, 'logps/chosen': -63.23316955566406, 'logps/rejected': -93.32413482666016, 'logps/ref_chosen': -62.13294219970703, 'logps/ref_rejected': -84.61729431152344, 'logits/chosen': -0.5894064903259277, 'logits/rejected': -0.5127171874046326, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 7.606607437133789, 'margin_dpo/beta_margin_mean': 0.7606607675552368, 'margin_dpo/beta_margin_std': 0.8165130615234375, 'margin_dpo/beta_margin_grad_mean': -0.3427189290523529, 'margin_dpo/beta_margin_grad_std': 0.15285082161426544, 'epoch': 0.09} 9%|██████▋ | 58/681 [02:39<26:59, 2.60s/it] 9%|██████▊ | 59/681 [02:41<26:56, 2.60s/it] {'loss': 0.8575, 'grad_norm': 55.42934799194336, 'learning_rate': 4.2028985507246374e-07, 'margin_dpo/margin_mean': 8.485508918762207, 'margin_dpo/margin_std': 8.604471206665039, 'logps/chosen': -53.42650604248047, 'logps/rejected': -98.86468505859375, 'logps/ref_chosen': -51.932525634765625, 'logps/ref_rejected': -88.88520050048828, 'logits/chosen': -0.6423487663269043, 
'logits/rejected': -0.6032625436782837, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 8.485508918762207, 'margin_dpo/beta_margin_mean': 0.8485509157180786, 'margin_dpo/beta_margin_std': 0.8816735148429871, 'margin_dpo/beta_margin_grad_mean': -0.32869505882263184, 'margin_dpo/beta_margin_grad_std': 0.15742561221122742, 'epoch': 0.09} 9%|██████▊ | 59/681 [02:41<26:56, 2.60s/it] 9%|██████▉ | 60/681 [02:44<26:43, 2.58s/it] {'loss': 0.9555, 'grad_norm': 64.29039764404297, 'learning_rate': 4.2753623188405794e-07, 'margin_dpo/margin_mean': 6.686439514160156, 'margin_dpo/margin_std': 7.678452968597412, 'logps/chosen': -63.62670135498047, 'logps/rejected': -94.76435089111328, 'logps/ref_chosen': -60.94218444824219, 'logps/ref_rejected': -85.39340209960938, 'logits/chosen': -0.6296500563621521, 'logits/rejected': -0.5711052417755127, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.686439037322998, 'margin_dpo/beta_margin_mean': 0.6686438918113708, 'margin_dpo/beta_margin_std': 0.7756204009056091, 'margin_dpo/beta_margin_grad_mean': -0.3545895218849182, 'margin_dpo/beta_margin_grad_std': 0.15808549523353577, 'epoch': 0.09} 9%|██████▉ | 60/681 [02:44<26:43, 2.58s/it] 9%|███████ | 61/681 [02:47<26:48, 2.59s/it] {'loss': 0.9341, 'grad_norm': 54.964107513427734, 'learning_rate': 4.3478260869565214e-07, 'margin_dpo/margin_mean': 8.251806259155273, 'margin_dpo/margin_std': 11.240764617919922, 'logps/chosen': -62.14350128173828, 'logps/rejected': -99.61428833007812, 'logps/ref_chosen': -60.633522033691406, 'logps/ref_rejected': -89.85249328613281, 'logits/chosen': -0.6372621655464172, 'logits/rejected': -0.6041065454483032, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 8.251806259155273, 'margin_dpo/beta_margin_mean': 0.8251805901527405, 'margin_dpo/beta_margin_std': 1.1422574520111084, 'margin_dpo/beta_margin_grad_mean': -0.34661665558815, 'margin_dpo/beta_margin_grad_std': 0.1781352162361145, 'epoch': 0.09} 
9%|███████ | 61/681 [02:47<26:48, 2.59s/it] 9%|███████▏ | 62/681 [02:50<28:06, 2.72s/it] {'loss': 0.9993, 'grad_norm': 58.057708740234375, 'learning_rate': 4.420289855072464e-07, 'margin_dpo/margin_mean': 6.19963264465332, 'margin_dpo/margin_std': 8.127958297729492, 'logps/chosen': -57.778465270996094, 'logps/rejected': -83.39352416992188, 'logps/ref_chosen': -56.15077209472656, 'logps/ref_rejected': -75.56619262695312, 'logits/chosen': -0.6090478897094727, 'logits/rejected': -0.5749986171722412, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 6.19963264465332, 'margin_dpo/beta_margin_mean': 0.6199632883071899, 'margin_dpo/beta_margin_std': 0.8312649130821228, 'margin_dpo/beta_margin_grad_mean': -0.37108179926872253, 'margin_dpo/beta_margin_grad_std': 0.15791070461273193, 'epoch': 0.09} 9%|███████▏ | 62/681 [02:50<28:06, 2.72s/it] 9%|███████▎ | 63/681 [02:52<28:01, 2.72s/it] {'loss': 0.8773, 'grad_norm': 56.769561767578125, 'learning_rate': 4.4927536231884053e-07, 'margin_dpo/margin_mean': 8.366212844848633, 'margin_dpo/margin_std': 8.857807159423828, 'logps/chosen': -75.79495239257812, 'logps/rejected': -108.62382507324219, 'logps/ref_chosen': -73.14739227294922, 'logps/ref_rejected': -97.61006164550781, 'logits/chosen': -0.5860311388969421, 'logits/rejected': -0.5402973890304565, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 8.366211891174316, 'margin_dpo/beta_margin_mean': 0.8366211652755737, 'margin_dpo/beta_margin_std': 0.9040850400924683, 'margin_dpo/beta_margin_grad_mean': -0.3284067213535309, 'margin_dpo/beta_margin_grad_std': 0.16741114854812622, 'epoch': 0.09} 9%|███████▎ | 63/681 [02:52<28:01, 2.72s/it] 9%|███████▍ | 64/681 [02:55<28:08, 2.74s/it] {'loss': 0.8493, 'grad_norm': 52.091590881347656, 'learning_rate': 4.5652173913043473e-07, 'margin_dpo/margin_mean': 9.82172966003418, 'margin_dpo/margin_std': 11.043643951416016, 'logps/chosen': -55.00431823730469, 'logps/rejected': -104.35765075683594, 
'logps/ref_chosen': -53.99859619140625, 'logps/ref_rejected': -93.53020477294922, 'logits/chosen': -0.5791685581207275, 'logits/rejected': -0.5466402769088745, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 9.821728706359863, 'margin_dpo/beta_margin_mean': 0.9821729063987732, 'margin_dpo/beta_margin_std': 1.1361504793167114, 'margin_dpo/beta_margin_grad_mean': -0.3144356906414032, 'margin_dpo/beta_margin_grad_std': 0.1805381327867508, 'epoch': 0.09} 9%|███████▍ | 64/681 [02:55<28:08, 2.74s/it] 10%|███████▌ | 65/681 [02:58<28:29, 2.78s/it] {'loss': 0.8585, 'grad_norm': 54.09811782836914, 'learning_rate': 4.63768115942029e-07, 'margin_dpo/margin_mean': 9.843679428100586, 'margin_dpo/margin_std': 10.951974868774414, 'logps/chosen': -68.0100326538086, 'logps/rejected': -122.96417236328125, 'logps/ref_chosen': -64.83599853515625, 'logps/ref_rejected': -109.94645690917969, 'logits/chosen': -0.6608457565307617, 'logits/rejected': -0.6478947401046753, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 9.843679428100586, 'margin_dpo/beta_margin_mean': 0.9843679666519165, 'margin_dpo/beta_margin_std': 1.1074903011322021, 'margin_dpo/beta_margin_grad_mean': -0.31038472056388855, 'margin_dpo/beta_margin_grad_std': 0.18928615748882294, 'epoch': 0.1} 10%|███████▌ | 65/681 [02:58<28:29, 2.78s/it] 10%|███████▋ | 66/681 [03:01<28:07, 2.74s/it] {'loss': 0.8859, 'grad_norm': 52.60911560058594, 'learning_rate': 4.7101449275362313e-07, 'margin_dpo/margin_mean': 8.99482536315918, 'margin_dpo/margin_std': 10.87942123413086, 'logps/chosen': -54.36174011230469, 'logps/rejected': -87.54934692382812, 'logps/ref_chosen': -51.44352722167969, 'logps/ref_rejected': -75.63629150390625, 'logits/chosen': -0.6474887132644653, 'logits/rejected': -0.6150294542312622, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 8.99482536315918, 'margin_dpo/beta_margin_mean': 0.8994826078414917, 'margin_dpo/beta_margin_std': 1.1073994636535645, 
'margin_dpo/beta_margin_grad_mean': -0.3307109475135803, 'margin_dpo/beta_margin_grad_std': 0.1775081753730774, 'epoch': 0.1} 10%|███████▋ | 66/681 [03:01<28:07, 2.74s/it] 10%|███████▊ | 67/681 [03:03<26:38, 2.60s/it] {'loss': 0.8693, 'grad_norm': 52.46964645385742, 'learning_rate': 4.782608695652174e-07, 'margin_dpo/margin_mean': 9.277162551879883, 'margin_dpo/margin_std': 10.92019271850586, 'logps/chosen': -61.81807327270508, 'logps/rejected': -84.54171752929688, 'logps/ref_chosen': -59.34080505371094, 'logps/ref_rejected': -72.78729248046875, 'logits/chosen': -0.5966418981552124, 'logits/rejected': -0.5537301301956177, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 9.2771635055542, 'margin_dpo/beta_margin_mean': 0.9277163147926331, 'margin_dpo/beta_margin_std': 1.1012858152389526, 'margin_dpo/beta_margin_grad_mean': -0.32543134689331055, 'margin_dpo/beta_margin_grad_std': 0.17773009836673737, 'epoch': 0.1} 10%|███████▊ | 67/681 [03:03<26:38, 2.60s/it] 10%|███████▉ | 68/681 [03:05<26:27, 2.59s/it] {'loss': 0.8459, 'grad_norm': 52.40779113769531, 'learning_rate': 4.855072463768116e-07, 'margin_dpo/margin_mean': 8.720624923706055, 'margin_dpo/margin_std': 8.963220596313477, 'logps/chosen': -67.98988342285156, 'logps/rejected': -88.71192932128906, 'logps/ref_chosen': -65.2058334350586, 'logps/ref_rejected': -77.20724487304688, 'logits/chosen': -0.6349166631698608, 'logits/rejected': -0.5751150250434875, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 8.720624923706055, 'margin_dpo/beta_margin_mean': 0.8720625042915344, 'margin_dpo/beta_margin_std': 0.9045540690422058, 'margin_dpo/beta_margin_grad_mean': -0.3253341615200043, 'margin_dpo/beta_margin_grad_std': 0.15630275011062622, 'epoch': 0.1} 10%|███████▉ | 68/681 [03:05<26:27, 2.59s/it] 10%|████████ | 69/681 [03:08<27:08, 2.66s/it] {'loss': 0.7777, 'grad_norm': 53.23897933959961, 'learning_rate': 4.927536231884058e-07, 'margin_dpo/margin_mean': 10.385248184204102, 
'margin_dpo/margin_std': 10.297136306762695, 'logps/chosen': -62.99334716796875, 'logps/rejected': -116.94822692871094, 'logps/ref_chosen': -59.81924057006836, 'logps/ref_rejected': -103.38886260986328, 'logits/chosen': -0.6085792183876038, 'logits/rejected': -0.5847188234329224, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 10.385248184204102, 'margin_dpo/beta_margin_mean': 1.0385247468948364, 'margin_dpo/beta_margin_std': 1.040853500366211, 'margin_dpo/beta_margin_grad_mean': -0.30277135968208313, 'margin_dpo/beta_margin_grad_std': 0.16077764332294464, 'epoch': 0.1} 10%|████████ | 69/681 [03:08<27:08, 2.66s/it] 10%|████████ | 70/681 [03:11<26:28, 2.60s/it] {'loss': 0.7928, 'grad_norm': 59.40316390991211, 'learning_rate': 5e-07, 'margin_dpo/margin_mean': 11.182600975036621, 'margin_dpo/margin_std': 11.917827606201172, 'logps/chosen': -66.4103012084961, 'logps/rejected': -106.7230453491211, 'logps/ref_chosen': -61.930641174316406, 'logps/ref_rejected': -91.060791015625, 'logits/chosen': -0.625554621219635, 'logits/rejected': -0.5908818244934082, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 11.182600975036621, 'margin_dpo/beta_margin_mean': 1.118260145187378, 'margin_dpo/beta_margin_std': 1.199088215827942, 'margin_dpo/beta_margin_grad_mean': -0.30000537633895874, 'margin_dpo/beta_margin_grad_std': 0.18498124182224274, 'epoch': 0.1} 10%|████████ | 70/681 [03:11<26:28, 2.60s/it] 10%|████████▏ | 71/681 [03:13<26:21, 2.59s/it] {'loss': 0.702, 'grad_norm': 49.68572998046875, 'learning_rate': 4.999967061337492e-07, 'margin_dpo/margin_mean': 12.864418029785156, 'margin_dpo/margin_std': 12.424565315246582, 'logps/chosen': -65.69276428222656, 'logps/rejected': -114.14346313476562, 'logps/ref_chosen': -61.750343322753906, 'logps/ref_rejected': -97.33662414550781, 'logits/chosen': -0.6752599477767944, 'logits/rejected': -0.6361984014511108, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 
12.86441707611084, 'margin_dpo/beta_margin_mean': 1.2864418029785156, 'margin_dpo/beta_margin_std': 1.286440372467041, 'margin_dpo/beta_margin_grad_mean': -0.27304375171661377, 'margin_dpo/beta_margin_grad_std': 0.16720205545425415, 'epoch': 0.1} 10%|████████▏ | 71/681 [03:13<26:21, 2.59s/it] 11%|████████▎ | 72/681 [03:16<26:40, 2.63s/it] {'loss': 0.7297, 'grad_norm': 59.574241638183594, 'learning_rate': 4.999868246217933e-07, 'margin_dpo/margin_mean': 13.383831024169922, 'margin_dpo/margin_std': 13.636287689208984, 'logps/chosen': -70.28240966796875, 'logps/rejected': -112.89981079101562, 'logps/ref_chosen': -66.05341339111328, 'logps/ref_rejected': -95.2869873046875, 'logits/chosen': -0.6442112922668457, 'logits/rejected': -0.6080772280693054, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 13.383831024169922, 'margin_dpo/beta_margin_mean': 1.3383830785751343, 'margin_dpo/beta_margin_std': 1.3651797771453857, 'margin_dpo/beta_margin_grad_mean': -0.2696942389011383, 'margin_dpo/beta_margin_grad_std': 0.19517795741558075, 'epoch': 0.11} 11%|████████▎ | 72/681 [03:16<26:40, 2.63s/it] 11%|████████▍ | 73/681 [03:19<27:07, 2.68s/it] {'loss': 0.9513, 'grad_norm': 76.11861419677734, 'learning_rate': 4.999703557245192e-07, 'margin_dpo/margin_mean': 13.051246643066406, 'margin_dpo/margin_std': 18.630680084228516, 'logps/chosen': -72.03385162353516, 'logps/rejected': -109.28495788574219, 'logps/ref_chosen': -66.25627136230469, 'logps/ref_rejected': -90.45613861083984, 'logits/chosen': -0.6918191909790039, 'logits/rejected': -0.6510320901870728, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 13.051246643066406, 'margin_dpo/beta_margin_mean': 1.3051246404647827, 'margin_dpo/beta_margin_std': 1.8701282739639282, 'margin_dpo/beta_margin_grad_mean': -0.31014135479927063, 'margin_dpo/beta_margin_grad_std': 0.24917910993099213, 'epoch': 0.11} 11%|████████▍ | 73/681 [03:19<27:07, 2.68s/it] 11%|████████▌ | 74/681 [03:21<26:36, 2.63s/it] 
{'loss': 0.8775, 'grad_norm': 71.0533676147461, 'learning_rate': 4.999472998758977e-07, 'margin_dpo/margin_mean': 13.770397186279297, 'margin_dpo/margin_std': 20.299190521240234, 'logps/chosen': -59.54771423339844, 'logps/rejected': -115.84016418457031, 'logps/ref_chosen': -53.42488098144531, 'logps/ref_rejected': -95.94693756103516, 'logits/chosen': -0.6222573518753052, 'logits/rejected': -0.6104036569595337, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 13.770398139953613, 'margin_dpo/beta_margin_mean': 1.3770397901535034, 'margin_dpo/beta_margin_std': 2.0495500564575195, 'margin_dpo/beta_margin_grad_mean': -0.28980112075805664, 'margin_dpo/beta_margin_grad_std': 0.22041486203670502, 'epoch': 0.11} 11%|████████▌ | 74/681 [03:21<26:36, 2.63s/it] 11%|████████▋ | 75/681 [03:24<26:50, 2.66s/it] {'loss': 0.6084, 'grad_norm': 50.546207427978516, 'learning_rate': 4.999176576834721e-07, 'margin_dpo/margin_mean': 19.06302833557129, 'margin_dpo/margin_std': 18.35777473449707, 'logps/chosen': -57.421756744384766, 'logps/rejected': -135.87710571289062, 'logps/ref_chosen': -51.861663818359375, 'logps/ref_rejected': -111.25397491455078, 'logits/chosen': -0.6528257131576538, 'logits/rejected': -0.6429094672203064, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 19.063030242919922, 'margin_dpo/beta_margin_mean': 1.90630304813385, 'margin_dpo/beta_margin_std': 1.8703465461730957, 'margin_dpo/beta_margin_grad_mean': -0.22819584608078003, 'margin_dpo/beta_margin_grad_std': 0.2010163813829422, 'epoch': 0.11} 11%|████████▋ | 75/681 [03:24<26:50, 2.66s/it] 11%|████████▊ | 76/681 [03:27<26:26, 2.62s/it] {'loss': 0.8122, 'grad_norm': 63.239871978759766, 'learning_rate': 4.998814299283415e-07, 'margin_dpo/margin_mean': 12.300118446350098, 'margin_dpo/margin_std': 14.157339096069336, 'logps/chosen': -59.91857147216797, 'logps/rejected': -97.16926574707031, 'logps/ref_chosen': -53.26604080200195, 'logps/ref_rejected': -78.21662139892578, 
'logits/chosen': -0.7003756165504456, 'logits/rejected': -0.6578394770622253, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 12.300118446350098, 'margin_dpo/beta_margin_mean': 1.2300118207931519, 'margin_dpo/beta_margin_std': 1.421257495880127, 'margin_dpo/beta_margin_grad_mean': -0.28062668442726135, 'margin_dpo/beta_margin_grad_std': 0.20105010271072388, 'epoch': 0.11} 11%|████████▊ | 76/681 [03:27<26:26, 2.62s/it] 11%|████████▉ | 77/681 [03:29<25:32, 2.54s/it] {'loss': 0.6829, 'grad_norm': 78.45389556884766, 'learning_rate': 4.998386175651409e-07, 'margin_dpo/margin_mean': 19.283706665039062, 'margin_dpo/margin_std': 19.11894989013672, 'logps/chosen': -63.619422912597656, 'logps/rejected': -118.58006286621094, 'logps/ref_chosen': -58.0966796875, 'logps/ref_rejected': -93.77361297607422, 'logits/chosen': -0.6659625768661499, 'logits/rejected': -0.6236972212791443, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 19.283708572387695, 'margin_dpo/beta_margin_mean': 1.9283708333969116, 'margin_dpo/beta_margin_std': 1.9269988536834717, 'margin_dpo/beta_margin_grad_mean': -0.2217966765165329, 'margin_dpo/beta_margin_grad_std': 0.22179701924324036, 'epoch': 0.11} 11%|████████▉ | 77/681 [03:29<25:32, 2.54s/it] 11%|█████████ | 78/681 [03:32<25:56, 2.58s/it] {'loss': 0.7296, 'grad_norm': 66.56047058105469, 'learning_rate': 4.997892217220159e-07, 'margin_dpo/margin_mean': 14.69200325012207, 'margin_dpo/margin_std': 15.322187423706055, 'logps/chosen': -60.89007568359375, 'logps/rejected': -104.90266418457031, 'logps/ref_chosen': -55.61378479003906, 'logps/ref_rejected': -84.93436431884766, 'logits/chosen': -0.6366250514984131, 'logits/rejected': -0.6083469986915588, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 14.692002296447754, 'margin_dpo/beta_margin_mean': 1.4692002534866333, 'margin_dpo/beta_margin_std': 1.5563766956329346, 'margin_dpo/beta_margin_grad_mean': -0.2656141221523285, 
'margin_dpo/beta_margin_grad_std': 0.21040529012680054, 'epoch': 0.11} 11%|█████████ | 78/681 [03:32<25:56, 2.58s/it] 12%|█████████▏ | 79/681 [03:34<26:22, 2.63s/it] {'loss': 0.7766, 'grad_norm': 59.296844482421875, 'learning_rate': 4.997332437005931e-07, 'margin_dpo/margin_mean': 16.086679458618164, 'margin_dpo/margin_std': 18.848827362060547, 'logps/chosen': -60.498695373535156, 'logps/rejected': -108.78245544433594, 'logps/ref_chosen': -55.45048522949219, 'logps/ref_rejected': -87.64756774902344, 'logits/chosen': -0.6760110855102539, 'logits/rejected': -0.6464430093765259, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 16.086679458618164, 'margin_dpo/beta_margin_mean': 1.6086679697036743, 'margin_dpo/beta_margin_std': 1.9045932292938232, 'margin_dpo/beta_margin_grad_mean': -0.2785935699939728, 'margin_dpo/beta_margin_grad_std': 0.22844330966472626, 'epoch': 0.12} 12%|█████████▏ | 79/681 [03:34<26:22, 2.63s/it] 12%|█████████▎ | 80/681 [03:37<26:14, 2.62s/it] {'loss': 0.8355, 'grad_norm': 63.66164016723633, 'learning_rate': 4.996706849759452e-07, 'margin_dpo/margin_mean': 14.238598823547363, 'margin_dpo/margin_std': 17.483436584472656, 'logps/chosen': -65.51264190673828, 'logps/rejected': -108.77944946289062, 'logps/ref_chosen': -58.519290924072266, 'logps/ref_rejected': -87.54750061035156, 'logits/chosen': -0.7178832292556763, 'logits/rejected': -0.6710443496704102, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 14.238598823547363, 'margin_dpo/beta_margin_mean': 1.42385995388031, 'margin_dpo/beta_margin_std': 1.8568590879440308, 'margin_dpo/beta_margin_grad_mean': -0.293730765581131, 'margin_dpo/beta_margin_grad_std': 0.22851316630840302, 'epoch': 0.12} 12%|█████████▎ | 80/681 [03:37<26:14, 2.62s/it] 12%|█████████▍ | 81/681 [03:40<27:24, 2.74s/it] {'loss': 0.6904, 'grad_norm': 72.20431518554688, 'learning_rate': 4.996015471965529e-07, 'margin_dpo/margin_mean': 18.687902450561523, 'margin_dpo/margin_std': 
20.542957305908203, 'logps/chosen': -72.02912902832031, 'logps/rejected': -153.9308624267578, 'logps/ref_chosen': -66.44886779785156, 'logps/ref_rejected': -129.66270446777344, 'logits/chosen': -0.7084971070289612, 'logits/rejected': -0.6748213171958923, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 18.687902450561523, 'margin_dpo/beta_margin_mean': 1.8687902688980103, 'margin_dpo/beta_margin_std': 2.0691728591918945, 'margin_dpo/beta_margin_grad_mean': -0.24315199255943298, 'margin_dpo/beta_margin_grad_std': 0.22722284495830536, 'epoch': 0.12} 12%|█████████▍ | 81/681 [03:40<27:24, 2.74s/it] 12%|█████████▌ | 82/681 [03:42<26:35, 2.66s/it] {'loss': 0.9632, 'grad_norm': 87.32213592529297, 'learning_rate': 4.995258321842611e-07, 'margin_dpo/margin_mean': 15.053365707397461, 'margin_dpo/margin_std': 21.363815307617188, 'logps/chosen': -59.366302490234375, 'logps/rejected': -112.9305419921875, 'logps/ref_chosen': -52.232383728027344, 'logps/ref_rejected': -90.74325561523438, 'logits/chosen': -0.6286877393722534, 'logits/rejected': -0.6112765073776245, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 15.053364753723145, 'margin_dpo/beta_margin_mean': 1.5053365230560303, 'margin_dpo/beta_margin_std': 2.1997368335723877, 'margin_dpo/beta_margin_grad_mean': -0.2833250164985657, 'margin_dpo/beta_margin_grad_std': 0.2522350549697876, 'epoch': 0.12} 12%|█████████▌ | 82/681 [03:42<26:35, 2.66s/it] 12%|█████████▋ | 83/681 [03:45<25:49, 2.59s/it] {'loss': 0.7422, 'grad_norm': 67.5205307006836, 'learning_rate': 4.994435419342304e-07, 'margin_dpo/margin_mean': 16.844711303710938, 'margin_dpo/margin_std': 18.56102752685547, 'logps/chosen': -62.771873474121094, 'logps/rejected': -127.50509643554688, 'logps/ref_chosen': -55.82738494873047, 'logps/ref_rejected': -103.71590423583984, 'logits/chosen': -0.6808498501777649, 'logits/rejected': -0.6353092193603516, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 
16.844711303710938, 'margin_dpo/beta_margin_mean': 1.6844712495803833, 'margin_dpo/beta_margin_std': 1.8570791482925415, 'margin_dpo/beta_margin_grad_mean': -0.2590833604335785, 'margin_dpo/beta_margin_grad_std': 0.22852419316768646, 'epoch': 0.12} 12%|█████████▋ | 83/681 [03:45<25:49, 2.59s/it] 12%|█████████▋ | 84/681 [03:48<26:19, 2.65s/it] {'loss': 0.6737, 'grad_norm': 58.463897705078125, 'learning_rate': 4.993546786148857e-07, 'margin_dpo/margin_mean': 15.18610954284668, 'margin_dpo/margin_std': 13.861265182495117, 'logps/chosen': -72.32835388183594, 'logps/rejected': -107.63688659667969, 'logps/ref_chosen': -67.1761703491211, 'logps/ref_rejected': -87.29859924316406, 'logits/chosen': -0.6490943431854248, 'logits/rejected': -0.6113982200622559, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 15.186108589172363, 'margin_dpo/beta_margin_mean': 1.5186108350753784, 'margin_dpo/beta_margin_std': 1.4150245189666748, 'margin_dpo/beta_margin_grad_mean': -0.244083434343338, 'margin_dpo/beta_margin_grad_std': 0.19765815138816833, 'epoch': 0.12} 12%|█████████▋ | 84/681 [03:48<26:19, 2.65s/it] 12%|█████████▊ | 85/681 [03:51<27:18, 2.75s/it] {'loss': 0.7715, 'grad_norm': 64.56900787353516, 'learning_rate': 4.992592445678582e-07, 'margin_dpo/margin_mean': 14.56472110748291, 'margin_dpo/margin_std': 15.904397010803223, 'logps/chosen': -64.1954345703125, 'logps/rejected': -98.99234008789062, 'logps/ref_chosen': -58.406620025634766, 'logps/ref_rejected': -78.63880157470703, 'logits/chosen': -0.6331249475479126, 'logits/rejected': -0.6021745204925537, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 14.564720153808594, 'margin_dpo/beta_margin_mean': 1.4564720392227173, 'margin_dpo/beta_margin_std': 1.6396623849868774, 'margin_dpo/beta_margin_grad_mean': -0.27991896867752075, 'margin_dpo/beta_margin_grad_std': 0.2167506366968155, 'epoch': 0.12} 12%|█████████▊ | 85/681 [03:51<27:18, 2.75s/it] 13%|█████████▉ | 86/681 [03:53<27:42, 
2.79s/it] {'loss': 0.9186, 'grad_norm': 117.56181335449219, 'learning_rate': 4.991572423079235e-07, 'margin_dpo/margin_mean': 15.120819091796875, 'margin_dpo/margin_std': 21.751773834228516, 'logps/chosen': -63.10496520996094, 'logps/rejected': -110.20996856689453, 'logps/ref_chosen': -56.13746643066406, 'logps/ref_rejected': -88.12165069580078, 'logits/chosen': -0.6592001914978027, 'logits/rejected': -0.6417681574821472, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 15.120819091796875, 'margin_dpo/beta_margin_mean': 1.5120818614959717, 'margin_dpo/beta_margin_std': 2.213315725326538, 'margin_dpo/beta_margin_grad_mean': -0.2995266318321228, 'margin_dpo/beta_margin_grad_std': 0.24782723188400269, 'epoch': 0.13} 13%|█████████▉ | 86/681 [03:54<27:42, 2.79s/it] 13%|██████████ | 87/681 [03:56<27:28, 2.78s/it] {'loss': 0.7859, 'grad_norm': 66.58505249023438, 'learning_rate': 4.990486745229364e-07, 'margin_dpo/margin_mean': 16.435949325561523, 'margin_dpo/margin_std': 18.915019989013672, 'logps/chosen': -62.457305908203125, 'logps/rejected': -118.72473907470703, 'logps/ref_chosen': -55.63609313964844, 'logps/ref_rejected': -95.46757507324219, 'logits/chosen': -0.7072494626045227, 'logits/rejected': -0.670096755027771, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 16.435949325561523, 'margin_dpo/beta_margin_mean': 1.6435949802398682, 'margin_dpo/beta_margin_std': 1.9270051717758179, 'margin_dpo/beta_margin_grad_mean': -0.2566095292568207, 'margin_dpo/beta_margin_grad_std': 0.23276211321353912, 'epoch': 0.13} 13%|██████████ | 87/681 [03:56<27:28, 2.78s/it] 13%|██████████▏ | 88/681 [03:59<28:07, 2.85s/it] {'loss': 0.9197, 'grad_norm': 75.93292999267578, 'learning_rate': 4.989335440737586e-07, 'margin_dpo/margin_mean': 12.606681823730469, 'margin_dpo/margin_std': 15.93301773071289, 'logps/chosen': -82.13240051269531, 'logps/rejected': -127.77642822265625, 'logps/ref_chosen': -73.67115020751953, 'logps/ref_rejected': 
-106.70849609375, 'logits/chosen': -0.6478073596954346, 'logits/rejected': -0.6310935020446777, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 12.606681823730469, 'margin_dpo/beta_margin_mean': 1.260668158531189, 'margin_dpo/beta_margin_std': 1.7176685333251953, 'margin_dpo/beta_margin_grad_mean': -0.3014436364173889, 'margin_dpo/beta_margin_grad_std': 0.23827531933784485, 'epoch': 0.13} 13%|██████████▏ | 88/681 [03:59<28:07, 2.85s/it] 13%|██████████▎ | 89/681 [04:02<27:02, 2.74s/it] {'loss': 0.7412, 'grad_norm': 56.17230224609375, 'learning_rate': 4.988118539941847e-07, 'margin_dpo/margin_mean': 12.958423614501953, 'margin_dpo/margin_std': 13.854536056518555, 'logps/chosen': -65.11277770996094, 'logps/rejected': -99.52984619140625, 'logps/ref_chosen': -60.624916076660156, 'logps/ref_rejected': -82.08354949951172, 'logits/chosen': -0.6928755640983582, 'logits/rejected': -0.6521140336990356, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 12.958422660827637, 'margin_dpo/beta_margin_mean': 1.2958422899246216, 'margin_dpo/beta_margin_std': 1.4058305025100708, 'margin_dpo/beta_margin_grad_mean': -0.27584946155548096, 'margin_dpo/beta_margin_grad_std': 0.18205879628658295, 'epoch': 0.13} 13%|██████████▎ | 89/681 [04:02<27:02, 2.74s/it] 13%|██████████▍ | 90/681 [04:04<26:07, 2.65s/it] {'loss': 0.8411, 'grad_norm': 66.36186981201172, 'learning_rate': 4.986836074908615e-07, 'margin_dpo/margin_mean': 15.813644409179688, 'margin_dpo/margin_std': 20.459163665771484, 'logps/chosen': -59.482887268066406, 'logps/rejected': -133.55593872070312, 'logps/ref_chosen': -53.285308837890625, 'logps/ref_rejected': -111.54470825195312, 'logits/chosen': -0.6513394713401794, 'logits/rejected': -0.6424415111541748, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 15.813644409179688, 'margin_dpo/beta_margin_mean': 1.5813645124435425, 'margin_dpo/beta_margin_std': 2.088043451309204, 'margin_dpo/beta_margin_grad_mean': 
-0.2849555015563965, 'margin_dpo/beta_margin_grad_std': 0.22966991364955902, 'epoch': 0.13} 13%|██████████▍ | 90/681 [04:04<26:07, 2.65s/it] 13%|██████████▌ | 91/681 [04:07<25:59, 2.64s/it] {'loss': 0.762, 'grad_norm': 65.70479583740234, 'learning_rate': 4.985488079432037e-07, 'margin_dpo/margin_mean': 15.881075859069824, 'margin_dpo/margin_std': 17.554851531982422, 'logps/chosen': -67.02444458007812, 'logps/rejected': -108.97652435302734, 'logps/ref_chosen': -61.80295944213867, 'logps/ref_rejected': -87.87395477294922, 'logits/chosen': -0.695541262626648, 'logits/rejected': -0.6568491458892822, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 15.88107681274414, 'margin_dpo/beta_margin_mean': 1.588107705116272, 'margin_dpo/beta_margin_std': 1.7729498147964478, 'margin_dpo/beta_margin_grad_mean': -0.27236229181289673, 'margin_dpo/beta_margin_grad_std': 0.23041805624961853, 'epoch': 0.13} 13%|██████████▌ | 91/681 [04:07<25:59, 2.64s/it] 14%|██████████▋ | 92/681 [04:09<26:05, 2.66s/it] {'loss': 0.8107, 'grad_norm': 60.354248046875, 'learning_rate': 4.984074589033043e-07, 'margin_dpo/margin_mean': 14.722909927368164, 'margin_dpo/margin_std': 17.423236846923828, 'logps/chosen': -56.71138000488281, 'logps/rejected': -97.6747055053711, 'logps/ref_chosen': -51.640769958496094, 'logps/ref_rejected': -77.88117980957031, 'logits/chosen': -0.7051235437393188, 'logits/rejected': -0.6763289570808411, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 14.722909927368164, 'margin_dpo/beta_margin_mean': 1.4722909927368164, 'margin_dpo/beta_margin_std': 1.7639739513397217, 'margin_dpo/beta_margin_grad_mean': -0.2847803235054016, 'margin_dpo/beta_margin_grad_std': 0.230974480509758, 'epoch': 0.14} 14%|██████████▋ | 92/681 [04:10<26:05, 2.66s/it] 14%|██████████▊ | 93/681 [04:12<24:32, 2.50s/it] {'loss': 0.6862, 'grad_norm': 48.63566589355469, 'learning_rate': 4.982595640958425e-07, 'margin_dpo/margin_mean': 14.930685043334961, 
'margin_dpo/margin_std': 15.499519348144531, 'logps/chosen': -57.98681640625, 'logps/rejected': -97.54901123046875, 'logps/ref_chosen': -52.529239654541016, 'logps/ref_rejected': -77.1607437133789, 'logits/chosen': -0.7185451984405518, 'logits/rejected': -0.6557145714759827, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 14.930684089660645, 'margin_dpo/beta_margin_mean': 1.4930684566497803, 'margin_dpo/beta_margin_std': 1.5610140562057495, 'margin_dpo/beta_margin_grad_mean': -0.25878626108169556, 'margin_dpo/beta_margin_grad_std': 0.19036650657653809, 'epoch': 0.14} 14%|██████████▊ | 93/681 [04:12<24:32, 2.50s/it] 14%|██████████▉ | 94/681 [04:15<25:34, 2.61s/it] {'loss': 0.6489, 'grad_norm': 51.54408264160156, 'learning_rate': 4.98105127417984e-07, 'margin_dpo/margin_mean': 15.727283477783203, 'margin_dpo/margin_std': 14.665702819824219, 'logps/chosen': -67.19898986816406, 'logps/rejected': -121.30268859863281, 'logps/ref_chosen': -61.22261047363281, 'logps/ref_rejected': -99.59902954101562, 'logits/chosen': -0.6754232048988342, 'logits/rejected': -0.6463443040847778, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 15.72728443145752, 'margin_dpo/beta_margin_mean': 1.5727283954620361, 'margin_dpo/beta_margin_std': 1.4842208623886108, 'margin_dpo/beta_margin_grad_mean': -0.2494257092475891, 'margin_dpo/beta_margin_grad_std': 0.1913631409406662, 'epoch': 0.14} 14%|██████████▉ | 94/681 [04:15<25:34, 2.61s/it] 14%|███████████ | 95/681 [04:17<25:07, 2.57s/it] {'loss': 0.7168, 'grad_norm': 49.133419036865234, 'learning_rate': 4.979441529392784e-07, 'margin_dpo/margin_mean': 12.927990913391113, 'margin_dpo/margin_std': 12.453845977783203, 'logps/chosen': -57.13197326660156, 'logps/rejected': -93.41667175292969, 'logps/ref_chosen': -52.52364730834961, 'logps/ref_rejected': -75.88035583496094, 'logits/chosen': -0.7007203102111816, 'logits/rejected': -0.6629537343978882, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 12.927990913391113, 'margin_dpo/beta_margin_mean': 1.2927991151809692, 'margin_dpo/beta_margin_std': 1.250800371170044, 'margin_dpo/beta_margin_grad_mean': -0.26963499188423157, 'margin_dpo/beta_margin_grad_std': 0.18063600361347198, 'epoch': 0.14} 14%|███████████ | 95/681 [04:17<25:07, 2.57s/it] 14%|███████████▏ | 96/681 [04:20<25:16, 2.59s/it] {'loss': 0.6242, 'grad_norm': 50.28048324584961, 'learning_rate': 4.977766449015534e-07, 'margin_dpo/margin_mean': 17.10177230834961, 'margin_dpo/margin_std': 16.97222328186035, 'logps/chosen': -65.92119598388672, 'logps/rejected': -117.46200561523438, 'logps/ref_chosen': -62.15697479248047, 'logps/ref_rejected': -96.59601593017578, 'logits/chosen': -0.6796199083328247, 'logits/rejected': -0.6378945708274841, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 17.101774215698242, 'margin_dpo/beta_margin_mean': 1.7101774215698242, 'margin_dpo/beta_margin_std': 1.7064077854156494, 'margin_dpo/beta_margin_grad_mean': -0.23479405045509338, 'margin_dpo/beta_margin_grad_std': 0.18774589896202087, 'epoch': 0.14} 14%|███████████▏ | 96/681 [04:20<25:16, 2.59s/it] 14%|███████████▎ | 97/681 [04:22<25:48, 2.65s/it] {'loss': 0.6845, 'grad_norm': 53.316551208496094, 'learning_rate': 4.976026077188012e-07, 'margin_dpo/margin_mean': 13.75833511352539, 'margin_dpo/margin_std': 12.287176132202148, 'logps/chosen': -59.297088623046875, 'logps/rejected': -95.37380981445312, 'logps/ref_chosen': -54.64636993408203, 'logps/ref_rejected': -76.96475219726562, 'logits/chosen': -0.6628963947296143, 'logits/rejected': -0.6078641414642334, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 13.758334159851074, 'margin_dpo/beta_margin_mean': 1.375833511352539, 'margin_dpo/beta_margin_std': 1.2895034551620483, 'margin_dpo/beta_margin_grad_mean': -0.26004087924957275, 'margin_dpo/beta_margin_grad_std': 0.18338225781917572, 'epoch': 0.14} 14%|███████████▎ | 97/681 [04:22<25:48, 2.65s/it] 
14%|███████████▎ | 98/681 [04:25<25:06, 2.58s/it] {'loss': 0.748, 'grad_norm': 59.0107536315918, 'learning_rate': 4.974220459770639e-07, 'margin_dpo/margin_mean': 14.806069374084473, 'margin_dpo/margin_std': 15.232458114624023, 'logps/chosen': -71.05479431152344, 'logps/rejected': -117.12971496582031, 'logps/ref_chosen': -65.25862884521484, 'logps/ref_rejected': -96.5274887084961, 'logits/chosen': -0.6648178100585938, 'logits/rejected': -0.6405047178268433, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 14.806070327758789, 'margin_dpo/beta_margin_mean': 1.480607032775879, 'margin_dpo/beta_margin_std': 1.5301272869110107, 'margin_dpo/beta_margin_grad_mean': -0.25851550698280334, 'margin_dpo/beta_margin_grad_std': 0.21575090289115906, 'epoch': 0.14} 14%|███████████▎ | 98/681 [04:25<25:06, 2.58s/it] 15%|███████████▍ | 99/681 [04:27<24:02, 2.48s/it] {'loss': 0.6371, 'grad_norm': 48.36380386352539, 'learning_rate': 4.972349644343108e-07, 'margin_dpo/margin_mean': 16.02899932861328, 'margin_dpo/margin_std': 16.368377685546875, 'logps/chosen': -50.50402069091797, 'logps/rejected': -107.33246612548828, 'logps/ref_chosen': -45.63848114013672, 'logps/ref_rejected': -86.43792724609375, 'logits/chosen': -0.6831210851669312, 'logits/rejected': -0.6707972884178162, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 16.02899932861328, 'margin_dpo/beta_margin_mean': 1.6028999090194702, 'margin_dpo/beta_margin_std': 1.6380233764648438, 'margin_dpo/beta_margin_grad_mean': -0.24989381432533264, 'margin_dpo/beta_margin_grad_std': 0.17566484212875366, 'epoch': 0.15} 15%|███████████▍ | 99/681 [04:27<24:02, 2.48s/it] 15%|███████████▍ | 100/681 [04:30<25:19, 2.62s/it] {'loss': 0.9045, 'grad_norm': 66.3724365234375, 'learning_rate': 4.970413680203148e-07, 'margin_dpo/margin_mean': 11.581655502319336, 'margin_dpo/margin_std': 15.148920059204102, 'logps/chosen': -62.664703369140625, 'logps/rejected': -90.71258544921875, 'logps/ref_chosen': 
-57.5939826965332, 'logps/ref_rejected': -74.06021118164062, 'logits/chosen': -0.6820343732833862, 'logits/rejected': -0.6383761167526245, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 11.581656455993652, 'margin_dpo/beta_margin_mean': 1.158165693283081, 'margin_dpo/beta_margin_std': 1.5505050420761108, 'margin_dpo/beta_margin_grad_mean': -0.30843135714530945, 'margin_dpo/beta_margin_grad_std': 0.22273820638656616, 'epoch': 0.15} 15%|███████████▍ | 100/681 [04:30<25:19, 2.62s/it][INFO|trainer.py:4307] 2026-04-17 21:31:00,913 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-17 21:31:00,913 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-17 21:31:00,913 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-17 21:36:01,990 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-17 21:36:01,990 >> Batch size = 8 0%| | 0/73 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200 [INFO|configuration_utils.py:419] 2026-04-17 21:36:59,407 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200/config.json [INFO|configuration_utils.py:911] 2026-04-17 21:36:59,411 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-17 21:37:51,624 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200/model.safetensors.index.json. 
[INFO|tokenization_utils_base.py:2510] 2026-04-17 21:37:51,632 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-17 21:37:51,645 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200/special_tokens_map.json 30%|█████████████████████▊ | 201/681 [15:06<13:39:01, 102.38s/it] {'loss': 0.5336, 'grad_norm': 52.3123893737793, 'learning_rate': 4.455721242469372e-07, 'margin_dpo/margin_mean': 25.56842803955078, 'margin_dpo/margin_std': 23.23642349243164, 'logps/chosen': -83.8150634765625, 'logps/rejected': -148.78948974609375, 'logps/ref_chosen': -75.4022216796875, 'logps/ref_rejected': -114.80821990966797, 'logits/chosen': -0.6279897689819336, 'logits/rejected': -0.5977976322174072, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.56842613220215, 'margin_dpo/beta_margin_mean': 2.556842803955078, 'margin_dpo/beta_margin_std': 2.3333568572998047, 'margin_dpo/beta_margin_grad_mean': -0.18962885439395905, 'margin_dpo/beta_margin_grad_std': 0.21612294018268585, 'epoch': 0.3} 30%|█████████████████████▊ | 201/681 [15:06<13:39:01, 102.38s/it] 30%|██████████████████████▌ | 202/681 [15:09<9:38:42, 72.49s/it] {'loss': 0.7708, 'grad_norm': 71.65006256103516, 'learning_rate': 4.4477014363141755e-07, 'margin_dpo/margin_mean': 18.85692596435547, 'margin_dpo/margin_std': 21.052621841430664, 'logps/chosen': -60.95005798339844, 'logps/rejected': -116.69070434570312, 'logps/ref_chosen': -50.101318359375, 'logps/ref_rejected': -86.98503112792969, 'logits/chosen': -0.6751070618629456, 'logits/rejected': -0.663360059261322, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 18.85692596435547, 'margin_dpo/beta_margin_mean': 1.8856925964355469, 
'margin_dpo/beta_margin_std': 2.1072704792022705, 'margin_dpo/beta_margin_grad_mean': -0.25191596150398254, 'margin_dpo/beta_margin_grad_std': 0.24605146050453186, 'epoch': 0.3} 30%|██████████████████████▌ | 202/681 [15:09<9:38:42, 72.49s/it] 30%|██████████████████████▋ | 203/681 [15:12<6:51:18, 51.63s/it] {'loss': 0.532, 'grad_norm': 45.57592010498047, 'learning_rate': 4.439630306414758e-07, 'margin_dpo/margin_mean': 21.060504913330078, 'margin_dpo/margin_std': 18.373455047607422, 'logps/chosen': -68.63609313964844, 'logps/rejected': -114.98287963867188, 'logps/ref_chosen': -60.60969543457031, 'logps/ref_rejected': -85.89596557617188, 'logits/chosen': -0.6758503317832947, 'logits/rejected': -0.632592499256134, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 21.060503005981445, 'margin_dpo/beta_margin_mean': 2.106050491333008, 'margin_dpo/beta_margin_std': 1.8404932022094727, 'margin_dpo/beta_margin_grad_mean': -0.20186971127986908, 'margin_dpo/beta_margin_grad_std': 0.19567202031612396, 'epoch': 0.3} 30%|██████████████████████▋ | 203/681 [15:12<6:51:18, 51.63s/it] 30%|██████████████████████▊ | 204/681 [15:15<4:54:09, 37.00s/it] {'loss': 0.5197, 'grad_norm': 46.98274230957031, 'learning_rate': 4.431508065452897e-07, 'margin_dpo/margin_mean': 21.9324893951416, 'margin_dpo/margin_std': 19.7504825592041, 'logps/chosen': -89.8402099609375, 'logps/rejected': -119.30364990234375, 'logps/ref_chosen': -80.16496276855469, 'logps/ref_rejected': -87.69590759277344, 'logits/chosen': -0.640312910079956, 'logits/rejected': -0.5806652307510376, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 21.9324893951416, 'margin_dpo/beta_margin_mean': 2.193248987197876, 'margin_dpo/beta_margin_std': 2.008610725402832, 'margin_dpo/beta_margin_grad_mean': -0.1984976977109909, 'margin_dpo/beta_margin_grad_std': 0.19227007031440735, 'epoch': 0.3} 30%|██████████████████████▊ | 204/681 [15:15<4:54:09, 37.00s/it] 30%|██████████████████████▉ | 205/681 
[15:17<3:31:53, 26.71s/it] {'loss': 0.6391, 'grad_norm': 67.07288360595703, 'learning_rate': 4.4233349274571974e-07, 'margin_dpo/margin_mean': 23.78384780883789, 'margin_dpo/margin_std': 23.02971649169922, 'logps/chosen': -70.58891296386719, 'logps/rejected': -120.11308288574219, 'logps/ref_chosen': -59.384735107421875, 'logps/ref_rejected': -85.12505340576172, 'logits/chosen': -0.690580427646637, 'logits/rejected': -0.6472284197807312, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.78384780883789, 'margin_dpo/beta_margin_mean': 2.378384828567505, 'margin_dpo/beta_margin_std': 2.325178384780884, 'margin_dpo/beta_margin_grad_mean': -0.21127313375473022, 'margin_dpo/beta_margin_grad_std': 0.2417152225971222, 'epoch': 0.3} 30%|██████████████████████▉ | 205/681 [15:18<3:31:53, 26.71s/it] 30%|██████████████████████▉ | 206/681 [15:20<2:33:29, 19.39s/it] {'loss': 0.4331, 'grad_norm': 44.62752914428711, 'learning_rate': 4.415111107797445e-07, 'margin_dpo/margin_mean': 27.303932189941406, 'margin_dpo/margin_std': 21.387907028198242, 'logps/chosen': -57.75178527832031, 'logps/rejected': -137.04470825195312, 'logps/ref_chosen': -46.964500427246094, 'logps/ref_rejected': -98.9534912109375, 'logits/chosen': -0.6897552013397217, 'logits/rejected': -0.6878204345703125, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.303930282592773, 'margin_dpo/beta_margin_mean': 2.730393171310425, 'margin_dpo/beta_margin_std': 2.1477179527282715, 'margin_dpo/beta_margin_grad_mean': -0.16189786791801453, 'margin_dpo/beta_margin_grad_std': 0.19883880019187927, 'epoch': 0.3} 30%|██████████████████████▉ | 206/681 [15:20<2:33:29, 19.39s/it] 30%|███████████████████████ | 207/681 [15:22<1:53:25, 14.36s/it] {'loss': 0.4613, 'grad_norm': 51.993370056152344, 'learning_rate': 4.4068368231789365e-07, 'margin_dpo/margin_mean': 28.46685028076172, 'margin_dpo/margin_std': 25.901378631591797, 'logps/chosen': -64.26461791992188, 'logps/rejected': 
-121.12300872802734, 'logps/ref_chosen': -56.05625915527344, 'logps/ref_rejected': -84.44779968261719, 'logits/chosen': -0.6620955467224121, 'logits/rejected': -0.6329070925712585, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.46685028076172, 'margin_dpo/beta_margin_mean': 2.8466851711273193, 'margin_dpo/beta_margin_std': 2.5996739864349365, 'margin_dpo/beta_margin_grad_mean': -0.1773681491613388, 'margin_dpo/beta_margin_grad_std': 0.19014772772789001, 'epoch': 0.3} 30%|███████████████████████ | 207/681 [15:22<1:53:25, 14.36s/it] 31%|███████████████████████▏ | 208/681 [15:25<1:25:31, 10.85s/it] {'loss': 0.4839, 'grad_norm': 52.05033493041992, 'learning_rate': 4.398512291636768e-07, 'margin_dpo/margin_mean': 23.237594604492188, 'margin_dpo/margin_std': 20.450965881347656, 'logps/chosen': -79.44926452636719, 'logps/rejected': -129.90614318847656, 'logps/ref_chosen': -67.06761169433594, 'logps/ref_rejected': -94.28689575195312, 'logits/chosen': -0.6714012622833252, 'logits/rejected': -0.6417751312255859, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.237592697143555, 'margin_dpo/beta_margin_mean': 2.3237593173980713, 'margin_dpo/beta_margin_std': 2.0613441467285156, 'margin_dpo/beta_margin_grad_mean': -0.19027359783649445, 'margin_dpo/beta_margin_grad_std': 0.18024224042892456, 'epoch': 0.31} 31%|███████████████████████▏ | 208/681 [15:25<1:25:31, 10.85s/it] 31%|███████████████████████▎ | 209/681 [15:27<1:05:24, 8.31s/it] {'loss': 0.6043, 'grad_norm': 49.21156692504883, 'learning_rate': 4.3901377325300857e-07, 'margin_dpo/margin_mean': 23.724063873291016, 'margin_dpo/margin_std': 21.8853816986084, 'logps/chosen': -65.82164764404297, 'logps/rejected': -114.3055419921875, 'logps/ref_chosen': -56.18169403076172, 'logps/ref_rejected': -80.94152069091797, 'logits/chosen': -0.6503257751464844, 'logits/rejected': -0.6238174438476562, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 
23.724063873291016, 'margin_dpo/beta_margin_mean': 2.372406482696533, 'margin_dpo/beta_margin_std': 2.2294185161590576, 'margin_dpo/beta_margin_grad_mean': -0.20685909688472748, 'margin_dpo/beta_margin_grad_std': 0.23329907655715942, 'epoch': 0.31} 31%|███████████████████████▎ | 209/681 [15:28<1:05:24, 8.31s/it] 31%|████████████████████████ | 210/681 [15:30<51:28, 6.56s/it] {'loss': 0.5055, 'grad_norm': 47.80137634277344, 'learning_rate': 4.381713366536311e-07, 'margin_dpo/margin_mean': 22.228591918945312, 'margin_dpo/margin_std': 18.896484375, 'logps/chosen': -56.31086349487305, 'logps/rejected': -108.84925842285156, 'logps/ref_chosen': -46.371822357177734, 'logps/ref_rejected': -76.68162536621094, 'logits/chosen': -0.6832214593887329, 'logits/rejected': -0.6534386873245239, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 22.228591918945312, 'margin_dpo/beta_margin_mean': 2.2228593826293945, 'margin_dpo/beta_margin_std': 1.9132236242294312, 'margin_dpo/beta_margin_grad_mean': -0.19675123691558838, 'margin_dpo/beta_margin_grad_std': 0.18586638569831848, 'epoch': 0.31} 31%|████████████████████████ | 210/681 [15:30<51:28, 6.56s/it] 31%|████████████████████████▏ | 211/681 [15:32<41:30, 5.30s/it] {'loss': 0.597, 'grad_norm': 59.744529724121094, 'learning_rate': 4.373239415645323e-07, 'margin_dpo/margin_mean': 22.822023391723633, 'margin_dpo/margin_std': 20.654033660888672, 'logps/chosen': -91.70536041259766, 'logps/rejected': -122.416015625, 'logps/ref_chosen': -78.93235778808594, 'logps/ref_rejected': -86.82098388671875, 'logits/chosen': -0.6856693029403687, 'logits/rejected': -0.6315619945526123, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 22.822023391723633, 'margin_dpo/beta_margin_mean': 2.2822024822235107, 'margin_dpo/beta_margin_std': 2.1043384075164795, 'margin_dpo/beta_margin_grad_mean': -0.20305074751377106, 'margin_dpo/beta_margin_grad_std': 0.23186683654785156, 'epoch': 0.31} 31%|████████████████████████▏ | 
211/681 [15:32<41:30, 5.30s/it] 31%|████████████████████████▎ | 212/681 [15:35<35:15, 4.51s/it] {'loss': 0.4566, 'grad_norm': 49.052188873291016, 'learning_rate': 4.3647161031536086e-07, 'margin_dpo/margin_mean': 28.725915908813477, 'margin_dpo/margin_std': 24.575883865356445, 'logps/chosen': -70.00460052490234, 'logps/rejected': -143.5913543701172, 'logps/ref_chosen': -58.19701385498047, 'logps/ref_rejected': -103.05784606933594, 'logits/chosen': -0.6898226737976074, 'logits/rejected': -0.6595550775527954, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.725915908813477, 'margin_dpo/beta_margin_mean': 2.872591733932495, 'margin_dpo/beta_margin_std': 2.5697410106658936, 'margin_dpo/beta_margin_grad_mean': -0.17350082099437714, 'margin_dpo/beta_margin_grad_std': 0.19579628109931946, 'epoch': 0.31} 31%|████████████████████████▎ | 212/681 [15:35<35:15, 4.51s/it] 31%|████████████████████████▍ | 213/681 [15:38<30:49, 3.95s/it] {'loss': 0.4856, 'grad_norm': 51.70877456665039, 'learning_rate': 4.3561436536583774e-07, 'margin_dpo/margin_mean': 29.106151580810547, 'margin_dpo/margin_std': 25.876745223999023, 'logps/chosen': -77.46551513671875, 'logps/rejected': -132.9736785888672, 'logps/ref_chosen': -67.51271057128906, 'logps/ref_rejected': -93.91471862792969, 'logits/chosen': -0.6648838520050049, 'logits/rejected': -0.6213950514793396, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.10615348815918, 'margin_dpo/beta_margin_mean': 2.9106154441833496, 'margin_dpo/beta_margin_std': 2.622797966003418, 'margin_dpo/beta_margin_grad_mean': -0.1718873530626297, 'margin_dpo/beta_margin_grad_std': 0.20221562683582306, 'epoch': 0.31} 31%|████████████████████████▍ | 213/681 [15:38<30:49, 3.95s/it] 31%|████████████████████████▌ | 214/681 [15:40<26:54, 3.46s/it] {'loss': 0.6245, 'grad_norm': 59.337345123291016, 'learning_rate': 4.3475222930516473e-07, 'margin_dpo/margin_mean': 23.016742706298828, 'margin_dpo/margin_std': 
24.032947540283203, 'logps/chosen': -52.264801025390625, 'logps/rejected': -111.19406127929688, 'logps/ref_chosen': -41.604888916015625, 'logps/ref_rejected': -77.51741027832031, 'logits/chosen': -0.6546872854232788, 'logits/rejected': -0.6387699842453003, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.016742706298828, 'margin_dpo/beta_margin_mean': 2.3016743659973145, 'margin_dpo/beta_margin_std': 2.411076784133911, 'margin_dpo/beta_margin_grad_mean': -0.21652430295944214, 'margin_dpo/beta_margin_grad_std': 0.2097923308610916, 'epoch': 0.31} 31%|████████████████████████▌ | 214/681 [15:40<26:54, 3.46s/it] 32%|████████████████████████▋ | 215/681 [15:43<25:31, 3.29s/it] {'loss': 0.5351, 'grad_norm': 56.185760498046875, 'learning_rate': 4.3388522485142885e-07, 'margin_dpo/margin_mean': 24.892501831054688, 'margin_dpo/margin_std': 23.67044448852539, 'logps/chosen': -63.87721252441406, 'logps/rejected': -125.45509338378906, 'logps/ref_chosen': -53.279266357421875, 'logps/ref_rejected': -89.96464538574219, 'logits/chosen': -0.6533488035202026, 'logits/rejected': -0.624896228313446, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.892501831054688, 'margin_dpo/beta_margin_mean': 2.4892501831054688, 'margin_dpo/beta_margin_std': 2.3997206687927246, 'margin_dpo/beta_margin_grad_mean': -0.1931016743183136, 'margin_dpo/beta_margin_grad_std': 0.20534518361091614, 'epoch': 0.32} 32%|████████████████████████▋ | 215/681 [15:43<25:31, 3.29s/it] 32%|████████████████████████▋ | 216/681 [15:46<24:52, 3.21s/it] {'loss': 0.5867, 'grad_norm': 63.23625564575195, 'learning_rate': 4.330133748510036e-07, 'margin_dpo/margin_mean': 26.958616256713867, 'margin_dpo/margin_std': 26.038864135742188, 'logps/chosen': -61.9163932800293, 'logps/rejected': -117.18614196777344, 'logps/ref_chosen': -48.887794494628906, 'logps/ref_rejected': -77.19892883300781, 'logits/chosen': -0.6673412919044495, 'logits/rejected': -0.6347181797027588, 
'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.958616256713867, 'margin_dpo/beta_margin_mean': 2.695861577987671, 'margin_dpo/beta_margin_std': 2.648024797439575, 'margin_dpo/beta_margin_grad_mean': -0.2070668637752533, 'margin_dpo/beta_margin_grad_std': 0.22892099618911743, 'epoch': 0.32} 32%|████████████████████████▋ | 216/681 [15:46<24:52, 3.21s/it] 32%|████████████████████████▊ | 217/681 [15:48<23:21, 3.02s/it] {'loss': 0.4057, 'grad_norm': 40.630271911621094, 'learning_rate': 4.3213670227794757e-07, 'margin_dpo/margin_mean': 27.929889678955078, 'margin_dpo/margin_std': 21.507854461669922, 'logps/chosen': -60.836097717285156, 'logps/rejected': -138.99900817871094, 'logps/ref_chosen': -49.845306396484375, 'logps/ref_rejected': -100.07832336425781, 'logits/chosen': -0.6815335750579834, 'logits/rejected': -0.6386614441871643, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.929887771606445, 'margin_dpo/beta_margin_mean': 2.7929890155792236, 'margin_dpo/beta_margin_std': 2.152282953262329, 'margin_dpo/beta_margin_grad_mean': -0.15030139684677124, 'margin_dpo/beta_margin_grad_std': 0.1869634985923767, 'epoch': 0.32} 32%|████████████████████████▊ | 217/681 [15:48<23:21, 3.02s/it] 32%|████████████████████████▉ | 218/681 [15:51<22:34, 2.93s/it] {'loss': 0.5357, 'grad_norm': 54.89970397949219, 'learning_rate': 4.3125523023339815e-07, 'margin_dpo/margin_mean': 24.937423706054688, 'margin_dpo/margin_std': 23.69991683959961, 'logps/chosen': -69.97561645507812, 'logps/rejected': -124.18275451660156, 'logps/ref_chosen': -58.576683044433594, 'logps/ref_rejected': -87.84639739990234, 'logits/chosen': -0.6562488079071045, 'logits/rejected': -0.6250983476638794, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.937423706054688, 'margin_dpo/beta_margin_mean': 2.4937424659729004, 'margin_dpo/beta_margin_std': 2.4085628986358643, 'margin_dpo/beta_margin_grad_mean': -0.19920094311237335, 
'margin_dpo/beta_margin_grad_std': 0.20679454505443573, 'epoch': 0.32} 32%|████████████████████████▉ | 218/681 [15:51<22:34, 2.93s/it] 32%|█████████████████████████ | 219/681 [15:54<22:06, 2.87s/it] {'loss': 0.5267, 'grad_norm': 60.82085037231445, 'learning_rate': 4.303689819449636e-07, 'margin_dpo/margin_mean': 22.007417678833008, 'margin_dpo/margin_std': 19.649311065673828, 'logps/chosen': -72.4845962524414, 'logps/rejected': -119.23858642578125, 'logps/ref_chosen': -61.083858489990234, 'logps/ref_rejected': -85.83042907714844, 'logits/chosen': -0.6473067998886108, 'logits/rejected': -0.6150014400482178, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 22.007417678833008, 'margin_dpo/beta_margin_mean': 2.200741767883301, 'margin_dpo/beta_margin_std': 1.9767764806747437, 'margin_dpo/beta_margin_grad_mean': -0.19810640811920166, 'margin_dpo/beta_margin_grad_std': 0.19194501638412476, 'epoch': 0.32} 32%|█████████████████████████ | 219/681 [15:54<22:06, 2.87s/it] 32%|█████████████████████████▏ | 220/681 [15:56<21:30, 2.80s/it] {'loss': 0.5029, 'grad_norm': 47.847412109375, 'learning_rate': 4.2947798076611047e-07, 'margin_dpo/margin_mean': 20.889970779418945, 'margin_dpo/margin_std': 16.6192626953125, 'logps/chosen': -81.12652587890625, 'logps/rejected': -119.67072296142578, 'logps/ref_chosen': -70.03128051757812, 'logps/ref_rejected': -87.68551635742188, 'logits/chosen': -0.6549057960510254, 'logits/rejected': -0.6138431429862976, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 20.889970779418945, 'margin_dpo/beta_margin_mean': 2.0889971256256104, 'margin_dpo/beta_margin_std': 1.714986801147461, 'margin_dpo/beta_margin_grad_mean': -0.19834579527378082, 'margin_dpo/beta_margin_grad_std': 0.180747851729393, 'epoch': 0.32} 32%|█████████████████████████▏ | 220/681 [15:57<21:30, 2.80s/it] 32%|█████████████████████████▎ | 221/681 [15:59<20:53, 2.73s/it] {'loss': 0.339, 'grad_norm': 48.202903747558594, 'learning_rate': 
4.285822501755485e-07, 'margin_dpo/margin_mean': 32.81925582885742, 'margin_dpo/margin_std': 22.837791442871094, 'logps/chosen': -64.44230651855469, 'logps/rejected': -151.5745391845703, 'logps/ref_chosen': -52.15470886230469, 'logps/ref_rejected': -106.46768188476562, 'logits/chosen': -0.6510884761810303, 'logits/rejected': -0.6392531394958496, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.81925582885742, 'margin_dpo/beta_margin_mean': 3.281925678253174, 'margin_dpo/beta_margin_std': 2.298614740371704, 'margin_dpo/beta_margin_grad_mean': -0.12460769712924957, 'margin_dpo/beta_margin_grad_std': 0.1816825121641159, 'epoch': 0.32} 32%|█████████████████████████▎ | 221/681 [15:59<20:53, 2.73s/it] 33%|█████████████████████████▍ | 222/681 [16:02<20:49, 2.72s/it] {'loss': 0.5718, 'grad_norm': 76.29179382324219, 'learning_rate': 4.276818137766118e-07, 'margin_dpo/margin_mean': 26.146484375, 'margin_dpo/margin_std': 24.794532775878906, 'logps/chosen': -74.65057373046875, 'logps/rejected': -139.82711791992188, 'logps/ref_chosen': -60.971099853515625, 'logps/ref_rejected': -100.00115203857422, 'logits/chosen': -0.7204064130783081, 'logits/rejected': -0.6859586238861084, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.146484375, 'margin_dpo/beta_margin_mean': 2.6146483421325684, 'margin_dpo/beta_margin_std': 2.517277479171753, 'margin_dpo/beta_margin_grad_mean': -0.19307678937911987, 'margin_dpo/beta_margin_grad_std': 0.22420592606067657, 'epoch': 0.33} 33%|█████████████████████████▍ | 222/681 [16:02<20:49, 2.72s/it] 33%|█████████████████████████▌ | 223/681 [16:04<19:38, 2.57s/it] {'loss': 0.738, 'grad_norm': 78.53938293457031, 'learning_rate': 4.2677669529663686e-07, 'margin_dpo/margin_mean': 22.292381286621094, 'margin_dpo/margin_std': 22.842185974121094, 'logps/chosen': -68.55857849121094, 'logps/rejected': -121.03541564941406, 'logps/ref_chosen': -52.64057922363281, 'logps/ref_rejected': -82.82502746582031, 
'logits/chosen': -0.7086101770401001, 'logits/rejected': -0.6605532169342041, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 22.292381286621094, 'margin_dpo/beta_margin_mean': 2.229238271713257, 'margin_dpo/beta_margin_std': 2.2963743209838867, 'margin_dpo/beta_margin_grad_mean': -0.22235842049121857, 'margin_dpo/beta_margin_grad_std': 0.26257219910621643, 'epoch': 0.33} 33%|█████████████████████████▌ | 223/681 [16:04<19:38, 2.57s/it] 33%|█████████████████████████▋ | 224/681 [16:06<18:35, 2.44s/it] {'loss': 0.5673, 'grad_norm': 74.9097671508789, 'learning_rate': 4.2586691858633747e-07, 'margin_dpo/margin_mean': 26.778461456298828, 'margin_dpo/margin_std': 24.951221466064453, 'logps/chosen': -61.69850158691406, 'logps/rejected': -116.998046875, 'logps/ref_chosen': -48.59540939331055, 'logps/ref_rejected': -77.11648559570312, 'logits/chosen': -0.6751635074615479, 'logits/rejected': -0.6340160369873047, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.778461456298828, 'margin_dpo/beta_margin_mean': 2.6778461933135986, 'margin_dpo/beta_margin_std': 2.5320727825164795, 'margin_dpo/beta_margin_grad_mean': -0.19054511189460754, 'margin_dpo/beta_margin_grad_std': 0.2218094766139984, 'epoch': 0.33} 33%|█████████████████████████▋ | 224/681 [16:06<18:35, 2.44s/it] 33%|█████████████████████████▊ | 225/681 [16:08<18:15, 2.40s/it] {'loss': 0.4077, 'grad_norm': 43.42683792114258, 'learning_rate': 4.249525076191759e-07, 'margin_dpo/margin_mean': 32.95307922363281, 'margin_dpo/margin_std': 26.78388214111328, 'logps/chosen': -72.66340637207031, 'logps/rejected': -147.5189208984375, 'logps/ref_chosen': -58.000465393066406, 'logps/ref_rejected': -99.90290832519531, 'logits/chosen': -0.6741304397583008, 'logits/rejected': -0.6419914960861206, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.95307922363281, 'margin_dpo/beta_margin_mean': 3.2953078746795654, 'margin_dpo/beta_margin_std': 2.6787664890289307, 
'margin_dpo/beta_margin_grad_mean': -0.14978624880313873, 'margin_dpo/beta_margin_grad_std': 0.203273743391037, 'epoch': 0.33} 33%|█████████████████████████▊ | 225/681 [16:08<18:15, 2.40s/it] 33%|█████████████████████████▉ | 226/681 [16:11<19:03, 2.51s/it] {'loss': 0.4851, 'grad_norm': 51.3669548034668, 'learning_rate': 4.2403348649073167e-07, 'margin_dpo/margin_mean': 25.563575744628906, 'margin_dpo/margin_std': 21.131916046142578, 'logps/chosen': -69.51836395263672, 'logps/rejected': -114.87089538574219, 'logps/ref_chosen': -58.898799896240234, 'logps/ref_rejected': -78.68775939941406, 'logits/chosen': -0.6864838600158691, 'logits/rejected': -0.6341279745101929, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.563575744628906, 'margin_dpo/beta_margin_mean': 2.5563576221466064, 'margin_dpo/beta_margin_std': 2.2327442169189453, 'margin_dpo/beta_margin_grad_mean': -0.17993833124637604, 'margin_dpo/beta_margin_grad_std': 0.19534794986248016, 'epoch': 0.33} 33%|█████████████████████████▉ | 226/681 [16:11<19:03, 2.51s/it] 33%|██████████████████████████ | 227/681 [16:14<19:00, 2.51s/it] {'loss': 0.4141, 'grad_norm': 48.44467544555664, 'learning_rate': 4.2310987941806615e-07, 'margin_dpo/margin_mean': 31.536819458007812, 'margin_dpo/margin_std': 26.029647827148438, 'logps/chosen': -70.79923248291016, 'logps/rejected': -142.67623901367188, 'logps/ref_chosen': -59.072181701660156, 'logps/ref_rejected': -99.41236877441406, 'logits/chosen': -0.6759487390518188, 'logits/rejected': -0.6457496881484985, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.53681755065918, 'margin_dpo/beta_margin_mean': 3.153681993484497, 'margin_dpo/beta_margin_std': 2.6311187744140625, 'margin_dpo/beta_margin_grad_mean': -0.15354523062705994, 'margin_dpo/beta_margin_grad_std': 0.1981254369020462, 'epoch': 0.33} 33%|██████████████████████████ | 227/681 [16:14<19:00, 2.51s/it] 33%|██████████████████████████ | 228/681 [16:17<20:26, 2.71s/it] {'loss': 
0.5336, 'grad_norm': 55.06504821777344, 'learning_rate': 4.2218171073908463e-07, 'margin_dpo/margin_mean': 24.27182960510254, 'margin_dpo/margin_std': 20.79153823852539, 'logps/chosen': -78.89456176757812, 'logps/rejected': -128.3238525390625, 'logps/ref_chosen': -65.89129638671875, 'logps/ref_rejected': -91.04875183105469, 'logits/chosen': -0.6563422679901123, 'logits/rejected': -0.6227169036865234, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.27182960510254, 'margin_dpo/beta_margin_mean': 2.427182912826538, 'margin_dpo/beta_margin_std': 2.091808795928955, 'margin_dpo/beta_margin_grad_mean': -0.18722449243068695, 'margin_dpo/beta_margin_grad_std': 0.2196025550365448, 'epoch': 0.33} 33%|██████████████████████████ | 228/681 [16:17<20:26, 2.71s/it] 34%|██████████████████████████▏ | 229/681 [16:19<19:50, 2.63s/it] {'loss': 0.598, 'grad_norm': 64.2571792602539, 'learning_rate': 4.212490049118951e-07, 'margin_dpo/margin_mean': 27.103031158447266, 'margin_dpo/margin_std': 25.39706802368164, 'logps/chosen': -85.07996368408203, 'logps/rejected': -126.00403594970703, 'logps/ref_chosen': -70.70636749267578, 'logps/ref_rejected': -84.52740478515625, 'logits/chosen': -0.6885573863983154, 'logits/rejected': -0.6359836459159851, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.1030330657959, 'margin_dpo/beta_margin_mean': 2.71030330657959, 'margin_dpo/beta_margin_std': 2.555929660797119, 'margin_dpo/beta_margin_grad_mean': -0.1898403912782669, 'margin_dpo/beta_margin_grad_std': 0.23285524547100067, 'epoch': 0.34} 34%|██████████████████████████▏ | 229/681 [16:19<19:50, 2.63s/it] 34%|██████████████████████████▎ | 230/681 [16:22<19:16, 2.56s/it] {'loss': 0.5067, 'grad_norm': 50.68180465698242, 'learning_rate': 4.203117865141635e-07, 'margin_dpo/margin_mean': 30.37428092956543, 'margin_dpo/margin_std': 27.84336280822754, 'logps/chosen': -51.398292541503906, 'logps/rejected': -128.11248779296875, 'logps/ref_chosen': 
-39.282005310058594, 'logps/ref_rejected': -85.62191009521484, 'logits/chosen': -0.6804044842720032, 'logits/rejected': -0.6706264019012451, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.37428092956543, 'margin_dpo/beta_margin_mean': 3.037428140640259, 'margin_dpo/beta_margin_std': 2.7951576709747314, 'margin_dpo/beta_margin_grad_mean': -0.16133855283260345, 'margin_dpo/beta_margin_grad_std': 0.21256163716316223, 'epoch': 0.34} 34%|██████████████████████████▎ | 230/681 [16:22<19:16, 2.56s/it] 34%|██████████████████████████▍ | 231/681 [16:24<19:31, 2.60s/it] {'loss': 0.4698, 'grad_norm': 42.53703689575195, 'learning_rate': 4.1937008024246625e-07, 'margin_dpo/margin_mean': 26.028377532958984, 'margin_dpo/margin_std': 24.996898651123047, 'logps/chosen': -74.62582397460938, 'logps/rejected': -111.50166320800781, 'logps/ref_chosen': -63.27644348144531, 'logps/ref_rejected': -74.1239013671875, 'logits/chosen': -0.6829984188079834, 'logits/rejected': -0.6394829750061035, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.028377532958984, 'margin_dpo/beta_margin_mean': 2.6028378009796143, 'margin_dpo/beta_margin_std': 2.508455276489258, 'margin_dpo/beta_margin_grad_mean': -0.18761104345321655, 'margin_dpo/beta_margin_grad_std': 0.17357850074768066, 'epoch': 0.34} 34%|██████████████████████████▍ | 231/681 [16:24<19:31, 2.60s/it] 34%|██████████████████████████▌ | 232/681 [16:27<19:55, 2.66s/it] {'loss': 0.6921, 'grad_norm': 70.08275604248047, 'learning_rate': 4.1842391091163933e-07, 'margin_dpo/margin_mean': 21.211563110351562, 'margin_dpo/margin_std': 22.4114933013916, 'logps/chosen': -84.29617309570312, 'logps/rejected': -118.73604583740234, 'logps/ref_chosen': -70.74876403808594, 'logps/ref_rejected': -83.97706604003906, 'logits/chosen': -0.6572903394699097, 'logits/rejected': -0.5994934439659119, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 21.211563110351562, 'margin_dpo/beta_margin_mean': 
2.1211562156677246, 'margin_dpo/beta_margin_std': 2.3135242462158203, 'margin_dpo/beta_margin_grad_mean': -0.2373134195804596, 'margin_dpo/beta_margin_grad_std': 0.24425449967384338, 'epoch': 0.34} 34%|██████████████████████████▌ | 232/681 [16:27<19:55, 2.66s/it] 34%|██████████████████████████▋ | 233/681 [16:30<20:07, 2.69s/it] {'loss': 0.5602, 'grad_norm': 61.6278076171875, 'learning_rate': 4.174733034541245e-07, 'margin_dpo/margin_mean': 27.884885787963867, 'margin_dpo/margin_std': 26.062320709228516, 'logps/chosen': -67.88652801513672, 'logps/rejected': -148.36856079101562, 'logps/ref_chosen': -54.8829345703125, 'logps/ref_rejected': -107.48007202148438, 'logits/chosen': -0.6890474557876587, 'logits/rejected': -0.6643567085266113, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.8848876953125, 'margin_dpo/beta_margin_mean': 2.7884888648986816, 'margin_dpo/beta_margin_std': 2.659421682357788, 'margin_dpo/beta_margin_grad_mean': -0.18775954842567444, 'margin_dpo/beta_margin_grad_std': 0.23140782117843628, 'epoch': 0.34} 34%|██████████████████████████▋ | 233/681 [16:30<20:07, 2.69s/it] 34%|██████████████████████████▊ | 234/681 [16:33<20:06, 2.70s/it] {'loss': 0.4561, 'grad_norm': 60.47285461425781, 'learning_rate': 4.165182829193126e-07, 'margin_dpo/margin_mean': 28.14922332763672, 'margin_dpo/margin_std': 21.847400665283203, 'logps/chosen': -54.90777587890625, 'logps/rejected': -138.9691162109375, 'logps/ref_chosen': -44.09451675415039, 'logps/ref_rejected': -100.00663757324219, 'logits/chosen': -0.6370252370834351, 'logits/rejected': -0.6381373405456543, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.14922523498535, 'margin_dpo/beta_margin_mean': 2.814922571182251, 'margin_dpo/beta_margin_std': 2.241670846939087, 'margin_dpo/beta_margin_grad_mean': -0.15120825171470642, 'margin_dpo/beta_margin_grad_std': 0.1929025799036026, 'epoch': 0.34} 34%|██████████████████████████▊ | 234/681 [16:33<20:06, 2.70s/it] 
35%|██████████████████████████▉ | 235/681 [16:35<19:30, 2.62s/it] {'loss': 0.5974, 'grad_norm': 63.21758270263672, 'learning_rate': 4.1555887447288255e-07, 'margin_dpo/margin_mean': 22.86014175415039, 'margin_dpo/margin_std': 22.919218063354492, 'logps/chosen': -77.54314422607422, 'logps/rejected': -128.5604248046875, 'logps/ref_chosen': -62.237911224365234, 'logps/ref_rejected': -90.39505767822266, 'logits/chosen': -0.6568065881729126, 'logits/rejected': -0.614643931388855, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 22.860143661499023, 'margin_dpo/beta_margin_mean': 2.2860143184661865, 'margin_dpo/beta_margin_std': 2.296712875366211, 'margin_dpo/beta_margin_grad_mean': -0.215665802359581, 'margin_dpo/beta_margin_grad_std': 0.2162242829799652, 'epoch': 0.35} 35%|██████████████████████████▉ | 235/681 [16:35<19:30, 2.62s/it] 35%|███████████████████████████ | 236/681 [16:38<19:41, 2.65s/it] {'loss': 0.5646, 'grad_norm': 65.25566864013672, 'learning_rate': 4.1459510339613946e-07, 'margin_dpo/margin_mean': 25.488582611083984, 'margin_dpo/margin_std': 23.585155487060547, 'logps/chosen': -60.41249084472656, 'logps/rejected': -140.07135009765625, 'logps/ref_chosen': -49.34136199951172, 'logps/ref_rejected': -103.51162719726562, 'logits/chosen': -0.6559075117111206, 'logits/rejected': -0.6537389159202576, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.488582611083984, 'margin_dpo/beta_margin_mean': 2.548858165740967, 'margin_dpo/beta_margin_std': 2.3670222759246826, 'margin_dpo/beta_margin_grad_mean': -0.19891716539859772, 'margin_dpo/beta_margin_grad_std': 0.22555947303771973, 'epoch': 0.35} 35%|███████████████████████████ | 236/681 [16:38<19:41, 2.65s/it] 35%|███████████████████████████▏ | 237/681 [16:41<19:49, 2.68s/it] {'loss': 0.5116, 'grad_norm': 48.03404235839844, 'learning_rate': 4.136269950853473e-07, 'margin_dpo/margin_mean': 27.52564811706543, 'margin_dpo/margin_std': 24.07387924194336, 'logps/chosen': 
-65.91875457763672, 'logps/rejected': -134.05665588378906, 'logps/ref_chosen': -54.168121337890625, 'logps/ref_rejected': -94.78036499023438, 'logits/chosen': -0.6702800989151001, 'logits/rejected': -0.636156439781189, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.52564811706543, 'margin_dpo/beta_margin_mean': 2.7525649070739746, 'margin_dpo/beta_margin_std': 2.4088451862335205, 'margin_dpo/beta_margin_grad_mean': -0.17360197007656097, 'margin_dpo/beta_margin_grad_std': 0.21111546456813812, 'epoch': 0.35} 35%|███████████████████████████▏ | 237/681 [16:41<19:49, 2.68s/it] 35%|███████████████████████████▎ | 238/681 [16:43<20:07, 2.72s/it] {'loss': 0.4407, 'grad_norm': 39.46758270263672, 'learning_rate': 4.126545750510605e-07, 'margin_dpo/margin_mean': 24.640880584716797, 'margin_dpo/margin_std': 20.111305236816406, 'logps/chosen': -64.94898986816406, 'logps/rejected': -125.03469848632812, 'logps/ref_chosen': -53.973121643066406, 'logps/ref_rejected': -89.41795349121094, 'logits/chosen': -0.6305921077728271, 'logits/rejected': -0.6234115958213806, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.640880584716797, 'margin_dpo/beta_margin_mean': 2.464088201522827, 'margin_dpo/beta_margin_std': 2.025897979736328, 'margin_dpo/beta_margin_grad_mean': -0.17112761735916138, 'margin_dpo/beta_margin_grad_std': 0.17932020127773285, 'epoch': 0.35} 35%|███████████████████████████▎ | 238/681 [16:43<20:07, 2.72s/it] 35%|███████████████████████████▎ | 239/681 [16:46<19:36, 2.66s/it] {'loss': 0.4436, 'grad_norm': 49.3748664855957, 'learning_rate': 4.116778689174514e-07, 'margin_dpo/margin_mean': 25.54654312133789, 'margin_dpo/margin_std': 19.89307975769043, 'logps/chosen': -70.67376708984375, 'logps/rejected': -131.71542358398438, 'logps/ref_chosen': -58.09782409667969, 'logps/ref_rejected': -93.59294128417969, 'logits/chosen': -0.7114957571029663, 'logits/rejected': -0.6843305826187134, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 25.54654312133789, 'margin_dpo/beta_margin_mean': 2.554654359817505, 'margin_dpo/beta_margin_std': 2.115661144256592, 'margin_dpo/beta_margin_grad_mean': -0.16616235673427582, 'margin_dpo/beta_margin_grad_std': 0.18609200417995453, 'epoch': 0.35} 35%|███████████████████████████▎ | 239/681 [16:46<19:36, 2.66s/it] 35%|███████████████████████████▍ | 240/681 [16:49<19:27, 2.65s/it] {'loss': 0.6257, 'grad_norm': 60.53359603881836, 'learning_rate': 4.106969024216348e-07, 'margin_dpo/margin_mean': 22.555252075195312, 'margin_dpo/margin_std': 20.787620544433594, 'logps/chosen': -73.52519226074219, 'logps/rejected': -109.58448791503906, 'logps/ref_chosen': -60.6144905090332, 'logps/ref_rejected': -74.1185302734375, 'logits/chosen': -0.6911687850952148, 'logits/rejected': -0.6599963903427124, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 22.555252075195312, 'margin_dpo/beta_margin_mean': 2.2555253505706787, 'margin_dpo/beta_margin_std': 2.1107068061828613, 'margin_dpo/beta_margin_grad_mean': -0.2058243602514267, 'margin_dpo/beta_margin_grad_std': 0.2343757450580597, 'epoch': 0.35} 35%|███████████████████████████▍ | 240/681 [16:49<19:27, 2.65s/it] 35%|███████████████████████████▌ | 241/681 [16:51<19:12, 2.62s/it] {'loss': 0.5099, 'grad_norm': 59.422630310058594, 'learning_rate': 4.097117014129903e-07, 'margin_dpo/margin_mean': 31.69961929321289, 'margin_dpo/margin_std': 29.628376007080078, 'logps/chosen': -76.52700805664062, 'logps/rejected': -130.19644165039062, 'logps/ref_chosen': -66.091064453125, 'logps/ref_rejected': -88.06088256835938, 'logits/chosen': -0.6552136540412903, 'logits/rejected': -0.6012428998947144, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.69961929321289, 'margin_dpo/beta_margin_mean': 3.169961929321289, 'margin_dpo/beta_margin_std': 3.125377655029297, 'margin_dpo/beta_margin_grad_mean': -0.1663733720779419, 'margin_dpo/beta_margin_grad_std': 0.22269202768802643, 
'epoch': 0.35} 35%|███████████████████████████▌ | 241/681 [16:51<19:12, 2.62s/it] 36%|███████████████████████████▋ | 242/681 [16:54<18:49, 2.57s/it] {'loss': 0.4934, 'grad_norm': 52.94541931152344, 'learning_rate': 4.087222918524807e-07, 'margin_dpo/margin_mean': 24.644126892089844, 'margin_dpo/margin_std': 21.64803123474121, 'logps/chosen': -79.44197845458984, 'logps/rejected': -119.58251190185547, 'logps/ref_chosen': -67.86392211914062, 'logps/ref_rejected': -83.36033630371094, 'logits/chosen': -0.6454315185546875, 'logits/rejected': -0.6136279702186584, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.644126892089844, 'margin_dpo/beta_margin_mean': 2.4644126892089844, 'margin_dpo/beta_margin_std': 2.184521198272705, 'margin_dpo/beta_margin_grad_mean': -0.1810285896062851, 'margin_dpo/beta_margin_grad_std': 0.19746260344982147, 'epoch': 0.36} 36%|███████████████████████████▋ | 242/681 [16:54<18:49, 2.57s/it] 36%|███████████████████████████▊ | 243/681 [16:56<18:41, 2.56s/it] {'loss': 0.3271, 'grad_norm': 34.107791900634766, 'learning_rate': 4.07728699811968e-07, 'margin_dpo/margin_mean': 29.49079132080078, 'margin_dpo/margin_std': 21.805618286132812, 'logps/chosen': -74.12469482421875, 'logps/rejected': -116.86687469482422, 'logps/ref_chosen': -63.08424377441406, 'logps/ref_rejected': -76.33563232421875, 'logits/chosen': -0.6725857257843018, 'logits/rejected': -0.6084048748016357, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.49078941345215, 'margin_dpo/beta_margin_mean': 2.9490790367126465, 'margin_dpo/beta_margin_std': 2.1849277019500732, 'margin_dpo/beta_margin_grad_mean': -0.13316328823566437, 'margin_dpo/beta_margin_grad_std': 0.1522829383611679, 'epoch': 0.36} 36%|███████████████████████████▊ | 243/681 [16:56<18:41, 2.56s/it] 36%|███████████████████████████▉ | 244/681 [16:59<18:40, 2.56s/it] {'loss': 0.4934, 'grad_norm': 42.87071228027344, 'learning_rate': 4.067309514735267e-07, 'margin_dpo/margin_mean': 
25.319198608398438, 'margin_dpo/margin_std': 21.36996078491211, 'logps/chosen': -71.2780990600586, 'logps/rejected': -130.34854125976562, 'logps/ref_chosen': -61.14069366455078, 'logps/ref_rejected': -94.89193725585938, 'logits/chosen': -0.6881895065307617, 'logits/rejected': -0.6778185367584229, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.319198608398438, 'margin_dpo/beta_margin_mean': 2.5319199562072754, 'margin_dpo/beta_margin_std': 2.150766372680664, 'margin_dpo/beta_margin_grad_mean': -0.18467824161052704, 'margin_dpo/beta_margin_grad_std': 0.20196670293807983, 'epoch': 0.36} 36%|███████████████████████████▉ | 244/681 [16:59<18:40, 2.56s/it] 36%|████████████████████████████ | 245/681 [17:01<19:06, 2.63s/it] {'loss': 0.5326, 'grad_norm': 74.6055679321289, 'learning_rate': 4.057290731287531e-07, 'margin_dpo/margin_mean': 26.900461196899414, 'margin_dpo/margin_std': 25.503192901611328, 'logps/chosen': -78.92977905273438, 'logps/rejected': -126.20805358886719, 'logps/ref_chosen': -67.26228332519531, 'logps/ref_rejected': -87.64010620117188, 'logits/chosen': -0.7033660411834717, 'logits/rejected': -0.6514378786087036, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.90045928955078, 'margin_dpo/beta_margin_mean': 2.6900460720062256, 'margin_dpo/beta_margin_std': 2.7254021167755127, 'margin_dpo/beta_margin_grad_mean': -0.1932898312807083, 'margin_dpo/beta_margin_grad_std': 0.2051628977060318, 'epoch': 0.36} 36%|████████████████████████████ | 245/681 [17:02<19:06, 2.63s/it] 36%|████████████████████████████▏ | 246/681 [17:04<19:01, 2.62s/it] {'loss': 0.5288, 'grad_norm': 56.00790023803711, 'learning_rate': 4.047230911780736e-07, 'margin_dpo/margin_mean': 23.103233337402344, 'margin_dpo/margin_std': 21.10454559326172, 'logps/chosen': -78.0211181640625, 'logps/rejected': -118.77372741699219, 'logps/ref_chosen': -66.69696807861328, 'logps/ref_rejected': -84.34634399414062, 'logits/chosen': -0.7089934945106506, 
'logits/rejected': -0.6705622673034668, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.103235244750977, 'margin_dpo/beta_margin_mean': 2.310323476791382, 'margin_dpo/beta_margin_std': 2.1114535331726074, 'margin_dpo/beta_margin_grad_mean': -0.1983981728553772, 'margin_dpo/beta_margin_grad_std': 0.19785138964653015, 'epoch': 0.36} 36%|████████████████████████████▏ | 246/681 [17:04<19:01, 2.62s/it] 36%|████████████████████████████▎ | 247/681 [17:07<18:42, 2.59s/it] {'loss': 0.4045, 'grad_norm': 41.90789031982422, 'learning_rate': 4.0371303213004814e-07, 'margin_dpo/margin_mean': 32.50457000732422, 'margin_dpo/margin_std': 25.436208724975586, 'logps/chosen': -68.0724868774414, 'logps/rejected': -150.26498413085938, 'logps/ref_chosen': -56.6053466796875, 'logps/ref_rejected': -106.29327392578125, 'logits/chosen': -0.7110755443572998, 'logits/rejected': -0.6894150972366333, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.50457000732422, 'margin_dpo/beta_margin_mean': 3.2504570484161377, 'margin_dpo/beta_margin_std': 2.560331344604492, 'margin_dpo/beta_margin_grad_mean': -0.14395716786384583, 'margin_dpo/beta_margin_grad_std': 0.19858244061470032, 'epoch': 0.36} 36%|████████████████████████████▎ | 247/681 [17:07<18:42, 2.59s/it] 36%|████████████████████████████▍ | 248/681 [17:09<18:36, 2.58s/it] {'loss': 0.4107, 'grad_norm': 42.92959213256836, 'learning_rate': 4.0269892260067197e-07, 'margin_dpo/margin_mean': 24.35607147216797, 'margin_dpo/margin_std': 19.101226806640625, 'logps/chosen': -54.540321350097656, 'logps/rejected': -126.71005249023438, 'logps/ref_chosen': -44.043216705322266, 'logps/ref_rejected': -91.85687255859375, 'logits/chosen': -0.6845219135284424, 'logits/rejected': -0.6683632135391235, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.35607147216797, 'margin_dpo/beta_margin_mean': 2.4356071949005127, 'margin_dpo/beta_margin_std': 1.9233942031860352, 
'margin_dpo/beta_margin_grad_mean': -0.16844278573989868, 'margin_dpo/beta_margin_grad_std': 0.15559379756450653, 'epoch': 0.36} 36%|████████████████████████████▍ | 248/681 [17:09<18:36, 2.58s/it] 37%|████████████████████████████▌ | 249/681 [17:12<18:50, 2.62s/it] {'loss': 0.6535, 'grad_norm': 59.13127517700195, 'learning_rate': 4.0168078931267426e-07, 'margin_dpo/margin_mean': 20.92279624938965, 'margin_dpo/margin_std': 20.69894790649414, 'logps/chosen': -74.95724487304688, 'logps/rejected': -113.90575408935547, 'logps/ref_chosen': -62.442352294921875, 'logps/ref_rejected': -80.46806335449219, 'logits/chosen': -0.7052150964736938, 'logits/rejected': -0.6669450998306274, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 20.92279624938965, 'margin_dpo/beta_margin_mean': 2.0922796726226807, 'margin_dpo/beta_margin_std': 2.0924148559570312, 'margin_dpo/beta_margin_grad_mean': -0.23194736242294312, 'margin_dpo/beta_margin_grad_std': 0.2308819442987442, 'epoch': 0.37} 37%|████████████████████████████▌ | 249/681 [17:12<18:50, 2.62s/it] 37%|████████████████████████████▋ | 250/681 [17:15<19:32, 2.72s/it] {'loss': 0.4359, 'grad_norm': 34.11585235595703, 'learning_rate': 4.006586590948141e-07, 'margin_dpo/margin_mean': 25.81413459777832, 'margin_dpo/margin_std': 18.1165828704834, 'logps/chosen': -74.52294158935547, 'logps/rejected': -108.57221221923828, 'logps/ref_chosen': -65.6366958618164, 'logps/ref_rejected': -73.87183380126953, 'logits/chosen': -0.6944586038589478, 'logits/rejected': -0.6244109272956848, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.81413459777832, 'margin_dpo/beta_margin_mean': 2.581413507461548, 'margin_dpo/beta_margin_std': 1.8444185256958008, 'margin_dpo/beta_margin_grad_mean': -0.1552935689687729, 'margin_dpo/beta_margin_grad_std': 0.19923508167266846, 'epoch': 0.37} 37%|████████████████████████████▋ | 250/681 [17:15<19:32, 2.72s/it] 37%|████████████████████████████▋ | 251/681 [17:17<18:53, 2.64s/it] 
{'loss': 0.4579, 'grad_norm': 44.37178039550781, 'learning_rate': 3.9963255888117325e-07, 'margin_dpo/margin_mean': 25.74152183532715, 'margin_dpo/margin_std': 20.57958984375, 'logps/chosen': -70.05519104003906, 'logps/rejected': -116.27742767333984, 'logps/ref_chosen': -57.182716369628906, 'logps/ref_rejected': -77.66343688964844, 'logits/chosen': -0.7029905319213867, 'logits/rejected': -0.6486064195632935, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.74152183532715, 'margin_dpo/beta_margin_mean': 2.5741522312164307, 'margin_dpo/beta_margin_std': 2.079760789871216, 'margin_dpo/beta_margin_grad_mean': -0.1774299144744873, 'margin_dpo/beta_margin_grad_std': 0.19099289178848267, 'epoch': 0.37} 37%|████████████████████████████▋ | 251/681 [17:17<18:53, 2.64s/it] 37%|████████████████████████████▊ | 252/681 [17:20<18:51, 2.64s/it] {'loss': 0.4309, 'grad_norm': 53.544761657714844, 'learning_rate': 3.9860251571044666e-07, 'margin_dpo/margin_mean': 25.683515548706055, 'margin_dpo/margin_std': 19.43787384033203, 'logps/chosen': -83.42109680175781, 'logps/rejected': -122.17694854736328, 'logps/ref_chosen': -71.68563842773438, 'logps/ref_rejected': -84.75798797607422, 'logits/chosen': -0.6703172326087952, 'logits/rejected': -0.62431800365448, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.683515548706055, 'margin_dpo/beta_margin_mean': 2.5683515071868896, 'margin_dpo/beta_margin_std': 1.9699735641479492, 'margin_dpo/beta_margin_grad_mean': -0.15842175483703613, 'margin_dpo/beta_margin_grad_std': 0.1866365671157837, 'epoch': 0.37} 37%|████████████████████████████▊ | 252/681 [17:20<18:51, 2.64s/it] 37%|████████████████████████████▉ | 253/681 [17:23<18:52, 2.65s/it] {'loss': 0.6253, 'grad_norm': 50.1516227722168, 'learning_rate': 3.9756855672522986e-07, 'margin_dpo/margin_mean': 24.096176147460938, 'margin_dpo/margin_std': 22.95254135131836, 'logps/chosen': -79.20399475097656, 'logps/rejected': -132.8687744140625, 
'logps/ref_chosen': -69.13392639160156, 'logps/ref_rejected': -98.70252990722656, 'logits/chosen': -0.6842066049575806, 'logits/rejected': -0.654214084148407, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.096176147460938, 'margin_dpo/beta_margin_mean': 2.4096176624298096, 'margin_dpo/beta_margin_std': 2.3356130123138428, 'margin_dpo/beta_margin_grad_mean': -0.2049117386341095, 'margin_dpo/beta_margin_grad_std': 0.23132526874542236, 'epoch': 0.37} 37%|████████████████████████████▉ | 253/681 [17:23<18:52, 2.65s/it] 37%|█████████████████████████████ | 254/681 [17:25<19:02, 2.68s/it] {'loss': 0.5557, 'grad_norm': 64.96926879882812, 'learning_rate': 3.965307091713037e-07, 'margin_dpo/margin_mean': 24.636829376220703, 'margin_dpo/margin_std': 22.574371337890625, 'logps/chosen': -64.82050323486328, 'logps/rejected': -125.6099853515625, 'logps/ref_chosen': -54.154998779296875, 'logps/ref_rejected': -90.30764770507812, 'logits/chosen': -0.7051047682762146, 'logits/rejected': -0.6575514078140259, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.636829376220703, 'margin_dpo/beta_margin_mean': 2.4636828899383545, 'margin_dpo/beta_margin_std': 2.2709455490112305, 'margin_dpo/beta_margin_grad_mean': -0.19986601173877716, 'margin_dpo/beta_margin_grad_std': 0.2192426323890686, 'epoch': 0.37} 37%|█████████████████████████████ | 254/681 [17:25<19:02, 2.68s/it] 37%|█████████████████████████████▏ | 255/681 [17:28<19:01, 2.68s/it] {'loss': 0.6594, 'grad_norm': 66.39599609375, 'learning_rate': 3.954890003969163e-07, 'margin_dpo/margin_mean': 27.34168243408203, 'margin_dpo/margin_std': 28.012739181518555, 'logps/chosen': -70.39166259765625, 'logps/rejected': -130.80026245117188, 'logps/ref_chosen': -57.14167022705078, 'logps/ref_rejected': -90.2085952758789, 'logits/chosen': -0.7102745771408081, 'logits/rejected': -0.6797518730163574, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.34168243408203, 
'margin_dpo/beta_margin_mean': 2.734168291091919, 'margin_dpo/beta_margin_std': 2.8561336994171143, 'margin_dpo/beta_margin_grad_mean': -0.1914447396993637, 'margin_dpo/beta_margin_grad_std': 0.22745780646800995, 'epoch': 0.37} 37%|█████████████████████████████▏ | 255/681 [17:28<19:01, 2.68s/it] 38%|█████████████████████████████▎ | 256/681 [17:31<19:07, 2.70s/it] {'loss': 0.5121, 'grad_norm': 58.85321807861328, 'learning_rate': 3.944434578520628e-07, 'margin_dpo/margin_mean': 27.34583854675293, 'margin_dpo/margin_std': 25.045982360839844, 'logps/chosen': -68.35701751708984, 'logps/rejected': -133.102294921875, 'logps/ref_chosen': -55.163490295410156, 'logps/ref_rejected': -92.56291961669922, 'logits/chosen': -0.6565215587615967, 'logits/rejected': -0.6265472769737244, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.34583854675293, 'margin_dpo/beta_margin_mean': 2.734584093093872, 'margin_dpo/beta_margin_std': 2.515676498413086, 'margin_dpo/beta_margin_grad_mean': -0.17351847887039185, 'margin_dpo/beta_margin_grad_std': 0.20426101982593536, 'epoch': 0.38} 38%|█████████████████████████████▎ | 256/681 [17:31<19:07, 2.70s/it] 38%|█████████████████████████████▍ | 257/681 [17:33<19:00, 2.69s/it] {'loss': 0.5015, 'grad_norm': 45.65961456298828, 'learning_rate': 3.933941090877615e-07, 'margin_dpo/margin_mean': 30.103281021118164, 'margin_dpo/margin_std': 25.718414306640625, 'logps/chosen': -61.90161895751953, 'logps/rejected': -122.11911010742188, 'logps/ref_chosen': -49.4236946105957, 'logps/ref_rejected': -79.53791809082031, 'logits/chosen': -0.6682260036468506, 'logits/rejected': -0.6451402902603149, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.103281021118164, 'margin_dpo/beta_margin_mean': 3.0103280544281006, 'margin_dpo/beta_margin_std': 2.688237190246582, 'margin_dpo/beta_margin_grad_mean': -0.17995263636112213, 'margin_dpo/beta_margin_grad_std': 0.22144293785095215, 'epoch': 0.38} 
38%|█████████████████████████████▍ | 257/681 [17:33<19:00, 2.69s/it] 38%|█████████████████████████████▌ | 258/681 [17:36<18:07, 2.57s/it] {'loss': 0.7407, 'grad_norm': 90.31965637207031, 'learning_rate': 3.923409817553284e-07, 'margin_dpo/margin_mean': 26.426227569580078, 'margin_dpo/margin_std': 27.302228927612305, 'logps/chosen': -75.35392761230469, 'logps/rejected': -138.38613891601562, 'logps/ref_chosen': -59.384124755859375, 'logps/ref_rejected': -95.9901123046875, 'logits/chosen': -0.6991258263587952, 'logits/rejected': -0.669155478477478, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.426227569580078, 'margin_dpo/beta_margin_mean': 2.642622947692871, 'margin_dpo/beta_margin_std': 2.7432861328125, 'margin_dpo/beta_margin_grad_mean': -0.2009788304567337, 'margin_dpo/beta_margin_grad_std': 0.24991276860237122, 'epoch': 0.38} 38%|█████████████████████████████▌ | 258/681 [17:36<18:07, 2.57s/it] 38%|█████████████████████████████▋ | 259/681 [17:39<18:43, 2.66s/it] {'loss': 0.5311, 'grad_norm': 54.30337142944336, 'learning_rate': 3.9128410360564793e-07, 'margin_dpo/margin_mean': 23.94602394104004, 'margin_dpo/margin_std': 20.407352447509766, 'logps/chosen': -67.30290222167969, 'logps/rejected': -127.61224365234375, 'logps/ref_chosen': -52.828346252441406, 'logps/ref_rejected': -89.19165802001953, 'logits/chosen': -0.6342747211456299, 'logits/rejected': -0.6089296340942383, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.946025848388672, 'margin_dpo/beta_margin_mean': 2.3946025371551514, 'margin_dpo/beta_margin_std': 2.1142630577087402, 'margin_dpo/beta_margin_grad_mean': -0.1905156373977661, 'margin_dpo/beta_margin_grad_std': 0.20092925429344177, 'epoch': 0.38} 38%|█████████████████████████████▋ | 259/681 [17:39<18:43, 2.66s/it] 38%|█████████████████████████████▊ | 260/681 [17:41<18:48, 2.68s/it] {'loss': 0.5057, 'grad_norm': 60.32538604736328, 'learning_rate': 3.9022350248844246e-07, 'margin_dpo/margin_mean': 
27.156875610351562, 'margin_dpo/margin_std': 25.030288696289062, 'logps/chosen': -62.85065460205078, 'logps/rejected': -137.6796417236328, 'logps/ref_chosen': -47.41767501831055, 'logps/ref_rejected': -95.08979034423828, 'logits/chosen': -0.6234908103942871, 'logits/rejected': -0.6234794855117798, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.156875610351562, 'margin_dpo/beta_margin_mean': 2.7156875133514404, 'margin_dpo/beta_margin_std': 2.5070676803588867, 'margin_dpo/beta_margin_grad_mean': -0.18406428396701813, 'margin_dpo/beta_margin_grad_std': 0.20967774093151093, 'epoch': 0.38} 38%|█████████████████████████████▊ | 260/681 [17:41<18:48, 2.68s/it] 38%|█████████████████████████████▉ | 261/681 [17:44<18:06, 2.59s/it] {'loss': 0.4748, 'grad_norm': 47.035186767578125, 'learning_rate': 3.891592063515376e-07, 'margin_dpo/margin_mean': 28.690322875976562, 'margin_dpo/margin_std': 26.378036499023438, 'logps/chosen': -65.26475524902344, 'logps/rejected': -129.43865966796875, 'logps/ref_chosen': -53.03137969970703, 'logps/ref_rejected': -88.51494598388672, 'logits/chosen': -0.6528719067573547, 'logits/rejected': -0.6170308589935303, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.690322875976562, 'margin_dpo/beta_margin_mean': 2.869032144546509, 'margin_dpo/beta_margin_std': 2.6597743034362793, 'margin_dpo/beta_margin_grad_mean': -0.17383930087089539, 'margin_dpo/beta_margin_grad_std': 0.20882652699947357, 'epoch': 0.38} 38%|█████████████████████████████▉ | 261/681 [17:44<18:06, 2.59s/it] 38%|██████████████████████████████ | 262/681 [17:46<17:56, 2.57s/it] {'loss': 0.5286, 'grad_norm': 65.2990493774414, 'learning_rate': 3.880912432401264e-07, 'margin_dpo/margin_mean': 25.835269927978516, 'margin_dpo/margin_std': 21.91771697998047, 'logps/chosen': -74.31780242919922, 'logps/rejected': -126.95146179199219, 'logps/ref_chosen': -59.620140075683594, 'logps/ref_rejected': -86.41853332519531, 'logits/chosen': 
-0.6476384401321411, 'logits/rejected': -0.601101279258728, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.835268020629883, 'margin_dpo/beta_margin_mean': 2.583526849746704, 'margin_dpo/beta_margin_std': 2.1952381134033203, 'margin_dpo/beta_margin_grad_mean': -0.1703178882598877, 'margin_dpo/beta_margin_grad_std': 0.2238461673259735, 'epoch': 0.38} 38%|██████████████████████████████ | 262/681 [17:46<17:56, 2.57s/it] 39%|██████████████████████████████ | 263/681 [17:49<18:24, 2.64s/it] {'loss': 0.4332, 'grad_norm': 63.93273162841797, 'learning_rate': 3.870196412960302e-07, 'margin_dpo/margin_mean': 30.601646423339844, 'margin_dpo/margin_std': 26.212867736816406, 'logps/chosen': -71.28265380859375, 'logps/rejected': -139.320556640625, 'logps/ref_chosen': -59.42094421386719, 'logps/ref_rejected': -96.85720825195312, 'logits/chosen': -0.6848942041397095, 'logits/rejected': -0.6289730072021484, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.601646423339844, 'margin_dpo/beta_margin_mean': 3.0601646900177, 'margin_dpo/beta_margin_std': 2.6217379570007324, 'margin_dpo/beta_margin_grad_mean': -0.16087834537029266, 'margin_dpo/beta_margin_grad_std': 0.1975705325603485, 'epoch': 0.39} 39%|██████████████████████████████ | 263/681 [17:49<18:24, 2.64s/it] 39%|██████████████████████████████▏ | 264/681 [17:52<18:32, 2.67s/it] {'loss': 0.5449, 'grad_norm': 65.42985534667969, 'learning_rate': 3.8594442875695665e-07, 'margin_dpo/margin_mean': 24.097835540771484, 'margin_dpo/margin_std': 21.762907028198242, 'logps/chosen': -76.15332794189453, 'logps/rejected': -131.38528442382812, 'logps/ref_chosen': -62.722084045410156, 'logps/ref_rejected': -93.85621643066406, 'logits/chosen': -0.6409514546394348, 'logits/rejected': -0.6121193766593933, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.09783363342285, 'margin_dpo/beta_margin_mean': 2.409783363342285, 'margin_dpo/beta_margin_std': 2.2144694328308105, 
'margin_dpo/beta_margin_grad_mean': -0.19213639199733734, 'margin_dpo/beta_margin_grad_std': 0.20564202964305878, 'epoch': 0.39} 39%|██████████████████████████████▏ | 264/681 [17:52<18:32, 2.67s/it] 39%|██████████████████████████████▎ | 265/681 [17:54<18:26, 2.66s/it] {'loss': 0.5823, 'grad_norm': 73.85417938232422, 'learning_rate': 3.848656339557562e-07, 'margin_dpo/margin_mean': 25.292123794555664, 'margin_dpo/margin_std': 25.746461868286133, 'logps/chosen': -76.27545928955078, 'logps/rejected': -127.61671447753906, 'logps/ref_chosen': -61.971466064453125, 'logps/ref_rejected': -88.02059936523438, 'logits/chosen': -0.6545775532722473, 'logits/rejected': -0.6242020130157471, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.292123794555664, 'margin_dpo/beta_margin_mean': 2.529212474822998, 'margin_dpo/beta_margin_std': 2.583730459213257, 'margin_dpo/beta_margin_grad_mean': -0.20656327903270721, 'margin_dpo/beta_margin_grad_std': 0.2223885953426361, 'epoch': 0.39} 39%|██████████████████████████████▎ | 265/681 [17:54<18:26, 2.66s/it] 39%|██████████████████████████████▍ | 266/681 [17:57<18:06, 2.62s/it] {'loss': 0.5648, 'grad_norm': 57.55160903930664, 'learning_rate': 3.8378328531967507e-07, 'margin_dpo/margin_mean': 24.757904052734375, 'margin_dpo/margin_std': 22.62070083618164, 'logps/chosen': -80.7601547241211, 'logps/rejected': -106.38961791992188, 'logps/ref_chosen': -67.09967041015625, 'logps/ref_rejected': -67.97122192382812, 'logits/chosen': -0.6736335754394531, 'logits/rejected': -0.6081231832504272, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.757902145385742, 'margin_dpo/beta_margin_mean': 2.47579026222229, 'margin_dpo/beta_margin_std': 2.267148017883301, 'margin_dpo/beta_margin_grad_mean': -0.20071399211883545, 'margin_dpo/beta_margin_grad_std': 0.22172965109348297, 'epoch': 0.39} 39%|██████████████████████████████▍ | 266/681 [17:57<18:06, 2.62s/it] 39%|██████████████████████████████▌ | 267/681 
[18:00<18:01, 2.61s/it] {'loss': 0.4124, 'grad_norm': 53.11775588989258, 'learning_rate': 3.8269741136960646e-07, 'margin_dpo/margin_mean': 27.410789489746094, 'margin_dpo/margin_std': 22.08306884765625, 'logps/chosen': -82.08721923828125, 'logps/rejected': -130.69570922851562, 'logps/ref_chosen': -68.97074890136719, 'logps/ref_rejected': -90.16844940185547, 'logits/chosen': -0.6374738216400146, 'logits/rejected': -0.5902992486953735, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.410789489746094, 'margin_dpo/beta_margin_mean': 2.7410788536071777, 'margin_dpo/beta_margin_std': 2.210850954055786, 'margin_dpo/beta_margin_grad_mean': -0.15888135135173798, 'margin_dpo/beta_margin_grad_std': 0.18455727398395538, 'epoch': 0.39} 39%|██████████████████████████████▌ | 267/681 [18:00<18:01, 2.61s/it] 39%|██████████████████████████████▋ | 268/681 [18:02<18:18, 2.66s/it] {'loss': 0.4971, 'grad_norm': 62.39994812011719, 'learning_rate': 3.8160804071933894e-07, 'margin_dpo/margin_mean': 25.404464721679688, 'margin_dpo/margin_std': 21.592742919921875, 'logps/chosen': -68.46856689453125, 'logps/rejected': -139.620361328125, 'logps/ref_chosen': -55.900306701660156, 'logps/ref_rejected': -101.64763641357422, 'logits/chosen': -0.6283696293830872, 'logits/rejected': -0.6117571592330933, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.404462814331055, 'margin_dpo/beta_margin_mean': 2.5404462814331055, 'margin_dpo/beta_margin_std': 2.1823596954345703, 'margin_dpo/beta_margin_grad_mean': -0.1794806867837906, 'margin_dpo/beta_margin_grad_std': 0.21003910899162292, 'epoch': 0.39} 39%|██████████████████████████████▋ | 268/681 [18:02<18:18, 2.66s/it] 40%|██████████████████████████████▊ | 269/681 [18:05<17:54, 2.61s/it] {'loss': 0.4067, 'grad_norm': 53.5538330078125, 'learning_rate': 3.8051520207480204e-07, 'margin_dpo/margin_mean': 32.81420135498047, 'margin_dpo/margin_std': 23.582063674926758, 'logps/chosen': -82.96507263183594, 
'logps/rejected': -153.08908081054688, 'logps/ref_chosen': -70.03955078125, 'logps/ref_rejected': -107.34937286376953, 'logits/chosen': -0.6579411029815674, 'logits/rejected': -0.6127967238426208, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.81420135498047, 'margin_dpo/beta_margin_mean': 3.2814202308654785, 'margin_dpo/beta_margin_std': 2.366851329803467, 'margin_dpo/beta_margin_grad_mean': -0.14487290382385254, 'margin_dpo/beta_margin_grad_std': 0.21411198377609253, 'epoch': 0.4} 40%|██████████████████████████████▊ | 269/681 [18:05<17:54, 2.61s/it] 40%|██████████████████████████████▉ | 270/681 [18:07<18:06, 2.64s/it] {'loss': 0.5061, 'grad_norm': 41.36155319213867, 'learning_rate': 3.794189242333106e-07, 'margin_dpo/margin_mean': 25.06671142578125, 'margin_dpo/margin_std': 22.077110290527344, 'logps/chosen': -80.42156219482422, 'logps/rejected': -145.8834228515625, 'logps/ref_chosen': -69.53347778320312, 'logps/ref_rejected': -109.92864990234375, 'logits/chosen': -0.6722906827926636, 'logits/rejected': -0.6524355411529541, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.06671142578125, 'margin_dpo/beta_margin_mean': 2.506671190261841, 'margin_dpo/beta_margin_std': 2.224198341369629, 'margin_dpo/beta_margin_grad_mean': -0.18635544180870056, 'margin_dpo/beta_margin_grad_std': 0.20361123979091644, 'epoch': 0.4} 40%|██████████████████████████████▉ | 270/681 [18:08<18:06, 2.64s/it] 40%|███████████████████████████████ | 271/681 [18:10<17:40, 2.59s/it] {'loss': 0.534, 'grad_norm': 51.65666961669922, 'learning_rate': 3.7831923608280514e-07, 'margin_dpo/margin_mean': 25.931888580322266, 'margin_dpo/margin_std': 23.307266235351562, 'logps/chosen': -71.0101318359375, 'logps/rejected': -132.6912841796875, 'logps/ref_chosen': -56.76457214355469, 'logps/ref_rejected': -92.51383209228516, 'logits/chosen': -0.6164276599884033, 'logits/rejected': -0.5750702619552612, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 25.9318904876709, 'margin_dpo/beta_margin_mean': 2.593189001083374, 'margin_dpo/beta_margin_std': 2.331104040145874, 'margin_dpo/beta_margin_grad_mean': -0.18699227273464203, 'margin_dpo/beta_margin_grad_std': 0.21381914615631104, 'epoch': 0.4} 40%|███████████████████████████████ | 271/681 [18:10<17:40, 2.59s/it] 40%|███████████████████████████████▏ | 272/681 [18:13<18:01, 2.64s/it] {'loss': 0.5359, 'grad_norm': 52.00728225708008, 'learning_rate': 3.772161666010912e-07, 'margin_dpo/margin_mean': 31.5368595123291, 'margin_dpo/margin_std': 27.00173568725586, 'logps/chosen': -62.51170349121094, 'logps/rejected': -150.09420776367188, 'logps/ref_chosen': -49.49715805053711, 'logps/ref_rejected': -105.54279327392578, 'logits/chosen': -0.6098858714103699, 'logits/rejected': -0.5980672836303711, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.5368595123291, 'margin_dpo/beta_margin_mean': 3.153686046600342, 'margin_dpo/beta_margin_std': 2.745887041091919, 'margin_dpo/beta_margin_grad_mean': -0.17077092826366425, 'margin_dpo/beta_margin_grad_std': 0.23799988627433777, 'epoch': 0.4} 40%|███████████████████████████████▏ | 272/681 [18:13<18:01, 2.64s/it] 40%|███████████████████████████████▎ | 273/681 [18:15<17:28, 2.57s/it] {'loss': 0.4627, 'grad_norm': 59.0302848815918, 'learning_rate': 3.761097448550755e-07, 'margin_dpo/margin_mean': 30.293479919433594, 'margin_dpo/margin_std': 24.974472045898438, 'logps/chosen': -77.9120864868164, 'logps/rejected': -137.728759765625, 'logps/ref_chosen': -62.97539520263672, 'logps/ref_rejected': -92.49858093261719, 'logits/chosen': -0.5825521945953369, 'logits/rejected': -0.5468716025352478, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.29347801208496, 'margin_dpo/beta_margin_mean': 3.0293478965759277, 'margin_dpo/beta_margin_std': 2.5976314544677734, 'margin_dpo/beta_margin_grad_mean': -0.16412891447544098, 'margin_dpo/beta_margin_grad_std': 
0.2039998471736908, 'epoch': 0.4} 40%|███████████████████████████████▎ | 273/681 [18:15<17:28, 2.57s/it] 40%|███████████████████████████████▍ | 274/681 [18:18<17:41, 2.61s/it] {'loss': 0.5193, 'grad_norm': 55.06562423706055, 'learning_rate': 3.75e-07, 'margin_dpo/margin_mean': 26.310407638549805, 'margin_dpo/margin_std': 23.162757873535156, 'logps/chosen': -71.956298828125, 'logps/rejected': -119.93206787109375, 'logps/ref_chosen': -55.66770935058594, 'logps/ref_rejected': -77.33308410644531, 'logits/chosen': -0.6257538199424744, 'logits/rejected': -0.5888440608978271, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.310407638549805, 'margin_dpo/beta_margin_mean': 2.6310408115386963, 'margin_dpo/beta_margin_std': 2.3380789756774902, 'margin_dpo/beta_margin_grad_mean': -0.1804315745830536, 'margin_dpo/beta_margin_grad_std': 0.2112412303686142, 'epoch': 0.4} 40%|███████████████████████████████▍ | 274/681 [18:18<17:41, 2.61s/it] 40%|███████████████████████████████▍ | 275/681 [18:21<18:25, 2.72s/it] {'loss': 0.4719, 'grad_norm': 64.80329895019531, 'learning_rate': 3.738869612786737e-07, 'margin_dpo/margin_mean': 27.70269775390625, 'margin_dpo/margin_std': 24.364110946655273, 'logps/chosen': -60.017059326171875, 'logps/rejected': -132.4287567138672, 'logps/ref_chosen': -48.594703674316406, 'logps/ref_rejected': -93.30369567871094, 'logits/chosen': -0.657637894153595, 'logits/rejected': -0.6397134065628052, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.70269775390625, 'margin_dpo/beta_margin_mean': 2.7702696323394775, 'margin_dpo/beta_margin_std': 2.443610906600952, 'margin_dpo/beta_margin_grad_mean': -0.17478503286838531, 'margin_dpo/beta_margin_grad_std': 0.20220105350017548, 'epoch': 0.4} 40%|███████████████████████████████▍ | 275/681 [18:21<18:25, 2.72s/it] 41%|███████████████████████████████▌ | 276/681 [18:24<18:23, 2.73s/it] {'loss': 0.5956, 'grad_norm': 62.004940032958984, 'learning_rate': 
3.7277065802070204e-07, 'margin_dpo/margin_mean': 25.469280242919922, 'margin_dpo/margin_std': 24.609455108642578, 'logps/chosen': -70.30280303955078, 'logps/rejected': -109.56034851074219, 'logps/ref_chosen': -56.57740783691406, 'logps/ref_rejected': -70.36566925048828, 'logits/chosen': -0.6588333249092102, 'logits/rejected': -0.6178128719329834, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.469282150268555, 'margin_dpo/beta_margin_mean': 2.5469281673431396, 'margin_dpo/beta_margin_std': 2.4726295471191406, 'margin_dpo/beta_margin_grad_mean': -0.21022818982601166, 'margin_dpo/beta_margin_grad_std': 0.23304350674152374, 'epoch': 0.41} 41%|███████████████████████████████▌ | 276/681 [18:24<18:23, 2.73s/it] 41%|███████████████████████████████▋ | 277/681 [18:26<17:28, 2.60s/it] {'loss': 0.4262, 'grad_norm': 41.01213836669922, 'learning_rate': 3.71651119641714e-07, 'margin_dpo/margin_mean': 24.34532928466797, 'margin_dpo/margin_std': 18.586042404174805, 'logps/chosen': -68.6185302734375, 'logps/rejected': -129.5735626220703, 'logps/ref_chosen': -56.27156066894531, 'logps/ref_rejected': -92.88127136230469, 'logits/chosen': -0.6487230658531189, 'logits/rejected': -0.6121164560317993, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.345327377319336, 'margin_dpo/beta_margin_mean': 2.434532880783081, 'margin_dpo/beta_margin_std': 1.9181517362594604, 'margin_dpo/beta_margin_grad_mean': -0.17126400768756866, 'margin_dpo/beta_margin_grad_std': 0.16744239628314972, 'epoch': 0.41} 41%|███████████████████████████████▋ | 277/681 [18:26<17:28, 2.60s/it] 41%|███████████████████████████████▊ | 278/681 [18:28<17:22, 2.59s/it] {'loss': 0.482, 'grad_norm': 45.832942962646484, 'learning_rate': 3.705283756425872e-07, 'margin_dpo/margin_mean': 29.894882202148438, 'margin_dpo/margin_std': 26.599227905273438, 'logps/chosen': -64.33135986328125, 'logps/rejected': -132.53787231445312, 'logps/ref_chosen': -52.94194030761719, 
'logps/ref_rejected': -91.25357818603516, 'logits/chosen': -0.6533815860748291, 'logits/rejected': -0.6426759958267212, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.894880294799805, 'margin_dpo/beta_margin_mean': 2.989488124847412, 'margin_dpo/beta_margin_std': 2.6637423038482666, 'margin_dpo/beta_margin_grad_mean': -0.17872436344623566, 'margin_dpo/beta_margin_grad_std': 0.21186299622058868, 'epoch': 0.41} 41%|███████████████████████████████▊ | 278/681 [18:28<17:22, 2.59s/it] 41%|███████████████████████████████▉ | 279/681 [18:31<17:16, 2.58s/it] {'loss': 0.488, 'grad_norm': 55.28075408935547, 'learning_rate': 3.6940245560867e-07, 'margin_dpo/margin_mean': 29.428203582763672, 'margin_dpo/margin_std': 24.257051467895508, 'logps/chosen': -60.90464782714844, 'logps/rejected': -129.54296875, 'logps/ref_chosen': -48.641319274902344, 'logps/ref_rejected': -87.8514404296875, 'logits/chosen': -0.6621390581130981, 'logits/rejected': -0.6348008513450623, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.428205490112305, 'margin_dpo/beta_margin_mean': 2.9428205490112305, 'margin_dpo/beta_margin_std': 2.446150302886963, 'margin_dpo/beta_margin_grad_mean': -0.17444436252117157, 'margin_dpo/beta_margin_grad_std': 0.2196406126022339, 'epoch': 0.41} 41%|███████████████████████████████▉ | 279/681 [18:31<17:16, 2.58s/it] 41%|████████████████████████████████ | 280/681 [18:34<17:23, 2.60s/it] {'loss': 0.3483, 'grad_norm': 38.125099182128906, 'learning_rate': 3.6827338920900253e-07, 'margin_dpo/margin_mean': 28.934520721435547, 'margin_dpo/margin_std': 18.611377716064453, 'logps/chosen': -72.28899383544922, 'logps/rejected': -141.04525756835938, 'logps/ref_chosen': -58.797122955322266, 'logps/ref_rejected': -98.61885070800781, 'logits/chosen': -0.6191302537918091, 'logits/rejected': -0.6008737683296204, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.934520721435547, 'margin_dpo/beta_margin_mean': 
2.8934521675109863, 'margin_dpo/beta_margin_std': 1.8741562366485596, 'margin_dpo/beta_margin_grad_mean': -0.133779838681221, 'margin_dpo/beta_margin_grad_std': 0.18086881935596466, 'epoch': 0.41} 41%|████████████████████████████████ | 280/681 [18:34<17:23, 2.60s/it] 41%|████████████████████████████████▏ | 281/681 [18:36<17:46, 2.67s/it] {'loss': 0.483, 'grad_norm': 64.53363037109375, 'learning_rate': 3.6714120619553435e-07, 'margin_dpo/margin_mean': 25.293540954589844, 'margin_dpo/margin_std': 20.025854110717773, 'logps/chosen': -67.85418701171875, 'logps/rejected': -118.54179382324219, 'logps/ref_chosen': -55.488521575927734, 'logps/ref_rejected': -80.88258361816406, 'logits/chosen': -0.665095329284668, 'logits/rejected': -0.62502121925354, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.293540954589844, 'margin_dpo/beta_margin_mean': 2.5293540954589844, 'margin_dpo/beta_margin_std': 2.0211899280548096, 'margin_dpo/beta_margin_grad_mean': -0.15846484899520874, 'margin_dpo/beta_margin_grad_std': 0.1876082420349121, 'epoch': 0.41} 41%|████████████████████████████████▏ | 281/681 [18:36<17:46, 2.67s/it] 41%|████████████████████████████████▎ | 282/681 [18:39<17:26, 2.62s/it] {'loss': 0.475, 'grad_norm': 50.1074333190918, 'learning_rate': 3.660059364023408e-07, 'margin_dpo/margin_mean': 23.41071128845215, 'margin_dpo/margin_std': 20.858131408691406, 'logps/chosen': -85.81141662597656, 'logps/rejected': -131.50296020507812, 'logps/ref_chosen': -73.07014465332031, 'logps/ref_rejected': -95.35098266601562, 'logits/chosen': -0.6407305002212524, 'logits/rejected': -0.593590497970581, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.41071128845215, 'margin_dpo/beta_margin_mean': 2.341071128845215, 'margin_dpo/beta_margin_std': 2.0938100814819336, 'margin_dpo/beta_margin_grad_mean': -0.18391168117523193, 'margin_dpo/beta_margin_grad_std': 0.18578048050403595, 'epoch': 0.41} 41%|████████████████████████████████▎ | 282/681 
[18:39<17:26, 2.62s/it] 42%|████████████████████████████████▍ | 283/681 [18:42<17:21, 2.62s/it] {'loss': 0.4753, 'grad_norm': 48.29468536376953, 'learning_rate': 3.6486760974483685e-07, 'margin_dpo/margin_mean': 28.113353729248047, 'margin_dpo/margin_std': 23.463539123535156, 'logps/chosen': -74.30840301513672, 'logps/rejected': -137.50985717773438, 'logps/ref_chosen': -61.89844512939453, 'logps/ref_rejected': -96.98655700683594, 'logits/chosen': -0.6420848369598389, 'logits/rejected': -0.6138025522232056, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.113353729248047, 'margin_dpo/beta_margin_mean': 2.811335563659668, 'margin_dpo/beta_margin_std': 2.354823589324951, 'margin_dpo/beta_margin_grad_mean': -0.1640281230211258, 'margin_dpo/beta_margin_grad_std': 0.21006377041339874, 'epoch': 0.42} 42%|████████████████████████████████▍ | 283/681 [18:42<17:21, 2.62s/it] 42%|████████████████████████████████▌ | 284/681 [18:44<17:39, 2.67s/it] {'loss': 0.4108, 'grad_norm': 43.3529167175293, 'learning_rate': 3.6372625621898863e-07, 'margin_dpo/margin_mean': 29.832664489746094, 'margin_dpo/margin_std': 25.724153518676758, 'logps/chosen': -72.13871765136719, 'logps/rejected': -137.00511169433594, 'logps/ref_chosen': -58.4355354309082, 'logps/ref_rejected': -93.46926879882812, 'logits/chosen': -0.6275640726089478, 'logits/rejected': -0.6143908500671387, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.832664489746094, 'margin_dpo/beta_margin_mean': 2.983266592025757, 'margin_dpo/beta_margin_std': 2.578672170639038, 'margin_dpo/beta_margin_grad_mean': -0.15540792047977448, 'margin_dpo/beta_margin_grad_std': 0.1845344454050064, 'epoch': 0.42} 42%|████████████████████████████████▌ | 284/681 [18:44<17:39, 2.67s/it] 42%|████████████████████████████████▋ | 285/681 [18:47<17:23, 2.64s/it] {'loss': 0.4257, 'grad_norm': 57.101165771484375, 'learning_rate': 3.625819059005228e-07, 'margin_dpo/margin_mean': 26.457977294921875, 
'margin_dpo/margin_std': 20.855016708374023, 'logps/chosen': -81.82306671142578, 'logps/rejected': -141.17568969726562, 'logps/ref_chosen': -66.2322006225586, 'logps/ref_rejected': -99.1268310546875, 'logits/chosen': -0.6859316825866699, 'logits/rejected': -0.6596359014511108, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.457977294921875, 'margin_dpo/beta_margin_mean': 2.6457977294921875, 'margin_dpo/beta_margin_std': 2.1147918701171875, 'margin_dpo/beta_margin_grad_mean': -0.16400812566280365, 'margin_dpo/beta_margin_grad_std': 0.18523728847503662, 'epoch': 0.42} 42%|████████████████████████████████▋ | 285/681 [18:47<17:23, 2.64s/it] 42%|████████████████████████████████▊ | 286/681 [18:50<17:24, 2.65s/it] {'loss': 0.5505, 'grad_norm': 58.981807708740234, 'learning_rate': 3.614345889441346e-07, 'margin_dpo/margin_mean': 27.725364685058594, 'margin_dpo/margin_std': 25.239097595214844, 'logps/chosen': -86.8876724243164, 'logps/rejected': -130.25048828125, 'logps/ref_chosen': -72.95100402832031, 'logps/ref_rejected': -88.58845520019531, 'logits/chosen': -0.6508222222328186, 'logits/rejected': -0.6174975633621216, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.725364685058594, 'margin_dpo/beta_margin_mean': 2.772536516189575, 'margin_dpo/beta_margin_std': 2.563308000564575, 'margin_dpo/beta_margin_grad_mean': -0.18501420319080353, 'margin_dpo/beta_margin_grad_std': 0.22696195542812347, 'epoch': 0.42} 42%|████████████████████████████████▊ | 286/681 [18:50<17:24, 2.65s/it] 42%|████████████████████████████████▊ | 287/681 [18:52<16:38, 2.53s/it] {'loss': 0.534, 'grad_norm': 52.582481384277344, 'learning_rate': 3.6028433558269275e-07, 'margin_dpo/margin_mean': 26.86972427368164, 'margin_dpo/margin_std': 25.95490264892578, 'logps/chosen': -75.86917114257812, 'logps/rejected': -118.89381408691406, 'logps/ref_chosen': -61.54115295410156, 'logps/ref_rejected': -77.6960678100586, 'logits/chosen': -0.658734142780304, 
'logits/rejected': -0.6133627891540527, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.869726181030273, 'margin_dpo/beta_margin_mean': 2.6869726181030273, 'margin_dpo/beta_margin_std': 2.597897529602051, 'margin_dpo/beta_margin_grad_mean': -0.1924961507320404, 'margin_dpo/beta_margin_grad_std': 0.21066516637802124, 'epoch': 0.42} 42%|████████████████████████████████▊ | 287/681 [18:52<16:38, 2.53s/it] 42%|████████████████████████████████▉ | 288/681 [18:55<17:33, 2.68s/it] {'loss': 0.4303, 'grad_norm': 57.62872314453125, 'learning_rate': 3.5913117612644327e-07, 'margin_dpo/margin_mean': 27.80404281616211, 'margin_dpo/margin_std': 21.125944137573242, 'logps/chosen': -72.6466293334961, 'logps/rejected': -131.12515258789062, 'logps/ref_chosen': -56.661224365234375, 'logps/ref_rejected': -87.335693359375, 'logits/chosen': -0.634566605091095, 'logits/rejected': -0.6029102206230164, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.80404281616211, 'margin_dpo/beta_margin_mean': 2.7804043292999268, 'margin_dpo/beta_margin_std': 2.1588449478149414, 'margin_dpo/beta_margin_grad_mean': -0.16040383279323578, 'margin_dpo/beta_margin_grad_std': 0.19727593660354614, 'epoch': 0.42} 42%|████████████████████████████████▉ | 288/681 [18:55<17:33, 2.68s/it] 42%|█████████████████████████████████ | 289/681 [18:57<17:09, 2.63s/it] {'loss': 0.5004, 'grad_norm': 50.83492660522461, 'learning_rate': 3.5797514096221024e-07, 'margin_dpo/margin_mean': 30.28182601928711, 'margin_dpo/margin_std': 29.03339958190918, 'logps/chosen': -61.59012985229492, 'logps/rejected': -134.28424072265625, 'logps/ref_chosen': -45.23039245605469, 'logps/ref_rejected': -87.64266967773438, 'logits/chosen': -0.6417437791824341, 'logits/rejected': -0.6304539442062378, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.28182601928711, 'margin_dpo/beta_margin_mean': 3.0281827449798584, 'margin_dpo/beta_margin_std': 2.936239004135132, 
'margin_dpo/beta_margin_grad_mean': -0.18503758311271667, 'margin_dpo/beta_margin_grad_std': 0.2104889303445816, 'epoch': 0.42} 42%|█████████████████████████████████ | 289/681 [18:57<17:09, 2.63s/it] 43%|█████████████████████████████████▏ | 290/681 [19:00<17:08, 2.63s/it] {'loss': 0.5027, 'grad_norm': 63.28836441040039, 'learning_rate': 3.568162605525952e-07, 'margin_dpo/margin_mean': 31.53160858154297, 'margin_dpo/margin_std': 29.308597564697266, 'logps/chosen': -72.06575775146484, 'logps/rejected': -164.83444213867188, 'logps/ref_chosen': -55.47149658203125, 'logps/ref_rejected': -116.70857238769531, 'logits/chosen': -0.5944575071334839, 'logits/rejected': -0.5908774137496948, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.53160858154297, 'margin_dpo/beta_margin_mean': 3.15316104888916, 'margin_dpo/beta_margin_std': 2.957723617553711, 'margin_dpo/beta_margin_grad_mean': -0.17223544418811798, 'margin_dpo/beta_margin_grad_std': 0.22271078824996948, 'epoch': 0.43} 43%|█████████████████████████████████▏ | 290/681 [19:00<17:08, 2.63s/it] 43%|█████████████████████████████████▎ | 291/681 [19:03<17:01, 2.62s/it] {'loss': 0.4813, 'grad_norm': 56.67517852783203, 'learning_rate': 3.5565456543517485e-07, 'margin_dpo/margin_mean': 27.70128059387207, 'margin_dpo/margin_std': 22.738750457763672, 'logps/chosen': -76.0269775390625, 'logps/rejected': -129.76498413085938, 'logps/ref_chosen': -63.26036834716797, 'logps/ref_rejected': -89.29708862304688, 'logits/chosen': -0.6302033066749573, 'logits/rejected': -0.595551609992981, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.701282501220703, 'margin_dpo/beta_margin_mean': 2.7701282501220703, 'margin_dpo/beta_margin_std': 2.2966012954711914, 'margin_dpo/beta_margin_grad_mean': -0.16477952897548676, 'margin_dpo/beta_margin_grad_std': 0.20427057147026062, 'epoch': 0.43} 43%|█████████████████████████████████▎ | 291/681 [19:03<17:01, 2.62s/it] 43%|█████████████████████████████████▍ | 
292/681 [19:05<16:32, 2.55s/it] {'loss': 0.3934, 'grad_norm': 54.23537063598633, 'learning_rate': 3.5449008622169583e-07, 'margin_dpo/margin_mean': 29.846210479736328, 'margin_dpo/margin_std': 24.429065704345703, 'logps/chosen': -70.70861053466797, 'logps/rejected': -136.59767150878906, 'logps/ref_chosen': -53.91852951049805, 'logps/ref_rejected': -89.96138000488281, 'logits/chosen': -0.6187624931335449, 'logits/rejected': -0.5753225684165955, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.846208572387695, 'margin_dpo/beta_margin_mean': 2.9846208095550537, 'margin_dpo/beta_margin_std': 2.461369752883911, 'margin_dpo/beta_margin_grad_mean': -0.15516723692417145, 'margin_dpo/beta_margin_grad_std': 0.1758795827627182, 'epoch': 0.43} 43%|█████████████████████████████████▍ | 292/681 [19:05<16:32, 2.55s/it] 43%|█████████████████████████████████▌ | 293/681 [19:08<16:44, 2.59s/it] {'loss': 0.5966, 'grad_norm': 52.83697509765625, 'learning_rate': 3.5332285359726846e-07, 'margin_dpo/margin_mean': 24.17245864868164, 'margin_dpo/margin_std': 24.42681121826172, 'logps/chosen': -76.67402648925781, 'logps/rejected': -118.3228988647461, 'logps/ref_chosen': -60.376033782958984, 'logps/ref_rejected': -77.8524398803711, 'logits/chosen': -0.6384230852127075, 'logits/rejected': -0.6081752777099609, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.17245864868164, 'margin_dpo/beta_margin_mean': 2.417245864868164, 'margin_dpo/beta_margin_std': 2.448225259780884, 'margin_dpo/beta_margin_grad_mean': -0.2109437733888626, 'margin_dpo/beta_margin_grad_std': 0.217354878783226, 'epoch': 0.43} 43%|█████████████████████████████████▌ | 293/681 [19:08<16:44, 2.59s/it] 43%|█████████████████████████████████▋ | 294/681 [19:10<17:02, 2.64s/it] {'loss': 0.5153, 'grad_norm': 42.41814041137695, 'learning_rate': 3.5215289831955786e-07, 'margin_dpo/margin_mean': 27.206314086914062, 'margin_dpo/margin_std': 25.83649444580078, 'logps/chosen': 
-62.738616943359375, 'logps/rejected': -123.75438690185547, 'logps/ref_chosen': -48.0875358581543, 'logps/ref_rejected': -81.89698791503906, 'logits/chosen': -0.6331781148910522, 'logits/rejected': -0.6203632354736328, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.206314086914062, 'margin_dpo/beta_margin_mean': 2.7206313610076904, 'margin_dpo/beta_margin_std': 2.606935977935791, 'margin_dpo/beta_margin_grad_mean': -0.1880524605512619, 'margin_dpo/beta_margin_grad_std': 0.2148909568786621, 'epoch': 0.43} 43%|█████████████████████████████████▋ | 294/681 [19:10<17:02, 2.64s/it] 43%|█████████████████████████████████▊ | 295/681 [19:13<16:34, 2.58s/it] {'loss': 0.5905, 'grad_norm': 63.754703521728516, 'learning_rate': 3.509802512179737e-07, 'margin_dpo/margin_mean': 27.146446228027344, 'margin_dpo/margin_std': 24.949363708496094, 'logps/chosen': -68.84889221191406, 'logps/rejected': -133.5269775390625, 'logps/ref_chosen': -49.92467498779297, 'logps/ref_rejected': -87.45632934570312, 'logits/chosen': -0.6102343797683716, 'logits/rejected': -0.6015244722366333, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.14644432067871, 'margin_dpo/beta_margin_mean': 2.714644432067871, 'margin_dpo/beta_margin_std': 2.515841245651245, 'margin_dpo/beta_margin_grad_mean': -0.18437956273555756, 'margin_dpo/beta_margin_grad_std': 0.22511690855026245, 'epoch': 0.43} 43%|█████████████████████████████████▊ | 295/681 [19:13<16:34, 2.58s/it] 43%|█████████████████████████████████▉ | 296/681 [19:15<16:16, 2.54s/it] {'loss': 0.7362, 'grad_norm': 79.74415588378906, 'learning_rate': 3.498049431928577e-07, 'margin_dpo/margin_mean': 23.47943878173828, 'margin_dpo/margin_std': 26.526391983032227, 'logps/chosen': -84.0577392578125, 'logps/rejected': -135.13502502441406, 'logps/ref_chosen': -65.49124145507812, 'logps/ref_rejected': -93.08908081054688, 'logits/chosen': -0.6979824304580688, 'logits/rejected': -0.6591476202011108, 'margin_dpo/beta': 
0.10000000149011612, 'margin_dpo/loss_margin_mean': 23.47943878173828, 'margin_dpo/beta_margin_mean': 2.3479440212249756, 'margin_dpo/beta_margin_std': 2.6585099697113037, 'margin_dpo/beta_margin_grad_mean': -0.23561769723892212, 'margin_dpo/beta_margin_grad_std': 0.25472357869148254, 'epoch': 0.43} 43%|█████████████████████████████████▉ | 296/681 [19:15<16:16, 2.54s/it] 44%|██████████████████████████████████ | 297/681 [19:18<16:32, 2.58s/it] {'loss': 0.426, 'grad_norm': 44.74517059326172, 'learning_rate': 3.486270052146694e-07, 'margin_dpo/margin_mean': 28.571863174438477, 'margin_dpo/margin_std': 23.766578674316406, 'logps/chosen': -74.85836029052734, 'logps/rejected': -142.09182739257812, 'logps/ref_chosen': -56.47694778442383, 'logps/ref_rejected': -95.1385498046875, 'logits/chosen': -0.5774829387664795, 'logits/rejected': -0.5429031848907471, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.571861267089844, 'margin_dpo/beta_margin_mean': 2.8571863174438477, 'margin_dpo/beta_margin_std': 2.378805160522461, 'margin_dpo/beta_margin_grad_mean': -0.16123466193675995, 'margin_dpo/beta_margin_grad_std': 0.1913672685623169, 'epoch': 0.44} 44%|██████████████████████████████████ | 297/681 [19:18<16:32, 2.58s/it] 44%|██████████████████████████████████▏ | 298/681 [19:21<17:10, 2.69s/it] {'loss': 0.4135, 'grad_norm': 44.202003479003906, 'learning_rate': 3.474464683231698e-07, 'margin_dpo/margin_mean': 29.80324935913086, 'margin_dpo/margin_std': 26.537506103515625, 'logps/chosen': -83.96099090576172, 'logps/rejected': -163.1012420654297, 'logps/ref_chosen': -67.32516479492188, 'logps/ref_rejected': -116.66217041015625, 'logits/chosen': -0.6306143999099731, 'logits/rejected': -0.6259936690330505, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.803251266479492, 'margin_dpo/beta_margin_mean': 2.9803249835968018, 'margin_dpo/beta_margin_std': 2.6625173091888428, 'margin_dpo/beta_margin_grad_mean': -0.1614616960287094, 
'margin_dpo/beta_margin_grad_std': 0.180389866232872, 'epoch': 0.44} 44%|██████████████████████████████████▏ | 298/681 [19:21<17:10, 2.69s/it] 44%|██████████████████████████████████▏ | 299/681 [19:24<16:56, 2.66s/it] {'loss': 0.5069, 'grad_norm': 59.32780075073242, 'learning_rate': 3.462633636266041e-07, 'margin_dpo/margin_mean': 31.181495666503906, 'margin_dpo/margin_std': 27.928592681884766, 'logps/chosen': -64.30989837646484, 'logps/rejected': -130.85752868652344, 'logps/ref_chosen': -48.96209716796875, 'logps/ref_rejected': -84.32823944091797, 'logits/chosen': -0.5633834600448608, 'logits/rejected': -0.5420501232147217, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.181493759155273, 'margin_dpo/beta_margin_mean': 3.118149518966675, 'margin_dpo/beta_margin_std': 2.880525588989258, 'margin_dpo/beta_margin_grad_mean': -0.17790299654006958, 'margin_dpo/beta_margin_grad_std': 0.22678081691265106, 'epoch': 0.44} 44%|██████████████████████████████████▏ | 299/681 [19:24<16:56, 2.66s/it] 44%|██████████████████████████████████▎ | 300/681 [19:27<17:34, 2.77s/it] {'loss': 0.7096, 'grad_norm': 81.78619384765625, 'learning_rate': 3.4507772230088147e-07, 'margin_dpo/margin_mean': 29.621217727661133, 'margin_dpo/margin_std': 29.81679344177246, 'logps/chosen': -80.5472183227539, 'logps/rejected': -147.06117248535156, 'logps/ref_chosen': -59.073707580566406, 'logps/ref_rejected': -95.9664535522461, 'logits/chosen': -0.6105576157569885, 'logits/rejected': -0.591549277305603, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.6212158203125, 'margin_dpo/beta_margin_mean': 2.9621217250823975, 'margin_dpo/beta_margin_std': 2.987994909286499, 'margin_dpo/beta_margin_grad_mean': -0.20562496781349182, 'margin_dpo/beta_margin_grad_std': 0.27093952894210815, 'epoch': 0.44} 44%|██████████████████████████████████▎ | 300/681 [19:27<17:34, 2.77s/it][INFO|trainer.py:4307] 2026-04-17 21:45:57,467 >> ***** Running Evaluation ***** 
[INFO|trainer.py:4309] 2026-04-17 21:45:57,467 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-17 21:45:57,467 >> Batch size = 8
0%| | 0/73 [00:00> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-17 21:51:01,212 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-17 21:51:01,212 >> Batch size = 8
0%| | 0/73 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400
[INFO|configuration_utils.py:419] 2026-04-17 21:51:57,594 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400/config.json
[INFO|configuration_utils.py:911] 2026-04-17 21:51:57,601 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-17 21:52:50,158 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-17 21:52:50,168 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-17 21:52:50,173 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400/special_tokens_map.json 59%|████████████████████████████████████████████▏ | 401/681 [30:09<8:02:03, 103.30s/it] {'loss': 0.659, 'grad_norm': 61.670928955078125, 'learning_rate': 2.1800473436235136e-07, 'margin_dpo/margin_mean': 29.519912719726562, 'margin_dpo/margin_std': 31.590171813964844, 'logps/chosen': -76.15703582763672, 'logps/rejected': -132.30641174316406, 'logps/ref_chosen': -57.16303253173828, 'logps/ref_rejected': -83.79249572753906, 'logits/chosen': -0.5769657492637634, 'logits/rejected': -0.5573090314865112, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.519914627075195, 'margin_dpo/beta_margin_mean': 2.951991558074951, 'margin_dpo/beta_margin_std': 3.1654133796691895, 'margin_dpo/beta_margin_grad_mean': -0.2077464908361435, 'margin_dpo/beta_margin_grad_std': 0.25239863991737366, 'epoch': 0.59} 59%|████████████████████████████████████████████▏ | 401/681 [30:09<8:02:03, 103.30s/it] 59%|████████████████████████████████████████████▊ | 402/681 [30:11<5:39:28, 73.00s/it] {'loss': 0.2132, 'grad_norm': 26.211894989013672, 'learning_rate': 2.1673238449588665e-07, 'margin_dpo/margin_mean': 38.97754669189453, 'margin_dpo/margin_std': 23.83334732055664, 'logps/chosen': -62.62638854980469, 'logps/rejected': -131.90960693359375, 'logps/ref_chosen': -50.74037170410156, 'logps/ref_rejected': -81.0460433959961, 'logits/chosen': -0.6328971982002258, 'logits/rejected': -0.584295392036438, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 38.97754669189453, 
'margin_dpo/beta_margin_mean': 3.8977549076080322, 'margin_dpo/beta_margin_std': 2.383517265319824, 'margin_dpo/beta_margin_grad_mean': -0.08531676232814789, 'margin_dpo/beta_margin_grad_std': 0.13695916533470154, 'epoch': 0.59} 59%|████████████████████████████████████████████▊ | 402/681 [30:11<5:39:28, 73.00s/it] 59%|████████████████████████████████████████████▉ | 403/681 [30:14<4:00:10, 51.84s/it] {'loss': 0.5741, 'grad_norm': 63.287567138671875, 'learning_rate': 2.154609112620295e-07, 'margin_dpo/margin_mean': 30.17224884033203, 'margin_dpo/margin_std': 28.130752563476562, 'logps/chosen': -62.53410339355469, 'logps/rejected': -122.82563781738281, 'logps/ref_chosen': -47.14731216430664, 'logps/ref_rejected': -77.2666015625, 'logits/chosen': -0.6422700881958008, 'logits/rejected': -0.6241501569747925, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.1722469329834, 'margin_dpo/beta_margin_mean': 3.0172247886657715, 'margin_dpo/beta_margin_std': 2.846843957901001, 'margin_dpo/beta_margin_grad_mean': -0.1763564795255661, 'margin_dpo/beta_margin_grad_std': 0.23426702618598938, 'epoch': 0.59} 59%|████████████████████████████████████████████▉ | 403/681 [30:14<4:00:10, 51.84s/it] 59%|█████████████████████████████████████████████ | 404/681 [30:16<2:50:56, 37.03s/it] {'loss': 0.5739, 'grad_norm': 54.917449951171875, 'learning_rate': 2.1419034816528218e-07, 'margin_dpo/margin_mean': 30.53654670715332, 'margin_dpo/margin_std': 28.8435115814209, 'logps/chosen': -63.40578079223633, 'logps/rejected': -123.22205352783203, 'logps/ref_chosen': -47.875274658203125, 'logps/ref_rejected': -77.15499877929688, 'logits/chosen': -0.6123020648956299, 'logits/rejected': -0.578801155090332, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.53654670715332, 'margin_dpo/beta_margin_mean': 3.053654670715332, 'margin_dpo/beta_margin_std': 2.8906707763671875, 'margin_dpo/beta_margin_grad_mean': -0.18759508430957794, 
'margin_dpo/beta_margin_grad_std': 0.23627623915672302, 'epoch': 0.59} 59%|█████████████████████████████████████████████ | 404/681 [30:16<2:50:56, 37.03s/it] 59%|█████████████████████████████████████████████▏ | 405/681 [30:18<2:02:30, 26.63s/it] {'loss': 0.5427, 'grad_norm': 64.9948501586914, 'learning_rate': 2.129207286861638e-07, 'margin_dpo/margin_mean': 30.217140197753906, 'margin_dpo/margin_std': 27.509521484375, 'logps/chosen': -84.49642944335938, 'logps/rejected': -136.73745727539062, 'logps/ref_chosen': -65.16290283203125, 'logps/ref_rejected': -87.18678283691406, 'logits/chosen': -0.5796902179718018, 'logits/rejected': -0.549854040145874, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.217140197753906, 'margin_dpo/beta_margin_mean': 3.021714210510254, 'margin_dpo/beta_margin_std': 2.9265987873077393, 'margin_dpo/beta_margin_grad_mean': -0.18682169914245605, 'margin_dpo/beta_margin_grad_std': 0.22749853134155273, 'epoch': 0.59} 59%|█████████████████████████████████████████████▏ | 405/681 [30:18<2:02:30, 26.63s/it] 60%|█████████████████████████████████████████████▎ | 406/681 [30:21<1:29:10, 19.46s/it] {'loss': 0.5435, 'grad_norm': 61.627079010009766, 'learning_rate': 2.1165208628032861e-07, 'margin_dpo/margin_mean': 31.772830963134766, 'margin_dpo/margin_std': 27.909154891967773, 'logps/chosen': -66.44183349609375, 'logps/rejected': -140.552490234375, 'logps/ref_chosen': -49.740814208984375, 'logps/ref_rejected': -92.07862854003906, 'logits/chosen': -0.6366710662841797, 'logits/rejected': -0.6224513649940491, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.772830963134766, 'margin_dpo/beta_margin_mean': 3.1772830486297607, 'margin_dpo/beta_margin_std': 2.8382697105407715, 'margin_dpo/beta_margin_grad_mean': -0.1675260215997696, 'margin_dpo/beta_margin_grad_std': 0.22542835772037506, 'epoch': 0.6} 60%|█████████████████████████████████████████████▎ | 406/681 [30:21<1:29:10, 19.46s/it] 
60%|█████████████████████████████████████████████▍ | 407/681 [30:24<1:05:55, 14.44s/it] {'loss': 0.6107, 'grad_norm': 68.63170623779297, 'learning_rate': 2.1038445437768375e-07, 'margin_dpo/margin_mean': 32.281612396240234, 'margin_dpo/margin_std': 29.094558715820312, 'logps/chosen': -72.40534973144531, 'logps/rejected': -125.86834716796875, 'logps/ref_chosen': -56.33069610595703, 'logps/ref_rejected': -77.5120849609375, 'logits/chosen': -0.6445499062538147, 'logits/rejected': -0.599348783493042, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.281612396240234, 'margin_dpo/beta_margin_mean': 3.228161334991455, 'margin_dpo/beta_margin_std': 2.978395462036133, 'margin_dpo/beta_margin_grad_mean': -0.18231885135173798, 'margin_dpo/beta_margin_grad_std': 0.25164178013801575, 'epoch': 0.6} 60%|█████████████████████████████████████████████▍ | 407/681 [30:24<1:05:55, 14.44s/it] 60%|██████████████████████████████████████████████▋ | 408/681 [30:27<49:55, 10.97s/it] {'loss': 0.6172, 'grad_norm': 81.44627380371094, 'learning_rate': 2.0911786638150872e-07, 'margin_dpo/margin_mean': 27.853038787841797, 'margin_dpo/margin_std': 27.155353546142578, 'logps/chosen': -85.27023315429688, 'logps/rejected': -133.43089294433594, 'logps/ref_chosen': -69.789306640625, 'logps/ref_rejected': -90.09693908691406, 'logits/chosen': -0.6902725696563721, 'logits/rejected': -0.6373718976974487, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.853038787841797, 'margin_dpo/beta_margin_mean': 2.785304069519043, 'margin_dpo/beta_margin_std': 2.7439894676208496, 'margin_dpo/beta_margin_grad_mean': -0.20204903185367584, 'margin_dpo/beta_margin_grad_std': 0.25068265199661255, 'epoch': 0.6} 60%|██████████████████████████████████████████████▋ | 408/681 [30:27<49:55, 10.97s/it] 60%|██████████████████████████████████████████████▊ | 409/681 [30:29<38:30, 8.49s/it] {'loss': 0.4121, 'grad_norm': 49.702667236328125, 'learning_rate': 2.0785235566757517e-07, 
'margin_dpo/margin_mean': 30.92287254333496, 'margin_dpo/margin_std': 25.64594078063965, 'logps/chosen': -84.24601745605469, 'logps/rejected': -132.7557373046875, 'logps/ref_chosen': -67.31744384765625, 'logps/ref_rejected': -84.904296875, 'logits/chosen': -0.6016473770141602, 'logits/rejected': -0.5694031119346619, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.92287254333496, 'margin_dpo/beta_margin_mean': 3.092287302017212, 'margin_dpo/beta_margin_std': 2.565514326095581, 'margin_dpo/beta_margin_grad_mean': -0.1540304273366928, 'margin_dpo/beta_margin_grad_std': 0.19426687061786652, 'epoch': 0.6} 60%|██████████████████████████████████████████████▊ | 409/681 [30:30<38:30, 8.49s/it] 60%|██████████████████████████████████████████████▉ | 410/681 [30:32<30:19, 6.71s/it] {'loss': 0.5957, 'grad_norm': 67.67236328125, 'learning_rate': 2.065879555832674e-07, 'margin_dpo/margin_mean': 27.859020233154297, 'margin_dpo/margin_std': 26.202781677246094, 'logps/chosen': -70.31283569335938, 'logps/rejected': -129.9054718017578, 'logps/ref_chosen': -51.465354919433594, 'logps/ref_rejected': -83.198974609375, 'logits/chosen': -0.6346931457519531, 'logits/rejected': -0.6326348781585693, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.859020233154297, 'margin_dpo/beta_margin_mean': 2.7859020233154297, 'margin_dpo/beta_margin_std': 2.646648406982422, 'margin_dpo/beta_margin_grad_mean': -0.20088441669940948, 'margin_dpo/beta_margin_grad_std': 0.24235385656356812, 'epoch': 0.6} 60%|██████████████████████████████████████████████▉ | 410/681 [30:32<30:19, 6.71s/it] 60%|███████████████████████████████████████████████ | 411/681 [30:34<24:09, 5.37s/it] {'loss': 0.5393, 'grad_norm': 57.020423889160156, 'learning_rate': 2.0532469944670343e-07, 'margin_dpo/margin_mean': 29.69510841369629, 'margin_dpo/margin_std': 27.609901428222656, 'logps/chosen': -71.45536041259766, 'logps/rejected': -129.53814697265625, 'logps/ref_chosen': 
-52.30727005004883, 'logps/ref_rejected': -80.69495391845703, 'logits/chosen': -0.6736893653869629, 'logits/rejected': -0.640461802482605, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.695110321044922, 'margin_dpo/beta_margin_mean': 2.969511032104492, 'margin_dpo/beta_margin_std': 2.880366563796997, 'margin_dpo/beta_margin_grad_mean': -0.1864738166332245, 'margin_dpo/beta_margin_grad_std': 0.2311916947364807, 'epoch': 0.6} 60%|███████████████████████████████████████████████ | 411/681 [30:34<24:09, 5.37s/it] 60%|███████████████████████████████████████████████▏ | 412/681 [30:37<19:59, 4.46s/it] {'loss': 0.501, 'grad_norm': 41.312705993652344, 'learning_rate': 2.0406262054585738e-07, 'margin_dpo/margin_mean': 29.459096908569336, 'margin_dpo/margin_std': 27.16181182861328, 'logps/chosen': -68.71327209472656, 'logps/rejected': -145.08905029296875, 'logps/ref_chosen': -53.144126892089844, 'logps/ref_rejected': -100.06080627441406, 'logits/chosen': -0.702052652835846, 'logits/rejected': -0.6910427808761597, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.459095001220703, 'margin_dpo/beta_margin_mean': 2.9459095001220703, 'margin_dpo/beta_margin_std': 2.7281651496887207, 'margin_dpo/beta_margin_grad_mean': -0.18716482818126678, 'margin_dpo/beta_margin_grad_std': 0.21060419082641602, 'epoch': 0.6} 60%|███████████████████████████████████████████████▏ | 412/681 [30:37<19:59, 4.46s/it] 61%|███████████████████████████████████████████████▎ | 413/681 [30:39<17:33, 3.93s/it] {'loss': 0.4911, 'grad_norm': 59.28904724121094, 'learning_rate': 2.0280175213768205e-07, 'margin_dpo/margin_mean': 29.902387619018555, 'margin_dpo/margin_std': 25.28069496154785, 'logps/chosen': -80.49532318115234, 'logps/rejected': -148.28915405273438, 'logps/ref_chosen': -61.58196258544922, 'logps/ref_rejected': -99.47340393066406, 'logits/chosen': -0.5773541927337646, 'logits/rejected': -0.5431898832321167, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 29.902387619018555, 'margin_dpo/beta_margin_mean': 2.990238666534424, 'margin_dpo/beta_margin_std': 2.5804879665374756, 'margin_dpo/beta_margin_grad_mean': -0.15901578962802887, 'margin_dpo/beta_margin_grad_std': 0.21321162581443787, 'epoch': 0.61} 61%|███████████████████████████████████████████████▎ | 413/681 [30:39<17:33, 3.93s/it] 61%|███████████████████████████████████████████████▍ | 414/681 [30:42<15:39, 3.52s/it] {'loss': 0.3637, 'grad_norm': 55.32724380493164, 'learning_rate': 2.0154212744723247e-07, 'margin_dpo/margin_mean': 35.684043884277344, 'margin_dpo/margin_std': 25.48971176147461, 'logps/chosen': -62.55944061279297, 'logps/rejected': -139.25851440429688, 'logps/ref_chosen': -46.63148880004883, 'logps/ref_rejected': -87.64652252197266, 'logits/chosen': -0.6178678274154663, 'logits/rejected': -0.580098032951355, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.684043884277344, 'margin_dpo/beta_margin_mean': 3.56840443611145, 'margin_dpo/beta_margin_std': 2.785224437713623, 'margin_dpo/beta_margin_grad_mean': -0.13259248435497284, 'margin_dpo/beta_margin_grad_std': 0.19208675622940063, 'epoch': 0.61} 61%|███████████████████████████████████████████████▍ | 414/681 [30:42<15:39, 3.52s/it] 61%|███████████████████████████████████████████████▌ | 415/681 [30:45<14:36, 3.29s/it] {'loss': 0.3982, 'grad_norm': 44.93287658691406, 'learning_rate': 2.002837796667909e-07, 'margin_dpo/margin_mean': 29.6812686920166, 'margin_dpo/margin_std': 24.717784881591797, 'logps/chosen': -95.38108825683594, 'logps/rejected': -146.9215850830078, 'logps/ref_chosen': -78.6182861328125, 'logps/ref_rejected': -100.47752380371094, 'logits/chosen': -0.5938626527786255, 'logits/rejected': -0.5675798654556274, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.68126678466797, 'margin_dpo/beta_margin_mean': 2.9681267738342285, 'margin_dpo/beta_margin_std': 2.4819741249084473, 'margin_dpo/beta_margin_grad_mean': 
-0.1585043966770172, 'margin_dpo/beta_margin_grad_std': 0.17520886659622192, 'epoch': 0.61} 61%|███████████████████████████████████████████████▌ | 415/681 [30:45<14:36, 3.29s/it] 61%|███████████████████████████████████████████████▋ | 416/681 [30:47<13:33, 3.07s/it] {'loss': 0.3876, 'grad_norm': 49.30588150024414, 'learning_rate': 1.990267419549914e-07, 'margin_dpo/margin_mean': 36.51543426513672, 'margin_dpo/margin_std': 27.713136672973633, 'logps/chosen': -75.66851806640625, 'logps/rejected': -144.47354125976562, 'logps/ref_chosen': -58.27912521362305, 'logps/ref_rejected': -90.56871795654297, 'logits/chosen': -0.6397312879562378, 'logits/rejected': -0.6059544086456299, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.51543426513672, 'margin_dpo/beta_margin_mean': 3.651543617248535, 'margin_dpo/beta_margin_std': 2.8901610374450684, 'margin_dpo/beta_margin_grad_mean': -0.1345895677804947, 'margin_dpo/beta_margin_grad_std': 0.2077297866344452, 'epoch': 0.61} 61%|███████████████████████████████████████████████▋ | 416/681 [30:47<13:33, 3.07s/it] 61%|███████████████████████████████████████████████▊ | 417/681 [30:50<12:51, 2.92s/it] {'loss': 0.3154, 'grad_norm': 38.555389404296875, 'learning_rate': 1.9777104743594686e-07, 'margin_dpo/margin_mean': 34.968475341796875, 'margin_dpo/margin_std': 23.240657806396484, 'logps/chosen': -66.67837524414062, 'logps/rejected': -119.5999755859375, 'logps/ref_chosen': -50.1987190246582, 'logps/ref_rejected': -68.15184020996094, 'logits/chosen': -0.6252127289772034, 'logits/rejected': -0.5568169355392456, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.968475341796875, 'margin_dpo/beta_margin_mean': 3.496847629547119, 'margin_dpo/beta_margin_std': 2.3686065673828125, 'margin_dpo/beta_margin_grad_mean': -0.12362627685070038, 'margin_dpo/beta_margin_grad_std': 0.17288993299007416, 'epoch': 0.61} 61%|███████████████████████████████████████████████▊ | 417/681 [30:50<12:51, 2.92s/it] 
61%|███████████████████████████████████████████████▉ | 418/681 [30:53<12:49, 2.93s/it] {'loss': 0.5663, 'grad_norm': 64.83741760253906, 'learning_rate': 1.965167291983757e-07, 'margin_dpo/margin_mean': 34.140960693359375, 'margin_dpo/margin_std': 31.345539093017578, 'logps/chosen': -99.16204833984375, 'logps/rejected': -156.01602172851562, 'logps/ref_chosen': -81.97846984863281, 'logps/ref_rejected': -104.69148254394531, 'logits/chosen': -0.6693556904792786, 'logits/rejected': -0.6072407960891724, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.140960693359375, 'margin_dpo/beta_margin_mean': 3.4140961170196533, 'margin_dpo/beta_margin_std': 3.2014167308807373, 'margin_dpo/beta_margin_grad_mean': -0.16488902270793915, 'margin_dpo/beta_margin_grad_std': 0.2364022433757782, 'epoch': 0.61} 61%|███████████████████████████████████████████████▉ | 418/681 [30:53<12:49, 2.93s/it] 62%|███████████████████████████████████████████████▉ | 419/681 [30:55<12:27, 2.85s/it] {'loss': 0.3088, 'grad_norm': 46.96452331542969, 'learning_rate': 1.9526382029472988e-07, 'margin_dpo/margin_mean': 33.94792938232422, 'margin_dpo/margin_std': 23.982418060302734, 'logps/chosen': -70.24662780761719, 'logps/rejected': -142.82901000976562, 'logps/ref_chosen': -52.948646545410156, 'logps/ref_rejected': -91.58309936523438, 'logits/chosen': -0.5874903202056885, 'logits/rejected': -0.5439319610595703, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.94792938232422, 'margin_dpo/beta_margin_mean': 3.3947930335998535, 'margin_dpo/beta_margin_std': 2.4165024757385254, 'margin_dpo/beta_margin_grad_mean': -0.11957548558712006, 'margin_dpo/beta_margin_grad_std': 0.16336920857429504, 'epoch': 0.62} 62%|███████████████████████████████████████████████▉ | 419/681 [30:55<12:27, 2.85s/it] 62%|████████████████████████████████████████████████ | 420/681 [30:58<12:08, 2.79s/it] {'loss': 0.4567, 'grad_norm': 61.41410827636719, 'learning_rate': 1.9401235374032425e-07, 
'margin_dpo/margin_mean': 32.95981979370117, 'margin_dpo/margin_std': 27.405033111572266, 'logps/chosen': -95.875244140625, 'logps/rejected': -120.385009765625, 'logps/ref_chosen': -77.7699203491211, 'logps/ref_rejected': -69.31985473632812, 'logits/chosen': -0.6708568930625916, 'logits/rejected': -0.594412624835968, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.95982360839844, 'margin_dpo/beta_margin_mean': 3.2959823608398438, 'margin_dpo/beta_margin_std': 2.774785041809082, 'margin_dpo/beta_margin_grad_mean': -0.1499803215265274, 'margin_dpo/beta_margin_grad_std': 0.21647407114505768, 'epoch': 0.62} 62%|████████████████████████████████████████████████ | 420/681 [30:58<12:08, 2.79s/it] 62%|████████████████████████████████████████████████▏ | 421/681 [31:01<11:55, 2.75s/it] {'loss': 0.6226, 'grad_norm': 79.4913330078125, 'learning_rate': 1.9276236251246653e-07, 'margin_dpo/margin_mean': 27.947509765625, 'margin_dpo/margin_std': 26.780242919921875, 'logps/chosen': -73.95745849609375, 'logps/rejected': -137.42054748535156, 'logps/ref_chosen': -53.765865325927734, 'logps/ref_rejected': -89.28144836425781, 'logits/chosen': -0.6430982351303101, 'logits/rejected': -0.6089684963226318, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.947509765625, 'margin_dpo/beta_margin_mean': 2.794750928878784, 'margin_dpo/beta_margin_std': 2.7135543823242188, 'margin_dpo/beta_margin_grad_mean': -0.1968107521533966, 'margin_dpo/beta_margin_grad_std': 0.24985744059085846, 'epoch': 0.62} 62%|████████████████████████████████████████████████▏ | 421/681 [31:01<11:55, 2.75s/it] 62%|████████████████████████████████████████████████▎ | 422/681 [31:03<11:56, 2.77s/it] {'loss': 0.5663, 'grad_norm': 66.62350463867188, 'learning_rate': 1.9151387954958792e-07, 'margin_dpo/margin_mean': 30.136600494384766, 'margin_dpo/margin_std': 28.641185760498047, 'logps/chosen': -89.37240600585938, 'logps/rejected': -138.73875427246094, 'logps/ref_chosen': 
-68.6337661743164, 'logps/ref_rejected': -87.86351013183594, 'logits/chosen': -0.6613567471504211, 'logits/rejected': -0.6198326349258423, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.136600494384766, 'margin_dpo/beta_margin_mean': 3.013660192489624, 'margin_dpo/beta_margin_std': 2.8644890785217285, 'margin_dpo/beta_margin_grad_mean': -0.1885911077260971, 'margin_dpo/beta_margin_grad_std': 0.2437172681093216, 'epoch': 0.62} 62%|████████████████████████████████████████████████▎ | 422/681 [31:04<11:56, 2.77s/it] 62%|████████████████████████████████████████████████▍ | 423/681 [31:06<11:26, 2.66s/it] {'loss': 0.5527, 'grad_norm': 66.34683227539062, 'learning_rate': 1.902669377503756e-07, 'margin_dpo/margin_mean': 31.304006576538086, 'margin_dpo/margin_std': 29.631959915161133, 'logps/chosen': -74.14385986328125, 'logps/rejected': -136.7641143798828, 'logps/ref_chosen': -54.99030303955078, 'logps/ref_rejected': -86.30654907226562, 'logits/chosen': -0.6761616468429565, 'logits/rejected': -0.6586691737174988, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.304006576538086, 'margin_dpo/beta_margin_mean': 3.1304006576538086, 'margin_dpo/beta_margin_std': 2.9750967025756836, 'margin_dpo/beta_margin_grad_mean': -0.18222779035568237, 'margin_dpo/beta_margin_grad_std': 0.2336231768131256, 'epoch': 0.62} 62%|████████████████████████████████████████████████▍ | 423/681 [31:06<11:26, 2.66s/it] 62%|████████████████████████████████████████████████▌ | 424/681 [31:09<11:23, 2.66s/it] {'loss': 0.4263, 'grad_norm': 48.2248649597168, 'learning_rate': 1.890215699729057e-07, 'margin_dpo/margin_mean': 34.16087341308594, 'margin_dpo/margin_std': 30.704998016357422, 'logps/chosen': -73.47090148925781, 'logps/rejected': -118.09882354736328, 'logps/ref_chosen': -56.01191711425781, 'logps/ref_rejected': -66.47896575927734, 'logits/chosen': -0.6284000873565674, 'logits/rejected': -0.5798854231834412, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 34.16087341308594, 'margin_dpo/beta_margin_mean': 3.4160873889923096, 'margin_dpo/beta_margin_std': 3.078399419784546, 'margin_dpo/beta_margin_grad_mean': -0.15188807249069214, 'margin_dpo/beta_margin_grad_std': 0.21018096804618835, 'epoch': 0.62} 62%|████████████████████████████████████████████████▌ | 424/681 [31:09<11:23, 2.66s/it] 62%|████████████████████████████████████████████████▋ | 425/681 [31:11<11:10, 2.62s/it] {'loss': 0.5067, 'grad_norm': 56.79523849487305, 'learning_rate': 1.8777780903377732e-07, 'margin_dpo/margin_mean': 30.633705139160156, 'margin_dpo/margin_std': 24.710655212402344, 'logps/chosen': -65.49158477783203, 'logps/rejected': -145.18174743652344, 'logps/ref_chosen': -46.868995666503906, 'logps/ref_rejected': -95.92545318603516, 'logits/chosen': -0.6415660381317139, 'logits/rejected': -0.6306988000869751, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.63370704650879, 'margin_dpo/beta_margin_mean': 3.063370704650879, 'margin_dpo/beta_margin_std': 2.508201837539673, 'margin_dpo/beta_margin_grad_mean': -0.16396918892860413, 'margin_dpo/beta_margin_grad_std': 0.22638258337974548, 'epoch': 0.62} 62%|████████████████████████████████████████████████▋ | 425/681 [31:11<11:10, 2.62s/it] 63%|████████████████████████████████████████████████▊ | 426/681 [31:14<11:15, 2.65s/it] {'loss': 0.4413, 'grad_norm': 73.21717071533203, 'learning_rate': 1.8653568770724803e-07, 'margin_dpo/margin_mean': 33.87653732299805, 'margin_dpo/margin_std': 26.354013442993164, 'logps/chosen': -93.59241485595703, 'logps/rejected': -132.15199279785156, 'logps/ref_chosen': -76.58354187011719, 'logps/ref_rejected': -81.26658630371094, 'logits/chosen': -0.6280812621116638, 'logits/rejected': -0.5743027925491333, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.87653732299805, 'margin_dpo/beta_margin_mean': 3.3876538276672363, 'margin_dpo/beta_margin_std': 2.695366144180298, 
'margin_dpo/beta_margin_grad_mean': -0.13312982022762299, 'margin_dpo/beta_margin_grad_std': 0.21179711818695068, 'epoch': 0.63} 63%|████████████████████████████████████████████████▊ | 426/681 [31:14<11:15, 2.65s/it] 63%|████████████████████████████████████████████████▉ | 427/681 [31:16<11:12, 2.65s/it] {'loss': 0.5885, 'grad_norm': 56.27901840209961, 'learning_rate': 1.8529523872436977e-07, 'margin_dpo/margin_mean': 24.72673797607422, 'margin_dpo/margin_std': 23.543621063232422, 'logps/chosen': -81.7194595336914, 'logps/rejected': -120.1583251953125, 'logps/ref_chosen': -64.8538818359375, 'logps/ref_rejected': -78.56600952148438, 'logits/chosen': -0.6733847856521606, 'logits/rejected': -0.6199424266815186, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 24.72673797607422, 'margin_dpo/beta_margin_mean': 2.4726738929748535, 'margin_dpo/beta_margin_std': 2.3600223064422607, 'margin_dpo/beta_margin_grad_mean': -0.190776988863945, 'margin_dpo/beta_margin_grad_std': 0.20414692163467407, 'epoch': 0.63} 63%|████████████████████████████████████████████████▉ | 427/681 [31:16<11:12, 2.65s/it] 63%|█████████████████████████████████████████████████ | 428/681 [31:19<11:15, 2.67s/it] {'loss': 0.3299, 'grad_norm': 44.09659957885742, 'learning_rate': 1.8405649477212697e-07, 'margin_dpo/margin_mean': 35.59562683105469, 'margin_dpo/margin_std': 27.276784896850586, 'logps/chosen': -83.10867309570312, 'logps/rejected': -159.34945678710938, 'logps/ref_chosen': -62.63666534423828, 'logps/ref_rejected': -103.28182220458984, 'logits/chosen': -0.6260280609130859, 'logits/rejected': -0.5897619724273682, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.59562683105469, 'margin_dpo/beta_margin_mean': 3.559562921524048, 'margin_dpo/beta_margin_std': 2.730299234390259, 'margin_dpo/beta_margin_grad_mean': -0.124484583735466, 'margin_dpo/beta_margin_grad_std': 0.1756177842617035, 'epoch': 0.63} 63%|█████████████████████████████████████████████████ | 
428/681 [31:19<11:15, 2.67s/it] 63%|█████████████████████████████████████████████████▏ | 429/681 [31:22<11:27, 2.73s/it] {'loss': 0.595, 'grad_norm': 61.60802459716797, 'learning_rate': 1.828194884925749e-07, 'margin_dpo/margin_mean': 29.67691421508789, 'margin_dpo/margin_std': 28.60194969177246, 'logps/chosen': -101.16323852539062, 'logps/rejected': -141.40106201171875, 'logps/ref_chosen': -81.23401641845703, 'logps/ref_rejected': -91.79493713378906, 'logits/chosen': -0.636346697807312, 'logits/rejected': -0.5803790092468262, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.67691421508789, 'margin_dpo/beta_margin_mean': 2.967691421508789, 'margin_dpo/beta_margin_std': 2.905752182006836, 'margin_dpo/beta_margin_grad_mean': -0.1930510401725769, 'margin_dpo/beta_margin_grad_std': 0.24151724576950073, 'epoch': 0.63} 63%|█████████████████████████████████████████████████▏ | 429/681 [31:22<11:27, 2.73s/it] 63%|█████████████████████████████████████████████████▎ | 430/681 [31:25<11:57, 2.86s/it] {'loss': 0.4761, 'grad_norm': 51.62448501586914, 'learning_rate': 1.8158425248197928e-07, 'margin_dpo/margin_mean': 30.790908813476562, 'margin_dpo/margin_std': 26.328550338745117, 'logps/chosen': -79.01585388183594, 'logps/rejected': -153.30923461914062, 'logps/ref_chosen': -60.92032241821289, 'logps/ref_rejected': -104.42280578613281, 'logits/chosen': -0.6227689981460571, 'logits/rejected': -0.6045354008674622, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.790908813476562, 'margin_dpo/beta_margin_mean': 3.0790908336639404, 'margin_dpo/beta_margin_std': 2.675757884979248, 'margin_dpo/beta_margin_grad_mean': -0.1646682769060135, 'margin_dpo/beta_margin_grad_std': 0.2231719046831131, 'epoch': 0.63} 63%|█████████████████████████████████████████████████▎ | 430/681 [31:25<11:57, 2.86s/it] 63%|█████████████████████████████████████████████████▎ | 431/681 [31:28<11:46, 2.83s/it] {'loss': 0.3416, 'grad_norm': 45.01468276977539, 
'learning_rate': 1.8035081928995788e-07, 'margin_dpo/margin_mean': 34.62909698486328, 'margin_dpo/margin_std': 26.410173416137695, 'logps/chosen': -76.03721618652344, 'logps/rejected': -146.1577911376953, 'logps/ref_chosen': -57.348751068115234, 'logps/ref_rejected': -92.84022521972656, 'logits/chosen': -0.6120933294296265, 'logits/rejected': -0.5965217351913452, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.62909698486328, 'margin_dpo/beta_margin_mean': 3.4629099369049072, 'margin_dpo/beta_margin_std': 2.6419167518615723, 'margin_dpo/beta_margin_grad_mean': -0.13492785394191742, 'margin_dpo/beta_margin_grad_std': 0.17364878952503204, 'epoch': 0.63} 63%|█████████████████████████████████████████████████▎ | 431/681 [31:28<11:46, 2.83s/it] 63%|█████████████████████████████████████████████████▍ | 432/681 [31:31<11:40, 2.81s/it] {'loss': 0.4396, 'grad_norm': 55.2720947265625, 'learning_rate': 1.791192214186223e-07, 'margin_dpo/margin_mean': 32.3117790222168, 'margin_dpo/margin_std': 27.102590560913086, 'logps/chosen': -88.92323303222656, 'logps/rejected': -148.73974609375, 'logps/ref_chosen': -71.07479095458984, 'logps/ref_rejected': -98.57952880859375, 'logits/chosen': -0.6020532250404358, 'logits/rejected': -0.5625859498977661, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.3117790222168, 'margin_dpo/beta_margin_mean': 3.231178045272827, 'margin_dpo/beta_margin_std': 2.7881994247436523, 'margin_dpo/beta_margin_grad_mean': -0.1505272537469864, 'margin_dpo/beta_margin_grad_std': 0.20940996706485748, 'epoch': 0.63} 63%|█████████████████████████████████████████████████▍ | 432/681 [31:31<11:40, 2.81s/it] 64%|█████████████████████████████████████████████████▌ | 433/681 [31:33<11:12, 2.71s/it] {'loss': 0.5849, 'grad_norm': 71.04937744140625, 'learning_rate': 1.7788949132172193e-07, 'margin_dpo/margin_mean': 28.365665435791016, 'margin_dpo/margin_std': 26.324649810791016, 'logps/chosen': -81.66122436523438, 
'logps/rejected': -147.70458984375, 'logps/ref_chosen': -58.273193359375, 'logps/ref_rejected': -95.95089721679688, 'logits/chosen': -0.6384241580963135, 'logits/rejected': -0.6068836450576782, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.365663528442383, 'margin_dpo/beta_margin_mean': 2.83656644821167, 'margin_dpo/beta_margin_std': 2.67340350151062, 'margin_dpo/beta_margin_grad_mean': -0.19293591380119324, 'margin_dpo/beta_margin_grad_std': 0.2420828938484192, 'epoch': 0.64} 64%|█████████████████████████████████████████████████▌ | 433/681 [31:33<11:12, 2.71s/it] 64%|█████████████████████████████████████████████████▋ | 434/681 [31:36<11:01, 2.68s/it] {'loss': 0.4218, 'grad_norm': 48.197303771972656, 'learning_rate': 1.7666166140378853e-07, 'margin_dpo/margin_mean': 29.513980865478516, 'margin_dpo/margin_std': 25.25749969482422, 'logps/chosen': -79.50520324707031, 'logps/rejected': -125.54408264160156, 'logps/ref_chosen': -61.97370147705078, 'logps/ref_rejected': -78.49861145019531, 'logits/chosen': -0.6621353626251221, 'logits/rejected': -0.6182979345321655, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.513980865478516, 'margin_dpo/beta_margin_mean': 2.9513981342315674, 'margin_dpo/beta_margin_std': 2.5280110836029053, 'margin_dpo/beta_margin_grad_mean': -0.15651409327983856, 'margin_dpo/beta_margin_grad_std': 0.19872474670410156, 'epoch': 0.64} 64%|█████████████████████████████████████████████████▋ | 434/681 [31:36<11:01, 2.68s/it] 64%|█████████████████████████████████████████████████▊ | 435/681 [31:38<10:27, 2.55s/it] {'loss': 0.5053, 'grad_norm': 63.86077117919922, 'learning_rate': 1.7543576401928218e-07, 'margin_dpo/margin_mean': 32.3472900390625, 'margin_dpo/margin_std': 29.455238342285156, 'logps/chosen': -69.592041015625, 'logps/rejected': -138.00416564941406, 'logps/ref_chosen': -51.502052307128906, 'logps/ref_rejected': -87.56689453125, 'logits/chosen': -0.6548939943313599, 'logits/rejected': 
-0.6191599369049072, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.3472900390625, 'margin_dpo/beta_margin_mean': 3.234729051589966, 'margin_dpo/beta_margin_std': 2.9603843688964844, 'margin_dpo/beta_margin_grad_mean': -0.1661788821220398, 'margin_dpo/beta_margin_grad_std': 0.21013152599334717, 'epoch': 0.64} 64%|█████████████████████████████████████████████████▊ | 435/681 [31:38<10:27, 2.55s/it] 64%|█████████████████████████████████████████████████▉ | 436/681 [31:41<10:30, 2.57s/it] {'loss': 0.3539, 'grad_norm': 40.332698822021484, 'learning_rate': 1.742118314717391e-07, 'margin_dpo/margin_mean': 31.527891159057617, 'margin_dpo/margin_std': 24.248245239257812, 'logps/chosen': -88.88678741455078, 'logps/rejected': -131.7387237548828, 'logps/ref_chosen': -71.40371704101562, 'logps/ref_rejected': -82.72775268554688, 'logits/chosen': -0.632080078125, 'logits/rejected': -0.5719594955444336, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.527891159057617, 'margin_dpo/beta_margin_mean': 3.152789354324341, 'margin_dpo/beta_margin_std': 2.4287147521972656, 'margin_dpo/beta_margin_grad_mean': -0.13681164383888245, 'margin_dpo/beta_margin_grad_std': 0.17888766527175903, 'epoch': 0.64} 64%|█████████████████████████████████████████████████▉ | 436/681 [31:41<10:30, 2.57s/it] 64%|██████████████████████████████████████████████████ | 437/681 [31:43<10:42, 2.63s/it] {'loss': 0.5269, 'grad_norm': 51.00373840332031, 'learning_rate': 1.7298989601292036e-07, 'margin_dpo/margin_mean': 28.168094635009766, 'margin_dpo/margin_std': 23.416202545166016, 'logps/chosen': -81.99353790283203, 'logps/rejected': -127.4609375, 'logps/ref_chosen': -64.7442626953125, 'logps/ref_rejected': -82.04356384277344, 'logits/chosen': -0.6353539228439331, 'logits/rejected': -0.5929083824157715, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.168094635009766, 'margin_dpo/beta_margin_mean': 2.81680965423584, 
'margin_dpo/beta_margin_std': 2.37345027923584, 'margin_dpo/beta_margin_grad_mean': -0.1786879152059555, 'margin_dpo/beta_margin_grad_std': 0.23251357674598694, 'epoch': 0.64} 64%|██████████████████████████████████████████████████ | 437/681 [31:43<10:42, 2.63s/it] 64%|██████████████████████████████████████████████████▏ | 438/681 [31:46<10:23, 2.56s/it] {'loss': 0.3695, 'grad_norm': 63.38606643676758, 'learning_rate': 1.7176998984196144e-07, 'margin_dpo/margin_mean': 34.36668395996094, 'margin_dpo/margin_std': 26.956180572509766, 'logps/chosen': -78.18193817138672, 'logps/rejected': -136.60678100585938, 'logps/ref_chosen': -59.0186653137207, 'logps/ref_rejected': -83.07682037353516, 'logits/chosen': -0.6576756238937378, 'logits/rejected': -0.5832280516624451, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.36668395996094, 'margin_dpo/beta_margin_mean': 3.4366683959960938, 'margin_dpo/beta_margin_std': 2.721860408782959, 'margin_dpo/beta_margin_grad_mean': -0.13736094534397125, 'margin_dpo/beta_margin_grad_std': 0.18339543044567108, 'epoch': 0.64} 64%|██████████████████████████████████████████████████▏ | 438/681 [31:46<10:23, 2.56s/it] 64%|██████████████████████████████████████████████████▎ | 439/681 [31:48<10:10, 2.52s/it] {'loss': 0.5261, 'grad_norm': 71.34723663330078, 'learning_rate': 1.7055214510452458e-07, 'margin_dpo/margin_mean': 26.974590301513672, 'margin_dpo/margin_std': 23.787738800048828, 'logps/chosen': -77.27565002441406, 'logps/rejected': -134.45162963867188, 'logps/ref_chosen': -53.784080505371094, 'logps/ref_rejected': -83.98545837402344, 'logits/chosen': -0.6156207323074341, 'logits/rejected': -0.5937438607215881, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.97458839416504, 'margin_dpo/beta_margin_mean': 2.6974589824676514, 'margin_dpo/beta_margin_std': 2.4870941638946533, 'margin_dpo/beta_margin_grad_mean': -0.18373528122901917, 'margin_dpo/beta_margin_grad_std': 0.2151244729757309, 'epoch': 
0.64} 64%|██████████████████████████████████████████████████▎ | 439/681 [31:48<10:10, 2.52s/it] 65%|██████████████████████████████████████████████████▍ | 440/681 [31:51<09:59, 2.49s/it] {'loss': 0.6669, 'grad_norm': 96.4582290649414, 'learning_rate': 1.6933639389195134e-07, 'margin_dpo/margin_mean': 25.880369186401367, 'margin_dpo/margin_std': 27.07331085205078, 'logps/chosen': -96.89436340332031, 'logps/rejected': -140.70578002929688, 'logps/ref_chosen': -78.56671905517578, 'logps/ref_rejected': -96.49775695800781, 'logits/chosen': -0.6607520580291748, 'logits/rejected': -0.6199520826339722, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 25.880369186401367, 'margin_dpo/beta_margin_mean': 2.5880370140075684, 'margin_dpo/beta_margin_std': 2.716387987136841, 'margin_dpo/beta_margin_grad_mean': -0.2098814845085144, 'margin_dpo/beta_margin_grad_std': 0.25330764055252075, 'epoch': 0.65} 65%|██████████████████████████████████████████████████▍ | 440/681 [31:51<09:59, 2.49s/it] 65%|██████████████████████████████████████████████████▌ | 441/681 [31:54<10:32, 2.63s/it] {'loss': 0.4379, 'grad_norm': 49.82929229736328, 'learning_rate': 1.681227682404166e-07, 'margin_dpo/margin_mean': 30.808923721313477, 'margin_dpo/margin_std': 23.68011474609375, 'logps/chosen': -80.72434997558594, 'logps/rejected': -147.17962646484375, 'logps/ref_chosen': -60.824440002441406, 'logps/ref_rejected': -96.47080993652344, 'logits/chosen': -0.5963351726531982, 'logits/rejected': -0.5610902309417725, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.808923721313477, 'margin_dpo/beta_margin_mean': 3.080892562866211, 'margin_dpo/beta_margin_std': 2.426534414291382, 'margin_dpo/beta_margin_grad_mean': -0.13756851851940155, 'margin_dpo/beta_margin_grad_std': 0.19719012081623077, 'epoch': 0.65} 65%|██████████████████████████████████████████████████▌ | 441/681 [31:54<10:32, 2.63s/it] 65%|██████████████████████████████████████████████████▋ | 442/681 
[31:56<10:31, 2.64s/it] {'loss': 0.2823, 'grad_norm': 36.576942443847656, 'learning_rate': 1.669113001300851e-07, 'margin_dpo/margin_mean': 37.83577346801758, 'margin_dpo/margin_std': 26.404239654541016, 'logps/chosen': -64.97787475585938, 'logps/rejected': -132.34170532226562, 'logps/ref_chosen': -47.01121520996094, 'logps/ref_rejected': -76.53926086425781, 'logits/chosen': -0.6140519380569458, 'logits/rejected': -0.5783543586730957, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.83577346801758, 'margin_dpo/beta_margin_mean': 3.7835774421691895, 'margin_dpo/beta_margin_std': 2.697312593460083, 'margin_dpo/beta_margin_grad_mean': -0.1093648299574852, 'margin_dpo/beta_margin_grad_std': 0.16080023348331451, 'epoch': 0.65} 65%|██████████████████████████████████████████████████▋ | 442/681 [31:56<10:31, 2.64s/it] 65%|██████████████████████████████████████████████████▋ | 443/681 [31:59<10:30, 2.65s/it] {'loss': 0.6573, 'grad_norm': 79.94059753417969, 'learning_rate': 1.6570202148426815e-07, 'margin_dpo/margin_mean': 28.54714012145996, 'margin_dpo/margin_std': 27.68130111694336, 'logps/chosen': -93.62142944335938, 'logps/rejected': -137.5754852294922, 'logps/ref_chosen': -71.27301788330078, 'logps/ref_rejected': -86.679931640625, 'logits/chosen': -0.6004323959350586, 'logits/rejected': -0.5627496242523193, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.54714012145996, 'margin_dpo/beta_margin_mean': 2.8547141551971436, 'margin_dpo/beta_margin_std': 2.7842366695404053, 'margin_dpo/beta_margin_grad_mean': -0.20045503973960876, 'margin_dpo/beta_margin_grad_std': 0.263896644115448, 'epoch': 0.65} 65%|██████████████████████████████████████████████████▋ | 443/681 [31:59<10:30, 2.65s/it] 65%|██████████████████████████████████████████████████▊ | 444/681 [32:02<10:25, 2.64s/it] {'loss': 0.4389, 'grad_norm': 47.294471740722656, 'learning_rate': 1.6449496416858282e-07, 'margin_dpo/margin_mean': 34.26472473144531, 
'margin_dpo/margin_std': 28.598800659179688, 'logps/chosen': -76.857421875, 'logps/rejected': -151.163330078125, 'logps/ref_chosen': -57.213706970214844, 'logps/ref_rejected': -97.25489044189453, 'logits/chosen': -0.5860676169395447, 'logits/rejected': -0.5605667233467102, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.26472091674805, 'margin_dpo/beta_margin_mean': 3.4264724254608154, 'margin_dpo/beta_margin_std': 2.8675122261047363, 'margin_dpo/beta_margin_grad_mean': -0.14946991205215454, 'margin_dpo/beta_margin_grad_std': 0.21209140121936798, 'epoch': 0.65} 65%|██████████████████████████████████████████████████▊ | 444/681 [32:02<10:25, 2.64s/it] 65%|██████████████████████████████████████████████████▉ | 445/681 [32:04<10:20, 2.63s/it] {'loss': 0.4624, 'grad_norm': 61.75363540649414, 'learning_rate': 1.6329015999011182e-07, 'margin_dpo/margin_mean': 31.917476654052734, 'margin_dpo/margin_std': 27.65774154663086, 'logps/chosen': -84.33077239990234, 'logps/rejected': -141.63113403320312, 'logps/ref_chosen': -67.29979705810547, 'logps/ref_rejected': -92.68267822265625, 'logits/chosen': -0.6285964250564575, 'logits/rejected': -0.5963205695152283, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.917476654052734, 'margin_dpo/beta_margin_mean': 3.1917476654052734, 'margin_dpo/beta_margin_std': 2.7972888946533203, 'margin_dpo/beta_margin_grad_mean': -0.16633237898349762, 'margin_dpo/beta_margin_grad_std': 0.21091465651988983, 'epoch': 0.65} 65%|██████████████████████████████████████████████████▉ | 445/681 [32:04<10:20, 2.63s/it] 65%|███████████████████████████████████████████████████ | 446/681 [32:07<10:22, 2.65s/it] {'loss': 0.4368, 'grad_norm': 54.28517532348633, 'learning_rate': 1.6208764069656578e-07, 'margin_dpo/margin_mean': 30.172958374023438, 'margin_dpo/margin_std': 26.31899070739746, 'logps/chosen': -76.78812408447266, 'logps/rejected': -149.1267852783203, 'logps/ref_chosen': -59.098487854003906, 
'logps/ref_rejected': -101.26419067382812, 'logits/chosen': -0.5897877216339111, 'logits/rejected': -0.568926215171814, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.172958374023438, 'margin_dpo/beta_margin_mean': 3.0172958374023438, 'margin_dpo/beta_margin_std': 2.6881942749023438, 'margin_dpo/beta_margin_grad_mean': -0.16494978964328766, 'margin_dpo/beta_margin_grad_std': 0.1965719312429428, 'epoch': 0.65} 65%|███████████████████████████████████████████████████ | 446/681 [32:07<10:22, 2.65s/it] 66%|███████████████████████████████████████████████████▏ | 447/681 [32:10<10:20, 2.65s/it] {'loss': 0.4538, 'grad_norm': 51.364315032958984, 'learning_rate': 1.608874379754465e-07, 'margin_dpo/margin_mean': 31.73101806640625, 'margin_dpo/margin_std': 28.281917572021484, 'logps/chosen': -76.43832397460938, 'logps/rejected': -150.78875732421875, 'logps/ref_chosen': -56.07533264160156, 'logps/ref_rejected': -98.69475555419922, 'logits/chosen': -0.660834014415741, 'logits/rejected': -0.6618390083312988, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.731016159057617, 'margin_dpo/beta_margin_mean': 3.1731016635894775, 'margin_dpo/beta_margin_std': 2.897716760635376, 'margin_dpo/beta_margin_grad_mean': -0.16636352241039276, 'margin_dpo/beta_margin_grad_std': 0.20971129834651947, 'epoch': 0.66} 66%|███████████████████████████████████████████████████▏ | 447/681 [32:10<10:20, 2.65s/it] 66%|███████████████████████████████████████████████████▎ | 448/681 [32:12<10:32, 2.72s/it] {'loss': 0.3892, 'grad_norm': 47.65716552734375, 'learning_rate': 1.5968958345321177e-07, 'margin_dpo/margin_mean': 32.101654052734375, 'margin_dpo/margin_std': 25.43906021118164, 'logps/chosen': -80.88053131103516, 'logps/rejected': -155.2429962158203, 'logps/ref_chosen': -60.00384521484375, 'logps/ref_rejected': -102.26465606689453, 'logits/chosen': -0.6168828010559082, 'logits/rejected': -0.600253701210022, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 32.101654052734375, 'margin_dpo/beta_margin_mean': 3.2101657390594482, 'margin_dpo/beta_margin_std': 2.5543386936187744, 'margin_dpo/beta_margin_grad_mean': -0.13886789977550507, 'margin_dpo/beta_margin_grad_std': 0.18517683446407318, 'epoch': 0.66} 66%|███████████████████████████████████████████████████▎ | 448/681 [32:12<10:32, 2.72s/it] 66%|███████████████████████████████████████████████████▍ | 449/681 [32:15<10:17, 2.66s/it] {'loss': 0.6043, 'grad_norm': 79.98429107666016, 'learning_rate': 1.584941086944423e-07, 'margin_dpo/margin_mean': 31.41507339477539, 'margin_dpo/margin_std': 30.071718215942383, 'logps/chosen': -89.62152099609375, 'logps/rejected': -142.1068878173828, 'logps/ref_chosen': -67.52661895751953, 'logps/ref_rejected': -88.59690856933594, 'logits/chosen': -0.5817546248435974, 'logits/rejected': -0.5362948179244995, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.41507339477539, 'margin_dpo/beta_margin_mean': 3.141507387161255, 'margin_dpo/beta_margin_std': 3.0401062965393066, 'margin_dpo/beta_margin_grad_mean': -0.17372801899909973, 'margin_dpo/beta_margin_grad_std': 0.23874573409557343, 'epoch': 0.66} 66%|███████████████████████████████████████████████████▍ | 449/681 [32:15<10:17, 2.66s/it] 66%|███████████████████████████████████████████████████▌ | 450/681 [32:18<10:54, 2.83s/it] {'loss': 0.3207, 'grad_norm': 44.39156723022461, 'learning_rate': 1.573010452010098e-07, 'margin_dpo/margin_mean': 34.53790283203125, 'margin_dpo/margin_std': 25.840599060058594, 'logps/chosen': -73.27051544189453, 'logps/rejected': -153.4552459716797, 'logps/ref_chosen': -57.108116149902344, 'logps/ref_rejected': -102.75494384765625, 'logits/chosen': -0.6516839265823364, 'logits/rejected': -0.6243829727172852, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.53790283203125, 'margin_dpo/beta_margin_mean': 3.4537904262542725, 'margin_dpo/beta_margin_std': 2.6313655376434326, 
'margin_dpo/beta_margin_grad_mean': -0.12840278446674347, 'margin_dpo/beta_margin_grad_std': 0.1644502729177475, 'epoch': 0.66} 66%|███████████████████████████████████████████████████▌ | 450/681 [32:18<10:54, 2.83s/it] 66%|███████████████████████████████████████████████████▋ | 451/681 [32:21<10:46, 2.81s/it] {'loss': 0.5537, 'grad_norm': 75.2901382446289, 'learning_rate': 1.5611042441124687e-07, 'margin_dpo/margin_mean': 29.465293884277344, 'margin_dpo/margin_std': 25.818279266357422, 'logps/chosen': -80.07470703125, 'logps/rejected': -124.00057983398438, 'logps/ref_chosen': -58.46883010864258, 'logps/ref_rejected': -72.92941284179688, 'logits/chosen': -0.6581634283065796, 'logits/rejected': -0.6103047132492065, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.465293884277344, 'margin_dpo/beta_margin_mean': 2.9465293884277344, 'margin_dpo/beta_margin_std': 2.6042659282684326, 'margin_dpo/beta_margin_grad_mean': -0.16929282248020172, 'margin_dpo/beta_margin_grad_std': 0.2288813591003418, 'epoch': 0.66} 66%|███████████████████████████████████████████████████▋ | 451/681 [32:21<10:46, 2.81s/it] 66%|███████████████████████████████████████████████████▊ | 452/681 [32:24<10:44, 2.81s/it] {'loss': 0.2857, 'grad_norm': 35.9453239440918, 'learning_rate': 1.549222776991186e-07, 'margin_dpo/margin_mean': 30.134784698486328, 'margin_dpo/margin_std': 21.948862075805664, 'logps/chosen': -66.35121154785156, 'logps/rejected': -143.86688232421875, 'logps/ref_chosen': -50.39055252075195, 'logps/ref_rejected': -97.77143096923828, 'logits/chosen': -0.546400785446167, 'logits/rejected': -0.5479906797409058, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.134784698486328, 'margin_dpo/beta_margin_mean': 3.0134785175323486, 'margin_dpo/beta_margin_std': 2.2046637535095215, 'margin_dpo/beta_margin_grad_mean': -0.12084120512008667, 'margin_dpo/beta_margin_grad_std': 0.13351190090179443, 'epoch': 0.66} 
66%|███████████████████████████████████████████████████▊ | 452/681 [32:24<10:44, 2.81s/it] 67%|███████████████████████████████████████████████████▉ | 453/681 [32:26<10:19, 2.72s/it] {'loss': 0.4664, 'grad_norm': 51.65986633300781, 'learning_rate': 1.5373663637339584e-07, 'margin_dpo/margin_mean': 29.085243225097656, 'margin_dpo/margin_std': 25.423097610473633, 'logps/chosen': -76.96781921386719, 'logps/rejected': -130.54562377929688, 'logps/ref_chosen': -57.71485137939453, 'logps/ref_rejected': -82.20741271972656, 'logits/chosen': -0.6441305875778198, 'logits/rejected': -0.5928350687026978, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.085243225097656, 'margin_dpo/beta_margin_mean': 2.90852427482605, 'margin_dpo/beta_margin_std': 2.558598279953003, 'margin_dpo/beta_margin_grad_mean': -0.16998730599880219, 'margin_dpo/beta_margin_grad_std': 0.2013465166091919, 'epoch': 0.67} 67%|███████████████████████████████████████████████████▉ | 453/681 [32:26<10:19, 2.72s/it] 67%|████████████████████████████████████████████████████ | 454/681 [32:29<10:13, 2.70s/it] {'loss': 0.4047, 'grad_norm': 59.35947036743164, 'learning_rate': 1.5255353167683017e-07, 'margin_dpo/margin_mean': 32.31932830810547, 'margin_dpo/margin_std': 25.902687072753906, 'logps/chosen': -81.52304077148438, 'logps/rejected': -137.84750366210938, 'logps/ref_chosen': -60.945648193359375, 'logps/ref_rejected': -84.9507827758789, 'logits/chosen': -0.6171283721923828, 'logits/rejected': -0.5738873481750488, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.31932830810547, 'margin_dpo/beta_margin_mean': 3.2319328784942627, 'margin_dpo/beta_margin_std': 2.6018524169921875, 'margin_dpo/beta_margin_grad_mean': -0.14290541410446167, 'margin_dpo/beta_margin_grad_std': 0.2013457864522934, 'epoch': 0.67} 67%|████████████████████████████████████████████████████ | 454/681 [32:29<10:13, 2.70s/it] 67%|████████████████████████████████████████████████████ | 455/681 
[32:31<10:01, 2.66s/it] {'loss': 0.3629, 'grad_norm': 93.24932861328125, 'learning_rate': 1.5137299478533064e-07, 'margin_dpo/margin_mean': 37.203800201416016, 'margin_dpo/margin_std': 26.29052734375, 'logps/chosen': -64.90336608886719, 'logps/rejected': -172.52194213867188, 'logps/ref_chosen': -44.88671112060547, 'logps/ref_rejected': -115.30147552490234, 'logits/chosen': -0.6162554621696472, 'logits/rejected': -0.5891969203948975, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.203800201416016, 'margin_dpo/beta_margin_mean': 3.7203803062438965, 'margin_dpo/beta_margin_std': 2.6521599292755127, 'margin_dpo/beta_margin_grad_mean': -0.1229911670088768, 'margin_dpo/beta_margin_grad_std': 0.19356586039066315, 'epoch': 0.67} 67%|████████████████████████████████████████████████████ | 455/681 [32:32<10:01, 2.66s/it] 67%|████████████████████████████████████████████████████▏ | 456/681 [32:34<09:57, 2.66s/it] {'loss': 0.354, 'grad_norm': 49.41230010986328, 'learning_rate': 1.5019505680714232e-07, 'margin_dpo/margin_mean': 37.51462936401367, 'margin_dpo/margin_std': 28.42435073852539, 'logps/chosen': -74.30551147460938, 'logps/rejected': -160.00119018554688, 'logps/ref_chosen': -57.036781311035156, 'logps/ref_rejected': -105.21783447265625, 'logits/chosen': -0.6331825256347656, 'logits/rejected': -0.6310149431228638, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.51462936401367, 'margin_dpo/beta_margin_mean': 3.751462936401367, 'margin_dpo/beta_margin_std': 2.8913846015930176, 'margin_dpo/beta_margin_grad_mean': -0.13223397731781006, 'margin_dpo/beta_margin_grad_std': 0.1870342195034027, 'epoch': 0.67} 67%|████████████████████████████████████████████████████▏ | 456/681 [32:34<09:57, 2.66s/it] 67%|████████████████████████████████████████████████████▎ | 457/681 [32:37<09:52, 2.65s/it] {'loss': 0.386, 'grad_norm': 59.397212982177734, 'learning_rate': 1.4901974878202627e-07, 'margin_dpo/margin_mean': 33.051368713378906, 
'margin_dpo/margin_std': 24.472869873046875, 'logps/chosen': -72.51710510253906, 'logps/rejected': -136.43548583984375, 'logps/ref_chosen': -54.24253845214844, 'logps/ref_rejected': -85.10956573486328, 'logits/chosen': -0.6320329308509827, 'logits/rejected': -0.6049121618270874, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.051368713378906, 'margin_dpo/beta_margin_mean': 3.3051366806030273, 'margin_dpo/beta_margin_std': 2.4597647190093994, 'margin_dpo/beta_margin_grad_mean': -0.13050609827041626, 'margin_dpo/beta_margin_grad_std': 0.19707661867141724, 'epoch': 0.67} 67%|████████████████████████████████████████████████████▎ | 457/681 [32:37<09:52, 2.65s/it] 67%|████████████████████████████████████████████████████▍ | 458/681 [32:39<09:37, 2.59s/it] {'loss': 0.4411, 'grad_norm': 56.77046585083008, 'learning_rate': 1.4784710168044212e-07, 'margin_dpo/margin_mean': 38.03219223022461, 'margin_dpo/margin_std': 32.41196060180664, 'logps/chosen': -74.71857452392578, 'logps/rejected': -155.025146484375, 'logps/ref_chosen': -55.40888214111328, 'logps/ref_rejected': -97.68325805664062, 'logits/chosen': -0.6297258138656616, 'logits/rejected': -0.5929204225540161, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 38.03219223022461, 'margin_dpo/beta_margin_mean': 3.8032190799713135, 'margin_dpo/beta_margin_std': 3.263706922531128, 'margin_dpo/beta_margin_grad_mean': -0.14024901390075684, 'margin_dpo/beta_margin_grad_std': 0.23043015599250793, 'epoch': 0.67} 67%|████████████████████████████████████████████████████▍ | 458/681 [32:39<09:37, 2.59s/it] 67%|████████████████████████████████████████████████████▌ | 459/681 [32:42<09:37, 2.60s/it] {'loss': 0.4592, 'grad_norm': 47.203277587890625, 'learning_rate': 1.466771464027316e-07, 'margin_dpo/margin_mean': 28.914405822753906, 'margin_dpo/margin_std': 23.49092674255371, 'logps/chosen': -67.03455352783203, 'logps/rejected': -135.55999755859375, 'logps/ref_chosen': -46.55748748779297, 
'logps/ref_rejected': -86.16854095458984, 'logits/chosen': -0.592144250869751, 'logits/rejected': -0.5651764869689941, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.914405822753906, 'margin_dpo/beta_margin_mean': 2.8914406299591064, 'margin_dpo/beta_margin_std': 2.370115280151367, 'margin_dpo/beta_margin_grad_mean': -0.16323688626289368, 'margin_dpo/beta_margin_grad_std': 0.19608436524868011, 'epoch': 0.67} 67%|████████████████████████████████████████████████████▌ | 459/681 [32:42<09:37, 2.60s/it] 68%|████████████████████████████████████████████████████▋ | 460/681 [32:45<09:49, 2.67s/it] {'loss': 0.4209, 'grad_norm': 59.67298126220703, 'learning_rate': 1.4550991377830423e-07, 'margin_dpo/margin_mean': 32.565120697021484, 'margin_dpo/margin_std': 25.7642879486084, 'logps/chosen': -70.59028625488281, 'logps/rejected': -155.63986206054688, 'logps/ref_chosen': -51.63489532470703, 'logps/ref_rejected': -104.11935424804688, 'logits/chosen': -0.5806307792663574, 'logits/rejected': -0.5847660303115845, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.565120697021484, 'margin_dpo/beta_margin_mean': 3.25651216506958, 'margin_dpo/beta_margin_std': 2.5778074264526367, 'margin_dpo/beta_margin_grad_mean': -0.15195363759994507, 'margin_dpo/beta_margin_grad_std': 0.20993934571743011, 'epoch': 0.68} 68%|████████████████████████████████████████████████████▋ | 460/681 [32:45<09:49, 2.67s/it] 68%|████████████████████████████████████████████████████▊ | 461/681 [32:47<09:50, 2.68s/it] {'loss': 0.5473, 'grad_norm': 59.93415069580078, 'learning_rate': 1.4434543456482518e-07, 'margin_dpo/margin_mean': 27.815326690673828, 'margin_dpo/margin_std': 27.13003921508789, 'logps/chosen': -79.71414184570312, 'logps/rejected': -138.8244171142578, 'logps/ref_chosen': -55.18195343017578, 'logps/ref_rejected': -86.47689819335938, 'logits/chosen': -0.5920594930648804, 'logits/rejected': -0.5768572688102722, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 27.815326690673828, 'margin_dpo/beta_margin_mean': 2.7815327644348145, 'margin_dpo/beta_margin_std': 2.8518621921539307, 'margin_dpo/beta_margin_grad_mean': -0.18966011703014374, 'margin_dpo/beta_margin_grad_std': 0.22385801374912262, 'epoch': 0.68} 68%|████████████████████████████████████████████████████▊ | 461/681 [32:47<09:50, 2.68s/it] 68%|████████████████████████████████████████████████████▉ | 462/681 [32:50<09:37, 2.64s/it] {'loss': 0.554, 'grad_norm': 64.90670776367188, 'learning_rate': 1.4318373944740484e-07, 'margin_dpo/margin_mean': 26.862995147705078, 'margin_dpo/margin_std': 25.538467407226562, 'logps/chosen': -93.2876968383789, 'logps/rejected': -129.06378173828125, 'logps/ref_chosen': -69.92803955078125, 'logps/ref_rejected': -78.84111785888672, 'logits/chosen': -0.6181149482727051, 'logits/rejected': -0.5787901878356934, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.86299705505371, 'margin_dpo/beta_margin_mean': 2.6862998008728027, 'margin_dpo/beta_margin_std': 2.5792369842529297, 'margin_dpo/beta_margin_grad_mean': -0.1927950084209442, 'margin_dpo/beta_margin_grad_std': 0.22063319385051727, 'epoch': 0.68} 68%|████████████████████████████████████████████████████▉ | 462/681 [32:50<09:37, 2.64s/it] 68%|█████████████████████████████████████████████████████ | 463/681 [32:52<09:21, 2.57s/it] {'loss': 0.3546, 'grad_norm': 50.19252014160156, 'learning_rate': 1.4202485903778976e-07, 'margin_dpo/margin_mean': 33.929237365722656, 'margin_dpo/margin_std': 23.769535064697266, 'logps/chosen': -75.74092864990234, 'logps/rejected': -143.4207763671875, 'logps/ref_chosen': -55.27437210083008, 'logps/ref_rejected': -89.02497863769531, 'logits/chosen': -0.6169182062149048, 'logits/rejected': -0.5887913703918457, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.929237365722656, 'margin_dpo/beta_margin_mean': 3.392923593521118, 'margin_dpo/beta_margin_std': 2.4047834873199463, 
'margin_dpo/beta_margin_grad_mean': -0.12193028628826141, 'margin_dpo/beta_margin_grad_std': 0.19042545557022095, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████ | 463/681 [32:52<09:21, 2.57s/it] 68%|█████████████████████████████████████████████████████▏ | 464/681 [32:55<09:02, 2.50s/it] {'loss': 0.4531, 'grad_norm': 54.16157531738281, 'learning_rate': 1.4086882387355658e-07, 'margin_dpo/margin_mean': 34.593727111816406, 'margin_dpo/margin_std': 29.88116455078125, 'logps/chosen': -73.30712890625, 'logps/rejected': -159.47793579101562, 'logps/ref_chosen': -50.91230010986328, 'logps/ref_rejected': -102.4893798828125, 'logits/chosen': -0.6251201629638672, 'logits/rejected': -0.6308864951133728, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.593727111816406, 'margin_dpo/beta_margin_mean': 3.4593725204467773, 'margin_dpo/beta_margin_std': 2.9905498027801514, 'margin_dpo/beta_margin_grad_mean': -0.14893580973148346, 'margin_dpo/beta_margin_grad_std': 0.20893022418022156, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████▏ | 464/681 [32:55<09:02, 2.50s/it] 68%|█████████████████████████████████████████████████████▎ | 465/681 [32:57<09:15, 2.57s/it] {'loss': 0.2808, 'grad_norm': 50.176815032958984, 'learning_rate': 1.3971566441730714e-07, 'margin_dpo/margin_mean': 37.622222900390625, 'margin_dpo/margin_std': 25.363601684570312, 'logps/chosen': -81.1992416381836, 'logps/rejected': -172.650634765625, 'logps/ref_chosen': -60.116851806640625, 'logps/ref_rejected': -113.94602966308594, 'logits/chosen': -0.6043756008148193, 'logits/rejected': -0.5841087102890015, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.622222900390625, 'margin_dpo/beta_margin_mean': 3.7622225284576416, 'margin_dpo/beta_margin_std': 2.543063163757324, 'margin_dpo/beta_margin_grad_mean': -0.10672765225172043, 'margin_dpo/beta_margin_grad_std': 0.16949497163295746, 'epoch': 0.68} 
68%|█████████████████████████████████████████████████████▎ | 465/681 [32:57<09:15, 2.57s/it] 68%|█████████████████████████████████████████████████████▎ | 466/681 [33:00<09:42, 2.71s/it] {'loss': 0.3827, 'grad_norm': 57.16488265991211, 'learning_rate': 1.3856541105586545e-07, 'margin_dpo/margin_mean': 34.00209045410156, 'margin_dpo/margin_std': 23.383773803710938, 'logps/chosen': -75.47810363769531, 'logps/rejected': -146.87469482421875, 'logps/ref_chosen': -52.920921325683594, 'logps/ref_rejected': -90.3154296875, 'logits/chosen': -0.6066223382949829, 'logits/rejected': -0.5759164094924927, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.00209045410156, 'margin_dpo/beta_margin_mean': 3.4002089500427246, 'margin_dpo/beta_margin_std': 2.3887243270874023, 'margin_dpo/beta_margin_grad_mean': -0.12489843368530273, 'margin_dpo/beta_margin_grad_std': 0.20490986108779907, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████▎ | 466/681 [33:00<09:42, 2.71s/it] 69%|█████████████████████████████████████████████████████▍ | 467/681 [33:03<09:36, 2.69s/it] {'loss': 0.3729, 'grad_norm': 46.38023376464844, 'learning_rate': 1.3741809409947729e-07, 'margin_dpo/margin_mean': 34.46977996826172, 'margin_dpo/margin_std': 27.862186431884766, 'logps/chosen': -101.92547607421875, 'logps/rejected': -160.5396270751953, 'logps/ref_chosen': -78.7158203125, 'logps/ref_rejected': -102.86019897460938, 'logits/chosen': -0.6104651689529419, 'logits/rejected': -0.5786043405532837, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.46977615356445, 'margin_dpo/beta_margin_mean': 3.4469778537750244, 'margin_dpo/beta_margin_std': 2.8059871196746826, 'margin_dpo/beta_margin_grad_mean': -0.1380797028541565, 'margin_dpo/beta_margin_grad_std': 0.1899155229330063, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▍ | 467/681 [33:03<09:36, 2.69s/it] 69%|█████████████████████████████████████████████████████▌ | 468/681 
[33:06<09:41, 2.73s/it] {'loss': 0.3886, 'grad_norm': 55.537410736083984, 'learning_rate': 1.362737437810114e-07, 'margin_dpo/margin_mean': 32.051353454589844, 'margin_dpo/margin_std': 26.76758575439453, 'logps/chosen': -89.64823913574219, 'logps/rejected': -152.7930450439453, 'logps/ref_chosen': -69.93536376953125, 'logps/ref_rejected': -101.02881622314453, 'logits/chosen': -0.6192047595977783, 'logits/rejected': -0.5922250747680664, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.051353454589844, 'margin_dpo/beta_margin_mean': 3.2051353454589844, 'margin_dpo/beta_margin_std': 2.6790554523468018, 'margin_dpo/beta_margin_grad_mean': -0.1464032083749771, 'margin_dpo/beta_margin_grad_std': 0.18942488729953766, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▌ | 468/681 [33:06<09:41, 2.73s/it] 69%|█████████████████████████████████████████████████████▋ | 469/681 [33:09<09:41, 2.74s/it] {'loss': 0.4299, 'grad_norm': 57.15333938598633, 'learning_rate': 1.351323902551631e-07, 'margin_dpo/margin_mean': 33.17765426635742, 'margin_dpo/margin_std': 27.252918243408203, 'logps/chosen': -91.19867706298828, 'logps/rejected': -161.0380401611328, 'logps/ref_chosen': -68.12469482421875, 'logps/ref_rejected': -104.78640747070312, 'logits/chosen': -0.6003662347793579, 'logits/rejected': -0.5658551454544067, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.17765426635742, 'margin_dpo/beta_margin_mean': 3.317765474319458, 'margin_dpo/beta_margin_std': 2.7324230670928955, 'margin_dpo/beta_margin_grad_mean': -0.14955471456050873, 'margin_dpo/beta_margin_grad_std': 0.21639080345630646, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▋ | 469/681 [33:09<09:41, 2.74s/it] 69%|█████████████████████████████████████████████████████▊ | 470/681 [33:11<09:32, 2.71s/it] {'loss': 0.2368, 'grad_norm': 41.5388298034668, 'learning_rate': 1.339940635976592e-07, 'margin_dpo/margin_mean': 38.34840393066406, 
'margin_dpo/margin_std': 23.939483642578125, 'logps/chosen': -64.00105285644531, 'logps/rejected': -141.2603759765625, 'logps/ref_chosen': -43.79193115234375, 'logps/ref_rejected': -82.70285034179688, 'logits/chosen': -0.5871062278747559, 'logits/rejected': -0.5616201162338257, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 38.3484001159668, 'margin_dpo/beta_margin_mean': 3.8348402976989746, 'margin_dpo/beta_margin_std': 2.4011597633361816, 'margin_dpo/beta_margin_grad_mean': -0.09441064298152924, 'margin_dpo/beta_margin_grad_std': 0.15031108260154724, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▊ | 470/681 [33:11<09:32, 2.71s/it] 69%|█████████████████████████████████████████████████████▉ | 471/681 [33:14<09:07, 2.61s/it] {'loss': 0.4208, 'grad_norm': 54.3143310546875, 'learning_rate': 1.3285879380446563e-07, 'margin_dpo/margin_mean': 31.3731689453125, 'margin_dpo/margin_std': 24.42245101928711, 'logps/chosen': -87.58413696289062, 'logps/rejected': -139.228271484375, 'logps/ref_chosen': -63.33952331542969, 'logps/ref_rejected': -83.61048126220703, 'logits/chosen': -0.5919795036315918, 'logits/rejected': -0.5648236870765686, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.373167037963867, 'margin_dpo/beta_margin_mean': 3.1373167037963867, 'margin_dpo/beta_margin_std': 2.457742929458618, 'margin_dpo/beta_margin_grad_mean': -0.1538044661283493, 'margin_dpo/beta_margin_grad_std': 0.2021295428276062, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▉ | 471/681 [33:14<09:07, 2.61s/it] 69%|██████████████████████████████████████████████████████ | 472/681 [33:17<09:21, 2.69s/it] {'loss': 0.3109, 'grad_norm': 50.913185119628906, 'learning_rate': 1.317266107909975e-07, 'margin_dpo/margin_mean': 40.24461364746094, 'margin_dpo/margin_std': 33.13086700439453, 'logps/chosen': -104.90176391601562, 'logps/rejected': -178.68946838378906, 'logps/ref_chosen': -83.66609954833984, 
'logps/ref_rejected': -117.20919799804688, 'logits/chosen': -0.6416307687759399, 'logits/rejected': -0.5852631330490112, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 40.2446174621582, 'margin_dpo/beta_margin_mean': 4.02446174621582, 'margin_dpo/beta_margin_std': 3.3216898441314697, 'margin_dpo/beta_margin_grad_mean': -0.11783776432275772, 'margin_dpo/beta_margin_grad_std': 0.1787194162607193, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████ | 472/681 [33:17<09:21, 2.69s/it] 69%|██████████████████████████████████████████████████████▏ | 473/681 [33:19<09:21, 2.70s/it] {'loss': 0.4899, 'grad_norm': 78.06228637695312, 'learning_rate': 1.3059754439133002e-07, 'margin_dpo/margin_mean': 28.15515899658203, 'margin_dpo/margin_std': 22.536598205566406, 'logps/chosen': -87.47222900390625, 'logps/rejected': -133.27700805664062, 'logps/ref_chosen': -63.49696731567383, 'logps/ref_rejected': -81.14657592773438, 'logits/chosen': -0.5605521202087402, 'logits/rejected': -0.5147773623466492, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.15515899658203, 'margin_dpo/beta_margin_mean': 2.8155159950256348, 'margin_dpo/beta_margin_std': 2.263782501220703, 'margin_dpo/beta_margin_grad_mean': -0.17109636962413788, 'margin_dpo/beta_margin_grad_std': 0.22186963260173798, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████▏ | 473/681 [33:19<09:21, 2.70s/it] 70%|██████████████████████████████████████████████████████▎ | 474/681 [33:22<09:20, 2.71s/it] {'loss': 0.4737, 'grad_norm': 60.656185150146484, 'learning_rate': 1.2947162435741277e-07, 'margin_dpo/margin_mean': 30.685691833496094, 'margin_dpo/margin_std': 25.531770706176758, 'logps/chosen': -76.55195617675781, 'logps/rejected': -144.70611572265625, 'logps/ref_chosen': -52.6119384765625, 'logps/ref_rejected': -90.08041381835938, 'logits/chosen': -0.5783928632736206, 'logits/rejected': -0.5659887790679932, 'margin_dpo/beta': 
0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.68568992614746, 'margin_dpo/beta_margin_mean': 3.0685691833496094, 'margin_dpo/beta_margin_std': 2.5612540245056152, 'margin_dpo/beta_margin_grad_mean': -0.16651608049869537, 'margin_dpo/beta_margin_grad_std': 0.22569791972637177, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▎ | 474/681 [33:22<09:20, 2.71s/it] 70%|██████████████████████████████████████████████████████▍ | 475/681 [33:25<09:04, 2.64s/it] {'loss': 0.3844, 'grad_norm': 43.59295654296875, 'learning_rate': 1.2834888035828596e-07, 'margin_dpo/margin_mean': 34.429412841796875, 'margin_dpo/margin_std': 30.290939331054688, 'logps/chosen': -63.40654754638672, 'logps/rejected': -145.40371704101562, 'logps/ref_chosen': -42.49519348144531, 'logps/ref_rejected': -90.06295013427734, 'logits/chosen': -0.62577223777771, 'logits/rejected': -0.6225380897521973, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.429412841796875, 'margin_dpo/beta_margin_mean': 3.442941188812256, 'margin_dpo/beta_margin_std': 3.0452959537506104, 'margin_dpo/beta_margin_grad_mean': -0.1439659297466278, 'margin_dpo/beta_margin_grad_std': 0.19135436415672302, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▍ | 475/681 [33:25<09:04, 2.64s/it] 70%|██████████████████████████████████████████████████████▌ | 476/681 [33:27<09:09, 2.68s/it] {'loss': 0.5114, 'grad_norm': 69.43315124511719, 'learning_rate': 1.2722934197929802e-07, 'margin_dpo/margin_mean': 30.353801727294922, 'margin_dpo/margin_std': 26.741519927978516, 'logps/chosen': -64.73588562011719, 'logps/rejected': -125.85054016113281, 'logps/ref_chosen': -42.949378967285156, 'logps/ref_rejected': -73.71023559570312, 'logits/chosen': -0.5941322445869446, 'logits/rejected': -0.5612877607345581, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.353801727294922, 'margin_dpo/beta_margin_mean': 3.0353803634643555, 'margin_dpo/beta_margin_std': 
2.6748228073120117, 'margin_dpo/beta_margin_grad_mean': -0.17846056818962097, 'margin_dpo/beta_margin_grad_std': 0.22891968488693237, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▌ | 476/681 [33:27<09:09, 2.68s/it] 70%|██████████████████████████████████████████████████████▋ | 477/681 [33:30<09:01, 2.65s/it] {'loss': 0.6021, 'grad_norm': 81.28004455566406, 'learning_rate': 1.2611303872132631e-07, 'margin_dpo/margin_mean': 31.857627868652344, 'margin_dpo/margin_std': 27.68490982055664, 'logps/chosen': -95.98014831542969, 'logps/rejected': -133.20254516601562, 'logps/ref_chosen': -70.77261352539062, 'logps/ref_rejected': -76.13737487792969, 'logits/chosen': -0.6341814994812012, 'logits/rejected': -0.5662086009979248, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.857629776000977, 'margin_dpo/beta_margin_mean': 3.185762882232666, 'margin_dpo/beta_margin_std': 2.7865076065063477, 'margin_dpo/beta_margin_grad_mean': -0.15644104778766632, 'margin_dpo/beta_margin_grad_std': 0.24381397664546967, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▋ | 477/681 [33:30<09:01, 2.65s/it] 70%|██████████████████████████████████████████████████████▋ | 478/681 [33:33<09:11, 2.72s/it] {'loss': 0.4001, 'grad_norm': 48.535404205322266, 'learning_rate': 1.2500000000000005e-07, 'margin_dpo/margin_mean': 34.497291564941406, 'margin_dpo/margin_std': 29.08106231689453, 'logps/chosen': -61.48149871826172, 'logps/rejected': -139.90025329589844, 'logps/ref_chosen': -41.440513610839844, 'logps/ref_rejected': -85.36196899414062, 'logits/chosen': -0.5645046234130859, 'logits/rejected': -0.5495598316192627, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.497291564941406, 'margin_dpo/beta_margin_mean': 3.4497292041778564, 'margin_dpo/beta_margin_std': 2.9467873573303223, 'margin_dpo/beta_margin_grad_mean': -0.14866864681243896, 'margin_dpo/beta_margin_grad_std': 0.18979746103286743, 'epoch': 0.7} 
70%|██████████████████████████████████████████████████████▋ | 478/681 [33:33<09:11, 2.72s/it] 70%|██████████████████████████████████████████████████████▊ | 479/681 [33:35<09:01, 2.68s/it] {'loss': 0.4427, 'grad_norm': 56.268714904785156, 'learning_rate': 1.2389025514492456e-07, 'margin_dpo/margin_mean': 30.368532180786133, 'margin_dpo/margin_std': 22.058135986328125, 'logps/chosen': -79.05259704589844, 'logps/rejected': -150.62954711914062, 'logps/ref_chosen': -53.907920837402344, 'logps/ref_rejected': -95.1163330078125, 'logits/chosen': -0.558883786201477, 'logits/rejected': -0.5508110523223877, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.368532180786133, 'margin_dpo/beta_margin_mean': 3.036853313446045, 'margin_dpo/beta_margin_std': 2.2986533641815186, 'margin_dpo/beta_margin_grad_mean': -0.15836496651172638, 'margin_dpo/beta_margin_grad_std': 0.21574333310127258, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▊ | 479/681 [33:35<09:01, 2.68s/it] 70%|██████████████████████████████████████████████████████▉ | 480/681 [33:38<08:52, 2.65s/it] {'loss': 0.5051, 'grad_norm': 73.03453826904297, 'learning_rate': 1.227838333989088e-07, 'margin_dpo/margin_mean': 36.273189544677734, 'margin_dpo/margin_std': 31.943330764770508, 'logps/chosen': -84.984619140625, 'logps/rejected': -145.50759887695312, 'logps/ref_chosen': -58.682701110839844, 'logps/ref_rejected': -82.93248748779297, 'logits/chosen': -0.5816048979759216, 'logits/rejected': -0.523268461227417, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.273189544677734, 'margin_dpo/beta_margin_mean': 3.627319097518921, 'margin_dpo/beta_margin_std': 3.305205821990967, 'margin_dpo/beta_margin_grad_mean': -0.15481433272361755, 'margin_dpo/beta_margin_grad_std': 0.23069554567337036, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▉ | 480/681 [33:38<08:52, 2.65s/it] 71%|███████████████████████████████████████████████████████ | 
481/681 [33:40<08:44, 2.62s/it] {'loss': 0.4399, 'grad_norm': 53.513370513916016, 'learning_rate': 1.2168076391719489e-07, 'margin_dpo/margin_mean': 34.74099349975586, 'margin_dpo/margin_std': 26.750259399414062, 'logps/chosen': -79.90770721435547, 'logps/rejected': -152.1048583984375, 'logps/ref_chosen': -54.964271545410156, 'logps/ref_rejected': -92.42044067382812, 'logits/chosen': -0.6167398691177368, 'logits/rejected': -0.5804057121276855, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.74099349975586, 'margin_dpo/beta_margin_mean': 3.4740993976593018, 'margin_dpo/beta_margin_std': 2.713843822479248, 'margin_dpo/beta_margin_grad_mean': -0.14174267649650574, 'margin_dpo/beta_margin_grad_std': 0.2210419625043869, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████ | 481/681 [33:40<08:44, 2.62s/it] 71%|███████████████████████████████████████████████████████▏ | 482/681 [33:43<08:49, 2.66s/it] {'loss': 0.4309, 'grad_norm': 54.49075698852539, 'learning_rate': 1.2058107576668938e-07, 'margin_dpo/margin_mean': 30.073631286621094, 'margin_dpo/margin_std': 25.875329971313477, 'logps/chosen': -89.89315795898438, 'logps/rejected': -140.00283813476562, 'logps/ref_chosen': -67.55347442626953, 'logps/ref_rejected': -87.58953857421875, 'logits/chosen': -0.5958288908004761, 'logits/rejected': -0.5650321841239929, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.073631286621094, 'margin_dpo/beta_margin_mean': 3.0073630809783936, 'margin_dpo/beta_margin_std': 2.6090502738952637, 'margin_dpo/beta_margin_grad_mean': -0.1658599078655243, 'margin_dpo/beta_margin_grad_std': 0.1872914731502533, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▏ | 482/681 [33:43<08:49, 2.66s/it] 71%|███████████████████████████████████████████████████████▎ | 483/681 [33:46<08:46, 2.66s/it] {'loss': 0.3968, 'grad_norm': 65.8294677734375, 'learning_rate': 1.194847979251979e-07, 'margin_dpo/margin_mean': 
35.49970245361328, 'margin_dpo/margin_std': 27.264251708984375, 'logps/chosen': -88.70866394042969, 'logps/rejected': -156.66552734375, 'logps/ref_chosen': -63.32981872558594, 'logps/ref_rejected': -95.78697204589844, 'logits/chosen': -0.6282751560211182, 'logits/rejected': -0.5697331428527832, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.49970245361328, 'margin_dpo/beta_margin_mean': 3.5499703884124756, 'margin_dpo/beta_margin_std': 2.7894442081451416, 'margin_dpo/beta_margin_grad_mean': -0.13324548304080963, 'margin_dpo/beta_margin_grad_std': 0.21172069013118744, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▎ | 483/681 [33:46<08:46, 2.66s/it] 71%|███████████████████████████████████████████████████████▍ | 484/681 [33:48<08:24, 2.56s/it] {'loss': 0.3678, 'grad_norm': 55.688453674316406, 'learning_rate': 1.1839195928066101e-07, 'margin_dpo/margin_mean': 35.594505310058594, 'margin_dpo/margin_std': 29.648942947387695, 'logps/chosen': -80.87345886230469, 'logps/rejected': -141.7012939453125, 'logps/ref_chosen': -59.13812255859375, 'logps/ref_rejected': -84.37144470214844, 'logits/chosen': -0.6670191287994385, 'logits/rejected': -0.6306544542312622, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.594505310058594, 'margin_dpo/beta_margin_mean': 3.559450626373291, 'margin_dpo/beta_margin_std': 3.0208792686462402, 'margin_dpo/beta_margin_grad_mean': -0.1434181034564972, 'margin_dpo/beta_margin_grad_std': 0.1813618689775467, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▍ | 484/681 [33:48<08:24, 2.56s/it] 71%|███████████████████████████████████████████████████████▌ | 485/681 [33:51<08:20, 2.55s/it] {'loss': 0.4199, 'grad_norm': 52.00883483886719, 'learning_rate': 1.1730258863039347e-07, 'margin_dpo/margin_mean': 40.278114318847656, 'margin_dpo/margin_std': 32.148040771484375, 'logps/chosen': -77.70271301269531, 'logps/rejected': -162.49534606933594, 
'logps/ref_chosen': -58.849571228027344, 'logps/ref_rejected': -103.36408996582031, 'logits/chosen': -0.5754466652870178, 'logits/rejected': -0.5443192720413208, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 40.27811813354492, 'margin_dpo/beta_margin_mean': 4.0278120040893555, 'margin_dpo/beta_margin_std': 3.2175581455230713, 'margin_dpo/beta_margin_grad_mean': -0.14156897366046906, 'margin_dpo/beta_margin_grad_std': 0.2183229923248291, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▌ | 485/681 [33:51<08:20, 2.55s/it] 71%|███████████████████████████████████████████████████████▋ | 486/681 [33:53<07:54, 2.43s/it] {'loss': 0.424, 'grad_norm': 66.21233367919922, 'learning_rate': 1.1621671468032493e-07, 'margin_dpo/margin_mean': 38.945411682128906, 'margin_dpo/margin_std': 30.78309440612793, 'logps/chosen': -77.98770904541016, 'logps/rejected': -153.8128204345703, 'logps/ref_chosen': -55.25966262817383, 'logps/ref_rejected': -92.13936614990234, 'logits/chosen': -0.6356394290924072, 'logits/rejected': -0.5828511714935303, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 38.945411682128906, 'margin_dpo/beta_margin_mean': 3.8945412635803223, 'margin_dpo/beta_margin_std': 3.0784053802490234, 'margin_dpo/beta_margin_grad_mean': -0.14394216239452362, 'margin_dpo/beta_margin_grad_std': 0.21945635974407196, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▋ | 486/681 [33:53<07:54, 2.43s/it] 72%|███████████████████████████████████████████████████████▊ | 487/681 [33:56<08:12, 2.54s/it] {'loss': 0.3256, 'grad_norm': 49.65977096557617, 'learning_rate': 1.1513436604424378e-07, 'margin_dpo/margin_mean': 37.19938278198242, 'margin_dpo/margin_std': 26.274166107177734, 'logps/chosen': -75.19181060791016, 'logps/rejected': -151.7467041015625, 'logps/ref_chosen': -53.06330871582031, 'logps/ref_rejected': -92.4188232421875, 'logits/chosen': -0.6391937732696533, 'logits/rejected': 
-0.604444682598114, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.19938278198242, 'margin_dpo/beta_margin_mean': 3.7199385166168213, 'margin_dpo/beta_margin_std': 2.6885433197021484, 'margin_dpo/beta_margin_grad_mean': -0.12563364207744598, 'margin_dpo/beta_margin_grad_std': 0.17673608660697937, 'epoch': 0.72} 72%|███████████████████████████████████████████████████████▊ | 487/681 [33:56<08:12, 2.54s/it] 72%|███████████████████████████████████████████████████████▉ | 488/681 [33:58<08:17, 2.58s/it] {'loss': 0.2845, 'grad_norm': 32.75376510620117, 'learning_rate': 1.1405557124304335e-07, 'margin_dpo/margin_mean': 32.08653259277344, 'margin_dpo/margin_std': 21.324390411376953, 'logps/chosen': -72.79434204101562, 'logps/rejected': -136.65927124023438, 'logps/ref_chosen': -52.228153228759766, 'logps/ref_rejected': -84.00656127929688, 'logits/chosen': -0.5953603386878967, 'logits/rejected': -0.563835859298706, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.08653259277344, 'margin_dpo/beta_margin_mean': 3.208653450012207, 'margin_dpo/beta_margin_std': 2.13510799407959, 'margin_dpo/beta_margin_grad_mean': -0.11705981194972992, 'margin_dpo/beta_margin_grad_std': 0.1456899493932724, 'epoch': 0.72} 72%|███████████████████████████████████████████████████████▉ | 488/681 [33:58<08:17, 2.58s/it] 72%|████████████████████████████████████████████████████████ | 489/681 [34:01<08:14, 2.58s/it] {'loss': 0.4441, 'grad_norm': 55.29383850097656, 'learning_rate': 1.1298035870396985e-07, 'margin_dpo/margin_mean': 31.778709411621094, 'margin_dpo/margin_std': 27.465885162353516, 'logps/chosen': -77.7701416015625, 'logps/rejected': -132.9573516845703, 'logps/ref_chosen': -55.989627838134766, 'logps/ref_rejected': -79.39813232421875, 'logits/chosen': -0.5945910215377808, 'logits/rejected': -0.5459895730018616, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.778711318969727, 'margin_dpo/beta_margin_mean': 
3.1778712272644043, 'margin_dpo/beta_margin_std': 2.766324520111084, 'margin_dpo/beta_margin_grad_mean': -0.1625826209783554, 'margin_dpo/beta_margin_grad_std': 0.21051008999347687, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████ | 489/681 [34:01<08:14, 2.58s/it] 72%|████████████████████████████████████████████████████████ | 490/681 [34:04<08:33, 2.69s/it] {'loss': 0.573, 'grad_norm': 67.01080322265625, 'learning_rate': 1.1190875675987355e-07, 'margin_dpo/margin_mean': 31.28500747680664, 'margin_dpo/margin_std': 29.551124572753906, 'logps/chosen': -72.7847900390625, 'logps/rejected': -162.11245727539062, 'logps/ref_chosen': -52.36639404296875, 'logps/ref_rejected': -110.40904998779297, 'logits/chosen': -0.613182783126831, 'logits/rejected': -0.6027116775512695, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.28500747680664, 'margin_dpo/beta_margin_mean': 3.1285009384155273, 'margin_dpo/beta_margin_std': 2.9745311737060547, 'margin_dpo/beta_margin_grad_mean': -0.186043843626976, 'margin_dpo/beta_margin_grad_std': 0.23473787307739258, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████ | 490/681 [34:04<08:33, 2.69s/it] 72%|████████████████████████████████████████████████████████▏ | 491/681 [34:07<08:48, 2.78s/it] {'loss': 0.5801, 'grad_norm': 71.59069061279297, 'learning_rate': 1.1084079364846241e-07, 'margin_dpo/margin_mean': 28.079666137695312, 'margin_dpo/margin_std': 27.83734893798828, 'logps/chosen': -82.98500061035156, 'logps/rejected': -124.22119140625, 'logps/ref_chosen': -60.11626434326172, 'logps/ref_rejected': -73.27278900146484, 'logits/chosen': -0.5881800651550293, 'logits/rejected': -0.5435885190963745, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.079666137695312, 'margin_dpo/beta_margin_mean': 2.807966709136963, 'margin_dpo/beta_margin_std': 2.785522699356079, 'margin_dpo/beta_margin_grad_mean': -0.1894461065530777, 
'margin_dpo/beta_margin_grad_std': 0.22930499911308289, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████▏ | 491/681 [34:07<08:48, 2.78s/it] 72%|████████████████████████████████████████████████████████▎ | 492/681 [34:10<08:58, 2.85s/it] {'loss': 0.9317, 'grad_norm': 109.9788589477539, 'learning_rate': 1.097764975115576e-07, 'margin_dpo/margin_mean': 26.099708557128906, 'margin_dpo/margin_std': 29.874317169189453, 'logps/chosen': -77.27084350585938, 'logps/rejected': -122.03599548339844, 'logps/ref_chosen': -53.99418258666992, 'logps/ref_rejected': -72.65962219238281, 'logits/chosen': -0.6198358535766602, 'logits/rejected': -0.5758175849914551, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 26.099708557128906, 'margin_dpo/beta_margin_mean': 2.609971046447754, 'margin_dpo/beta_margin_std': 3.011613368988037, 'margin_dpo/beta_margin_grad_mean': -0.2348533272743225, 'margin_dpo/beta_margin_grad_std': 0.31157541275024414, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████▎ | 492/681 [34:10<08:58, 2.85s/it] 72%|████████████████████████████████████████████████████████▍ | 493/681 [34:13<08:58, 2.86s/it] {'loss': 0.4661, 'grad_norm': 69.1717529296875, 'learning_rate': 1.0871589639435203e-07, 'margin_dpo/margin_mean': 32.91573715209961, 'margin_dpo/margin_std': 26.319190979003906, 'logps/chosen': -95.62208557128906, 'logps/rejected': -140.3636016845703, 'logps/ref_chosen': -75.49723815917969, 'logps/ref_rejected': -87.32301330566406, 'logits/chosen': -0.6741948127746582, 'logits/rejected': -0.6164962649345398, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.91573715209961, 'margin_dpo/beta_margin_mean': 3.2915735244750977, 'margin_dpo/beta_margin_std': 2.6520345211029053, 'margin_dpo/beta_margin_grad_mean': -0.15023410320281982, 'margin_dpo/beta_margin_grad_std': 0.22642172873020172, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████▍ | 493/681 
[34:13<08:58, 2.86s/it] 73%|████████████████████████████████████████████████████████▌ | 494/681 [34:15<08:39, 2.78s/it] {'loss': 0.5108, 'grad_norm': 75.6004867553711, 'learning_rate': 1.0765901824467166e-07, 'margin_dpo/margin_mean': 35.46459197998047, 'margin_dpo/margin_std': 29.650789260864258, 'logps/chosen': -63.30023956298828, 'logps/rejected': -143.49691772460938, 'logps/ref_chosen': -41.35926818847656, 'logps/ref_rejected': -86.09136962890625, 'logits/chosen': -0.5462692379951477, 'logits/rejected': -0.5368998050689697, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.46459197998047, 'margin_dpo/beta_margin_mean': 3.546459197998047, 'margin_dpo/beta_margin_std': 3.0083138942718506, 'margin_dpo/beta_margin_grad_mean': -0.1606236845254898, 'margin_dpo/beta_margin_grad_std': 0.23982644081115723, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▌ | 494/681 [34:15<08:39, 2.78s/it] 73%|████████████████████████████████████████████████████████▋ | 495/681 [34:18<08:28, 2.73s/it] {'loss': 0.5177, 'grad_norm': 67.52748107910156, 'learning_rate': 1.0660589091223854e-07, 'margin_dpo/margin_mean': 32.37921142578125, 'margin_dpo/margin_std': 27.550233840942383, 'logps/chosen': -84.92739868164062, 'logps/rejected': -145.19595336914062, 'logps/ref_chosen': -63.53507995605469, 'logps/ref_rejected': -91.42443084716797, 'logits/chosen': -0.6319386959075928, 'logits/rejected': -0.5911184549331665, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.37921142578125, 'margin_dpo/beta_margin_mean': 3.2379212379455566, 'margin_dpo/beta_margin_std': 2.7765204906463623, 'margin_dpo/beta_margin_grad_mean': -0.15922006964683533, 'margin_dpo/beta_margin_grad_std': 0.22867868840694427, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▋ | 495/681 [34:18<08:28, 2.73s/it] 73%|████████████████████████████████████████████████████████▊ | 496/681 [34:21<08:22, 2.72s/it] {'loss': 0.5292, 
'grad_norm': 64.98089599609375, 'learning_rate': 1.0555654214793722e-07, 'margin_dpo/margin_mean': 28.696502685546875, 'margin_dpo/margin_std': 25.838706970214844, 'logps/chosen': -96.5443115234375, 'logps/rejected': -136.9782257080078, 'logps/ref_chosen': -72.59192657470703, 'logps/ref_rejected': -84.32933807373047, 'logits/chosen': -0.6621850728988647, 'logits/rejected': -0.6073780655860901, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.696502685546875, 'margin_dpo/beta_margin_mean': 2.869650363922119, 'margin_dpo/beta_margin_std': 2.5869693756103516, 'margin_dpo/beta_margin_grad_mean': -0.17409659922122955, 'margin_dpo/beta_margin_grad_std': 0.22257588803768158, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▊ | 496/681 [34:21<08:22, 2.72s/it] 73%|████████████████████████████████████████████████████████▉ | 497/681 [34:23<08:16, 2.70s/it] {'loss': 0.613, 'grad_norm': 77.44481658935547, 'learning_rate': 1.0451099960308374e-07, 'margin_dpo/margin_mean': 28.408655166625977, 'margin_dpo/margin_std': 27.508472442626953, 'logps/chosen': -83.82826232910156, 'logps/rejected': -129.9313201904297, 'logps/ref_chosen': -58.593971252441406, 'logps/ref_rejected': -76.28836822509766, 'logits/chosen': -0.6251211166381836, 'logits/rejected': -0.5778101682662964, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.408655166625977, 'margin_dpo/beta_margin_mean': 2.8408656120300293, 'margin_dpo/beta_margin_std': 2.808185577392578, 'margin_dpo/beta_margin_grad_mean': -0.20568180084228516, 'margin_dpo/beta_margin_grad_std': 0.23945844173431396, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▉ | 497/681 [34:23<08:16, 2.70s/it] 73%|█████████████████████████████████████████████████████████ | 498/681 [34:26<08:18, 2.72s/it] {'loss': 0.5312, 'grad_norm': 67.77000427246094, 'learning_rate': 1.0346929082869641e-07, 'margin_dpo/margin_mean': 30.984760284423828, 'margin_dpo/margin_std': 
27.886219024658203, 'logps/chosen': -95.3200912475586, 'logps/rejected': -139.05723571777344, 'logps/ref_chosen': -71.20565795898438, 'logps/ref_rejected': -83.95803833007812, 'logits/chosen': -0.6193152666091919, 'logits/rejected': -0.5879042148590088, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.984760284423828, 'margin_dpo/beta_margin_mean': 3.098475933074951, 'margin_dpo/beta_margin_std': 2.798676013946533, 'margin_dpo/beta_margin_grad_mean': -0.17250090837478638, 'margin_dpo/beta_margin_grad_std': 0.23728637397289276, 'epoch': 0.73} 73%|█████████████████████████████████████████████████████████ | 498/681 [34:26<08:18, 2.72s/it] 73%|█████████████████████████████████████████████████████████▏ | 499/681 [34:29<08:06, 2.67s/it] {'loss': 0.6873, 'grad_norm': 82.45706939697266, 'learning_rate': 1.0243144327477013e-07, 'margin_dpo/margin_mean': 31.553926467895508, 'margin_dpo/margin_std': 30.37271499633789, 'logps/chosen': -74.16600036621094, 'logps/rejected': -155.54342651367188, 'logps/ref_chosen': -51.25519561767578, 'logps/ref_rejected': -101.07870483398438, 'logits/chosen': -0.6282952427864075, 'logits/rejected': -0.6203751564025879, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.553926467895508, 'margin_dpo/beta_margin_mean': 3.155392646789551, 'margin_dpo/beta_margin_std': 3.0621047019958496, 'margin_dpo/beta_margin_grad_mean': -0.179952934384346, 'margin_dpo/beta_margin_grad_std': 0.2602365016937256, 'epoch': 0.73} 73%|█████████████████████████████████████████████████████████▏ | 499/681 [34:29<08:06, 2.67s/it] 73%|█████████████████████████████████████████████████████████▎ | 500/681 [34:31<07:52, 2.61s/it] {'loss': 0.3799, 'grad_norm': 45.40144729614258, 'learning_rate': 1.0139748428955333e-07, 'margin_dpo/margin_mean': 33.82952117919922, 'margin_dpo/margin_std': 29.12575340270996, 'logps/chosen': -82.48162841796875, 'logps/rejected': -153.21792602539062, 'logps/ref_chosen': -57.027442932128906, 
'logps/ref_rejected': -93.93421173095703, 'logits/chosen': -0.6029895544052124, 'logits/rejected': -0.5873157382011414, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.82952117919922, 'margin_dpo/beta_margin_mean': 3.3829522132873535, 'margin_dpo/beta_margin_std': 2.9376726150512695, 'margin_dpo/beta_margin_grad_mean': -0.13710999488830566, 'margin_dpo/beta_margin_grad_std': 0.19010357558727264, 'epoch': 0.73}
73%|█████████████████████████████████████████████████████████▎ | 500/681 [34:31<07:52, 2.61s/it]
[INFO|trainer.py:4307] 2026-04-17 22:01:02,011 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-17 22:01:02,011 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-17 22:01:02,011 >> Batch size = 8
0%| | 0/73 [00:00> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-17 22:06:11,319 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-17 22:06:11,319 >> Batch size = 8
0%| | 0/73 [00:00> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-600
[INFO|configuration_utils.py:419] 2026-04-17 22:07:07,760 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-600/config.json
[INFO|configuration_utils.py:911] 2026-04-17 22:07:07,777 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-600/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-17 22:08:05,733 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-600/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-17 22:08:05,740 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-600/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-17 22:08:05,745 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-600/special_tokens_map.json
[INFO|trainer.py:4083] 2026-04-17 22:11:52,979 >> Deleting older checkpoint [/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-200] due to args.save_total_limit
88%|██████████████████████████████████████████████████████████████████▏ | 601/681 [45:27<2:21:06, 105.84s/it]
{'loss': 0.5251, 'grad_norm': 82.41748046875, 'learning_rate': 2.1301532877994742e-08, 'margin_dpo/margin_mean': 34.12665939331055, 'margin_dpo/margin_std': 28.886859893798828, 'logps/chosen': -85.03898620605469, 'logps/rejected': -154.72296142578125, 'logps/ref_chosen': -59.13360595703125, 'logps/ref_rejected': -94.69093322753906, 'logits/chosen': -0.6234362125396729, 'logits/rejected': -0.595230758190155, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.12665939331055, 'margin_dpo/beta_margin_mean': 3.412665843963623, 'margin_dpo/beta_margin_std': 2.932499885559082, 'margin_dpo/beta_margin_grad_mean': -0.15890392661094666, 'margin_dpo/beta_margin_grad_std': 0.24525830149650574, 'epoch': 0.88}
88%|██████████████████████████████████████████████████████████████████▏ | 601/681 [45:27<2:21:06, 105.84s/it]
88%|███████████████████████████████████████████████████████████████████▏ | 602/681 [45:29<1:38:31, 74.84s/it]
{'loss': 0.3531, 'grad_norm': 68.01284790039062, 'learning_rate': 2.0786184285784298e-08, 'margin_dpo/margin_mean': 37.70860290527344, 'margin_dpo/margin_std': 27.594802856445312, 'logps/chosen': 
-66.78749084472656, 'logps/rejected': -143.57113647460938, 'logps/ref_chosen': -48.59352111816406, 'logps/ref_rejected': -87.6685562133789, 'logits/chosen': -0.6261130571365356, 'logits/rejected': -0.626197338104248, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.70860290527344, 'margin_dpo/beta_margin_mean': 3.770860195159912, 'margin_dpo/beta_margin_std': 2.8099164962768555, 'margin_dpo/beta_margin_grad_mean': -0.12587696313858032, 'margin_dpo/beta_margin_grad_std': 0.19796113669872284, 'epoch': 0.88} 88%|███████████████████████████████████████████████████████████████████▏ | 602/681 [45:29<1:38:31, 74.84s/it] 89%|███████████████████████████████████████████████████████████████████▎ | 603/681 [45:32<1:09:03, 53.12s/it] {'loss': 0.4719, 'grad_norm': 65.49505615234375, 'learning_rate': 2.0276875690788204e-08, 'margin_dpo/margin_mean': 32.04313659667969, 'margin_dpo/margin_std': 26.230998992919922, 'logps/chosen': -90.7020263671875, 'logps/rejected': -152.65615844726562, 'logps/ref_chosen': -70.41461944580078, 'logps/ref_rejected': -100.32560729980469, 'logits/chosen': -0.637772262096405, 'logits/rejected': -0.5984662175178528, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.04313659667969, 'margin_dpo/beta_margin_mean': 3.2043137550354004, 'margin_dpo/beta_margin_std': 2.628369092941284, 'margin_dpo/beta_margin_grad_mean': -0.1628403216600418, 'margin_dpo/beta_margin_grad_std': 0.2238980382680893, 'epoch': 0.89} 89%|███████████████████████████████████████████████████████████████████▎ | 603/681 [45:32<1:09:03, 53.12s/it] 89%|█████████████████████████████████████████████████████████████████████▏ | 604/681 [45:34<48:44, 37.98s/it] {'loss': 0.4607, 'grad_norm': 64.45735931396484, 'learning_rate': 1.977362051376158e-08, 'margin_dpo/margin_mean': 35.393802642822266, 'margin_dpo/margin_std': 29.511133193969727, 'logps/chosen': -65.24546813964844, 'logps/rejected': -146.03567504882812, 'logps/ref_chosen': 
-46.45808029174805, 'logps/ref_rejected': -91.8544921875, 'logits/chosen': -0.5643373727798462, 'logits/rejected': -0.5535662770271301, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.393802642822266, 'margin_dpo/beta_margin_mean': 3.5393803119659424, 'margin_dpo/beta_margin_std': 2.988723039627075, 'margin_dpo/beta_margin_grad_mean': -0.14535552263259888, 'margin_dpo/beta_margin_grad_std': 0.2219676375389099, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▏ | 604/681 [45:34<48:44, 37.98s/it] 89%|█████████████████████████████████████████████████████████████████████▎ | 605/681 [45:37<34:40, 27.38s/it] {'loss': 0.4569, 'grad_norm': 62.30309295654297, 'learning_rate': 1.9276432015946446e-08, 'margin_dpo/margin_mean': 31.68695831298828, 'margin_dpo/margin_std': 29.331512451171875, 'logps/chosen': -90.84162139892578, 'logps/rejected': -158.584228515625, 'logps/ref_chosen': -66.24933624267578, 'logps/ref_rejected': -102.30496978759766, 'logits/chosen': -0.6186962127685547, 'logits/rejected': -0.603484034538269, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.686960220336914, 'margin_dpo/beta_margin_mean': 3.168696165084839, 'margin_dpo/beta_margin_std': 3.0381813049316406, 'margin_dpo/beta_margin_grad_mean': -0.1487942636013031, 'margin_dpo/beta_margin_grad_std': 0.1991681158542633, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▎ | 605/681 [45:37<34:40, 27.38s/it] 89%|█████████████████████████████████████████████████████████████████████▍ | 606/681 [45:39<24:51, 19.89s/it] {'loss': 0.2954, 'grad_norm': 44.491546630859375, 'learning_rate': 1.8785323298722093e-08, 'margin_dpo/margin_mean': 36.6777458190918, 'margin_dpo/margin_std': 25.4649658203125, 'logps/chosen': -76.80615234375, 'logps/rejected': -157.0362548828125, 'logps/ref_chosen': -54.819122314453125, 'logps/ref_rejected': -98.37147521972656, 'logits/chosen': -0.5961008071899414, 
'logits/rejected': -0.564789354801178, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.67774963378906, 'margin_dpo/beta_margin_mean': 3.6677749156951904, 'margin_dpo/beta_margin_std': 2.649290084838867, 'margin_dpo/beta_margin_grad_mean': -0.12030400335788727, 'margin_dpo/beta_margin_grad_std': 0.1559758484363556, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▍ | 606/681 [45:39<24:51, 19.89s/it] 89%|█████████████████████████████████████████████████████████████████████▌ | 607/681 [45:42<18:14, 14.79s/it] {'loss': 0.3387, 'grad_norm': 50.320865631103516, 'learning_rate': 1.8300307303259904e-08, 'margin_dpo/margin_mean': 32.44065856933594, 'margin_dpo/margin_std': 23.566665649414062, 'logps/chosen': -79.11578369140625, 'logps/rejected': -133.24951171875, 'logps/ref_chosen': -58.08403778076172, 'logps/ref_rejected': -79.777099609375, 'logits/chosen': -0.5963802337646484, 'logits/rejected': -0.5623406171798706, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.44065856933594, 'margin_dpo/beta_margin_mean': 3.244065999984741, 'margin_dpo/beta_margin_std': 2.3698348999023438, 'margin_dpo/beta_margin_grad_mean': -0.13021884858608246, 'margin_dpo/beta_margin_grad_std': 0.17474402487277985, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▌ | 607/681 [45:42<18:14, 14.79s/it] 89%|█████████████████████████████████████████████████████████████████████▋ | 608/681 [45:45<13:28, 11.07s/it] {'loss': 0.4835, 'grad_norm': 58.68054962158203, 'learning_rate': 1.7821396810182437e-08, 'margin_dpo/margin_mean': 33.237815856933594, 'margin_dpo/margin_std': 26.23067855834961, 'logps/chosen': -78.17123413085938, 'logps/rejected': -148.73159790039062, 'logps/ref_chosen': -57.450836181640625, 'logps/ref_rejected': -94.77339172363281, 'logits/chosen': -0.6197670698165894, 'logits/rejected': -0.5876868963241577, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 33.237815856933594, 'margin_dpo/beta_margin_mean': 3.323781728744507, 'margin_dpo/beta_margin_std': 2.665648937225342, 'margin_dpo/beta_margin_grad_mean': -0.15202751755714417, 'margin_dpo/beta_margin_grad_std': 0.22783887386322021, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▋ | 608/681 [45:45<13:28, 11.07s/it] 89%|█████████████████████████████████████████████████████████████████████▊ | 609/681 [45:47<10:07, 8.44s/it] {'loss': 0.3479, 'grad_norm': 64.48681640625, 'learning_rate': 1.7348604439226617e-08, 'margin_dpo/margin_mean': 33.63862228393555, 'margin_dpo/margin_std': 23.823345184326172, 'logps/chosen': -81.96639251708984, 'logps/rejected': -145.61566162109375, 'logps/ref_chosen': -58.805355072021484, 'logps/ref_rejected': -88.81600952148438, 'logits/chosen': -0.642276406288147, 'logits/rejected': -0.6062139272689819, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.63862228393555, 'margin_dpo/beta_margin_mean': 3.3638622760772705, 'margin_dpo/beta_margin_std': 2.390188694000244, 'margin_dpo/beta_margin_grad_mean': -0.12766654789447784, 'margin_dpo/beta_margin_grad_std': 0.18546564877033234, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▊ | 609/681 [45:47<10:07, 8.44s/it] 90%|█████████████████████████████████████████████████████████████████████▊ | 610/681 [45:49<07:50, 6.63s/it] {'loss': 0.4533, 'grad_norm': 74.75220489501953, 'learning_rate': 1.6881942648911074e-08, 'margin_dpo/margin_mean': 32.680015563964844, 'margin_dpo/margin_std': 24.990657806396484, 'logps/chosen': -90.24623107910156, 'logps/rejected': -140.6365966796875, 'logps/ref_chosen': -65.69503784179688, 'logps/ref_rejected': -83.4053955078125, 'logits/chosen': -0.6093329191207886, 'logits/rejected': -0.5498037934303284, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.680015563964844, 'margin_dpo/beta_margin_mean': 
3.2680017948150635, 'margin_dpo/beta_margin_std': 2.5850718021392822, 'margin_dpo/beta_margin_grad_mean': -0.15364539623260498, 'margin_dpo/beta_margin_grad_std': 0.21927115321159363, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████▊ | 610/681 [45:49<07:50, 6.63s/it] 90%|█████████████████████████████████████████████████████████████████████▉ | 611/681 [45:52<06:16, 5.38s/it] {'loss': 0.3916, 'grad_norm': 52.49784851074219, 'learning_rate': 1.6421423736208e-08, 'margin_dpo/margin_mean': 35.877716064453125, 'margin_dpo/margin_std': 27.9959774017334, 'logps/chosen': -74.63272094726562, 'logps/rejected': -144.24195861816406, 'logps/ref_chosen': -52.59947204589844, 'logps/ref_rejected': -86.33099365234375, 'logits/chosen': -0.6183408498764038, 'logits/rejected': -0.580921471118927, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.877716064453125, 'margin_dpo/beta_margin_mean': 3.587771415710449, 'margin_dpo/beta_margin_std': 2.8050875663757324, 'margin_dpo/beta_margin_grad_mean': -0.1446484923362732, 'margin_dpo/beta_margin_grad_std': 0.1988787204027176, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████▉ | 611/681 [45:52<06:16, 5.38s/it] 90%|██████████████████████████████████████████████████████████████████████ | 612/681 [45:55<05:15, 4.57s/it] {'loss': 0.2722, 'grad_norm': 44.11368179321289, 'learning_rate': 1.5967059836219042e-08, 'margin_dpo/margin_mean': 40.94614028930664, 'margin_dpo/margin_std': 27.57101058959961, 'logps/chosen': -80.20808410644531, 'logps/rejected': -150.14288330078125, 'logps/ref_chosen': -59.32372283935547, 'logps/ref_rejected': -88.31239318847656, 'logits/chosen': -0.6275376081466675, 'logits/rejected': -0.5670713782310486, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 40.94614028930664, 'margin_dpo/beta_margin_mean': 4.094614028930664, 'margin_dpo/beta_margin_std': 2.759153127670288, 'margin_dpo/beta_margin_grad_mean': 
-0.10046197474002838, 'margin_dpo/beta_margin_grad_std': 0.16786958277225494, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████ | 612/681 [45:55<05:15, 4.57s/it] 90%|██████████████████████████████████████████████████████████████████████▏ | 613/681 [45:57<04:29, 3.97s/it] {'loss': 0.3659, 'grad_norm': 51.012901306152344, 'learning_rate': 1.551886292185553e-08, 'margin_dpo/margin_mean': 35.77855682373047, 'margin_dpo/margin_std': 27.143339157104492, 'logps/chosen': -80.63017272949219, 'logps/rejected': -161.78628540039062, 'logps/ref_chosen': -59.72996520996094, 'logps/ref_rejected': -105.10753631591797, 'logits/chosen': -0.6327238082885742, 'logits/rejected': -0.6284672021865845, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.77855682373047, 'margin_dpo/beta_margin_mean': 3.5778555870056152, 'margin_dpo/beta_margin_std': 2.7456395626068115, 'margin_dpo/beta_margin_grad_mean': -0.13130351901054382, 'margin_dpo/beta_margin_grad_std': 0.2031707614660263, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▏ | 613/681 [45:57<04:29, 3.97s/it] 90%|██████████████████████████████████████████████████████████████████████▎ | 614/681 [46:00<03:59, 3.57s/it] {'loss': 0.3031, 'grad_norm': 48.471588134765625, 'learning_rate': 1.507684480352292e-08, 'margin_dpo/margin_mean': 35.7594108581543, 'margin_dpo/margin_std': 25.549396514892578, 'logps/chosen': -76.522705078125, 'logps/rejected': -164.02252197265625, 'logps/ref_chosen': -52.93898010253906, 'logps/ref_rejected': -104.67938232421875, 'logits/chosen': -0.5755459070205688, 'logits/rejected': -0.5698869824409485, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.7594108581543, 'margin_dpo/beta_margin_mean': 3.5759410858154297, 'margin_dpo/beta_margin_std': 2.556734800338745, 'margin_dpo/beta_margin_grad_mean': -0.11619433760643005, 'margin_dpo/beta_margin_grad_std': 0.1717434674501419, 'epoch': 0.9} 
90%|██████████████████████████████████████████████████████████████████████▎ | 614/681 [46:00<03:59, 3.57s/it] 90%|██████████████████████████████████████████████████████████████████████▍ | 615/681 [46:02<03:36, 3.29s/it] {'loss': 0.4035, 'grad_norm': 41.79697799682617, 'learning_rate': 1.4641017128809801e-08, 'margin_dpo/margin_mean': 30.23415184020996, 'margin_dpo/margin_std': 22.966224670410156, 'logps/chosen': -86.97941589355469, 'logps/rejected': -146.57379150390625, 'logps/ref_chosen': -65.81727600097656, 'logps/ref_rejected': -95.17749786376953, 'logits/chosen': -0.5839822292327881, 'logits/rejected': -0.5518302917480469, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.234153747558594, 'margin_dpo/beta_margin_mean': 3.0234153270721436, 'margin_dpo/beta_margin_std': 2.3445184230804443, 'margin_dpo/beta_margin_grad_mean': -0.15672753751277924, 'margin_dpo/beta_margin_grad_std': 0.1847524344921112, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▍ | 615/681 [46:02<03:36, 3.29s/it] 90%|██████████████████████████████████████████████████████████████████████▌ | 616/681 [46:05<03:20, 3.09s/it] {'loss': 0.5016, 'grad_norm': 72.02803039550781, 'learning_rate': 1.4211391382180637e-08, 'margin_dpo/margin_mean': 32.59221649169922, 'margin_dpo/margin_std': 29.501014709472656, 'logps/chosen': -88.5474853515625, 'logps/rejected': -130.7073516845703, 'logps/ref_chosen': -65.13285827636719, 'logps/ref_rejected': -74.70050048828125, 'logits/chosen': -0.613810122013092, 'logits/rejected': -0.5613222122192383, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.59221649169922, 'margin_dpo/beta_margin_mean': 3.2592217922210693, 'margin_dpo/beta_margin_std': 3.033946990966797, 'margin_dpo/beta_margin_grad_mean': -0.16017135977745056, 'margin_dpo/beta_margin_grad_std': 0.230714350938797, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▌ | 616/681 
[46:05<03:20, 3.09s/it] 91%|██████████████████████████████████████████████████████████████████████▋ | 617/681 [46:08<03:10, 2.97s/it] {'loss': 0.3826, 'grad_norm': 54.11730194091797, 'learning_rate': 1.378797888467345e-08, 'margin_dpo/margin_mean': 29.974029541015625, 'margin_dpo/margin_std': 23.434463500976562, 'logps/chosen': -87.65239715576172, 'logps/rejected': -118.85501098632812, 'logps/ref_chosen': -63.005550384521484, 'logps/ref_rejected': -64.234130859375, 'logits/chosen': -0.5736366510391235, 'logits/rejected': -0.5296716094017029, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.974029541015625, 'margin_dpo/beta_margin_mean': 2.9974029064178467, 'margin_dpo/beta_margin_std': 2.35481333732605, 'margin_dpo/beta_margin_grad_mean': -0.14828212559223175, 'margin_dpo/beta_margin_grad_std': 0.17800341546535492, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████▋ | 617/681 [46:08<03:10, 2.97s/it] 91%|██████████████████████████████████████████████████████████████████████▊ | 618/681 [46:10<03:03, 2.92s/it] {'loss': 0.4624, 'grad_norm': 67.99271392822266, 'learning_rate': 1.3370790793601371e-08, 'margin_dpo/margin_mean': 30.859840393066406, 'margin_dpo/margin_std': 26.370765686035156, 'logps/chosen': -90.81625366210938, 'logps/rejected': -146.72813415527344, 'logps/ref_chosen': -67.10135650634766, 'logps/ref_rejected': -92.15339660644531, 'logits/chosen': -0.6390504837036133, 'logits/rejected': -0.610953152179718, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.859838485717773, 'margin_dpo/beta_margin_mean': 3.085983991622925, 'margin_dpo/beta_margin_std': 2.6704392433166504, 'margin_dpo/beta_margin_grad_mean': -0.16537515819072723, 'margin_dpo/beta_margin_grad_std': 0.21399806439876556, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████▊ | 618/681 [46:10<03:03, 2.92s/it] 
91%|██████████████████████████████████████████████████████████████████████▉ | 619/681 [46:13<02:55, 2.82s/it] {'loss': 0.4702, 'grad_norm': 55.240272521972656, 'learning_rate': 1.2959838102258535e-08, 'margin_dpo/margin_mean': 32.92374038696289, 'margin_dpo/margin_std': 29.776756286621094, 'logps/chosen': -79.01873779296875, 'logps/rejected': -149.14964294433594, 'logps/ref_chosen': -55.978233337402344, 'logps/ref_rejected': -93.1854019165039, 'logits/chosen': -0.5689994096755981, 'logits/rejected': -0.5356103777885437, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.92374038696289, 'margin_dpo/beta_margin_mean': 3.2923738956451416, 'margin_dpo/beta_margin_std': 3.01572847366333, 'margin_dpo/beta_margin_grad_mean': -0.16712483763694763, 'margin_dpo/beta_margin_grad_std': 0.21881355345249176, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████▉ | 619/681 [46:13<02:55, 2.82s/it] 91%|███████████████████████████████████████████████████████████████████████ | 620/681 [46:16<02:46, 2.73s/it] {'loss': 0.2579, 'grad_norm': 34.842933654785156, 'learning_rate': 1.2555131639630567e-08, 'margin_dpo/margin_mean': 35.61638259887695, 'margin_dpo/margin_std': 25.934829711914062, 'logps/chosen': -79.86566162109375, 'logps/rejected': -134.0952911376953, 'logps/ref_chosen': -59.79750061035156, 'logps/ref_rejected': -78.41075134277344, 'logits/chosen': -0.6340548396110535, 'logits/rejected': -0.5965070724487305, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.61638259887695, 'margin_dpo/beta_margin_mean': 3.561638355255127, 'margin_dpo/beta_margin_std': 2.65966534614563, 'margin_dpo/beta_margin_grad_mean': -0.10915657132863998, 'margin_dpo/beta_margin_grad_std': 0.13496600091457367, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████ | 620/681 [46:16<02:46, 2.73s/it] 91%|███████████████████████████████████████████████████████████████████████▏ | 621/681 
[46:18<02:46, 2.77s/it] {'loss': 0.3092, 'grad_norm': 40.03609848022461, 'learning_rate': 1.2156682070109086e-08, 'margin_dpo/margin_mean': 36.26169967651367, 'margin_dpo/margin_std': 26.822023391723633, 'logps/chosen': -72.59913635253906, 'logps/rejected': -143.29660034179688, 'logps/ref_chosen': -53.933753967285156, 'logps/ref_rejected': -88.36952209472656, 'logits/chosen': -0.6073925495147705, 'logits/rejected': -0.5800847411155701, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.26169967651367, 'margin_dpo/beta_margin_mean': 3.6261699199676514, 'margin_dpo/beta_margin_std': 2.691709041595459, 'margin_dpo/beta_margin_grad_mean': -0.10899462550878525, 'margin_dpo/beta_margin_grad_std': 0.1772289127111435, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████▏ | 621/681 [46:18<02:46, 2.77s/it] 91%|███████████████████████████████████████████████████████████████████████▏ | 622/681 [46:21<02:43, 2.76s/it] {'loss': 0.3854, 'grad_norm': 48.39630889892578, 'learning_rate': 1.1764499893210878e-08, 'margin_dpo/margin_mean': 36.819705963134766, 'margin_dpo/margin_std': 28.56911277770996, 'logps/chosen': -82.65914916992188, 'logps/rejected': -144.71177673339844, 'logps/ref_chosen': -60.28582000732422, 'logps/ref_rejected': -85.51873779296875, 'logits/chosen': -0.5869364142417908, 'logits/rejected': -0.5320132970809937, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.819705963134766, 'margin_dpo/beta_margin_mean': 3.6819705963134766, 'margin_dpo/beta_margin_std': 2.8990466594696045, 'margin_dpo/beta_margin_grad_mean': -0.1378306895494461, 'margin_dpo/beta_margin_grad_std': 0.1983867883682251, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████▏ | 622/681 [46:21<02:43, 2.76s/it] 91%|███████████████████████████████████████████████████████████████████████▎ | 623/681 [46:23<02:29, 2.58s/it] {'loss': 0.5541, 'grad_norm': 73.08351135253906, 
'learning_rate': 1.1378595443300998e-08, 'margin_dpo/margin_mean': 30.37273406982422, 'margin_dpo/margin_std': 28.88761329650879, 'logps/chosen': -88.72175598144531, 'logps/rejected': -140.02056884765625, 'logps/ref_chosen': -64.15696716308594, 'logps/ref_rejected': -85.08304595947266, 'logits/chosen': -0.6507315635681152, 'logits/rejected': -0.6161798238754272, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.37273597717285, 'margin_dpo/beta_margin_mean': 3.037273645401001, 'margin_dpo/beta_margin_std': 2.930415630340576, 'margin_dpo/beta_margin_grad_mean': -0.18612176179885864, 'margin_dpo/beta_margin_grad_std': 0.23363880813121796, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████▎ | 623/681 [46:23<02:29, 2.58s/it] 92%|███████████████████████████████████████████████████████████████████████▍ | 624/681 [46:26<02:33, 2.69s/it] {'loss': 0.4965, 'grad_norm': 71.18040466308594, 'learning_rate': 1.0998978889320582e-08, 'margin_dpo/margin_mean': 37.318939208984375, 'margin_dpo/margin_std': 27.622631072998047, 'logps/chosen': -94.83811950683594, 'logps/rejected': -157.37045288085938, 'logps/ref_chosen': -71.91862487792969, 'logps/ref_rejected': -97.13203430175781, 'logits/chosen': -0.6796859502792358, 'logits/rejected': -0.6095322966575623, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.31894302368164, 'margin_dpo/beta_margin_mean': 3.7318942546844482, 'margin_dpo/beta_margin_std': 2.772662878036499, 'margin_dpo/beta_margin_grad_mean': -0.14844343066215515, 'margin_dpo/beta_margin_grad_std': 0.25005990266799927, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████▍ | 624/681 [46:26<02:33, 2.69s/it] 92%|███████████████████████████████████████████████████████████████████████▌ | 625/681 [46:29<02:33, 2.74s/it] {'loss': 0.3591, 'grad_norm': 49.115333557128906, 'learning_rate': 1.0625660234518913e-08, 'margin_dpo/margin_mean': 35.52130889892578, 
'margin_dpo/margin_std': 28.529512405395508, 'logps/chosen': -81.66363525390625, 'logps/rejected': -144.93325805664062, 'logps/ref_chosen': -58.342071533203125, 'logps/ref_rejected': -86.09038543701172, 'logits/chosen': -0.59247887134552, 'logits/rejected': -0.5529348850250244, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.52130889892578, 'margin_dpo/beta_margin_mean': 3.552130937576294, 'margin_dpo/beta_margin_std': 2.8531718254089355, 'margin_dpo/beta_margin_grad_mean': -0.13468137383460999, 'margin_dpo/beta_margin_grad_std': 0.18700142204761505, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████▌ | 625/681 [46:29<02:33, 2.74s/it] 92%|███████████████████████████████████████████████████████████████████████▋ | 626/681 [46:32<02:30, 2.74s/it] {'loss': 0.5253, 'grad_norm': 63.880088806152344, 'learning_rate': 1.0258649316189721e-08, 'margin_dpo/margin_mean': 30.09228515625, 'margin_dpo/margin_std': 28.44098472595215, 'logps/chosen': -98.9983139038086, 'logps/rejected': -153.16671752929688, 'logps/ref_chosen': -75.11260986328125, 'logps/ref_rejected': -99.18872833251953, 'logits/chosen': -0.5680443644523621, 'logits/rejected': -0.5336043834686279, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.09228515625, 'margin_dpo/beta_margin_mean': 3.009228467941284, 'margin_dpo/beta_margin_std': 2.89654541015625, 'margin_dpo/beta_margin_grad_mean': -0.18730950355529785, 'margin_dpo/beta_margin_grad_std': 0.21887990832328796, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████▋ | 626/681 [46:32<02:30, 2.74s/it] 92%|███████████████████████████████████████████████████████████████████████▊ | 627/681 [46:35<02:28, 2.74s/it] {'loss': 0.6048, 'grad_norm': 78.66019439697266, 'learning_rate': 9.897955805412e-09, 'margin_dpo/margin_mean': 33.95563507080078, 'margin_dpo/margin_std': 34.049468994140625, 'logps/chosen': -69.19841003417969, 'logps/rejected': 
-162.16537475585938, 'logps/ref_chosen': -47.74314880371094, 'logps/ref_rejected': -106.75448608398438, 'logits/chosen': -0.579108476638794, 'logits/rejected': -0.587154746055603, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.95563507080078, 'margin_dpo/beta_margin_mean': 3.3955633640289307, 'margin_dpo/beta_margin_std': 3.4090704917907715, 'margin_dpo/beta_margin_grad_mean': -0.18939301371574402, 'margin_dpo/beta_margin_grad_std': 0.2481917440891266, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████▊ | 627/681 [46:35<02:28, 2.74s/it] 92%|███████████████████████████████████████████████████████████████████████▉ | 628/681 [46:37<02:23, 2.70s/it] {'loss': 0.3001, 'grad_norm': 41.49999237060547, 'learning_rate': 9.543589206795238e-09, 'margin_dpo/margin_mean': 35.616455078125, 'margin_dpo/margin_std': 25.80486297607422, 'logps/chosen': -82.25130462646484, 'logps/rejected': -159.23948669433594, 'logps/ref_chosen': -60.182945251464844, 'logps/ref_rejected': -101.55467224121094, 'logits/chosen': -0.6199311017990112, 'logits/rejected': -0.6005183458328247, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.616455078125, 'margin_dpo/beta_margin_mean': 3.5616455078125, 'margin_dpo/beta_margin_std': 2.615797758102417, 'margin_dpo/beta_margin_grad_mean': -0.12019230425357819, 'margin_dpo/beta_margin_grad_std': 0.15883654356002808, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████▉ | 628/681 [46:37<02:23, 2.70s/it] 92%|████████████████████████████████████████████████████████████████████████ | 629/681 [46:40<02:20, 2.71s/it] {'loss': 0.4054, 'grad_norm': 62.908477783203125, 'learning_rate': 9.19555885822887e-09, 'margin_dpo/margin_mean': 31.810806274414062, 'margin_dpo/margin_std': 25.02639389038086, 'logps/chosen': -86.42594909667969, 'logps/rejected': -145.6768798828125, 'logps/ref_chosen': -64.21353912353516, 'logps/ref_rejected': 
-91.65367126464844, 'logits/chosen': -0.6567898392677307, 'logits/rejected': -0.6142420768737793, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.81080436706543, 'margin_dpo/beta_margin_mean': 3.1810803413391113, 'margin_dpo/beta_margin_std': 2.5537304878234863, 'margin_dpo/beta_margin_grad_mean': -0.1407935917377472, 'margin_dpo/beta_margin_grad_std': 0.19307489693164825, 'epoch': 0.92} 92%|████████████████████████████████████████████████████████████████████████ | 629/681 [46:40<02:20, 2.71s/it] 93%|████████████████████████████████████████████████████████████████████████▏ | 630/681 [46:43<02:17, 2.69s/it] {'loss': 0.461, 'grad_norm': 60.66549301147461, 'learning_rate': 8.85387393063622e-09, 'margin_dpo/margin_mean': 30.402843475341797, 'margin_dpo/margin_std': 25.565155029296875, 'logps/chosen': -79.63174438476562, 'logps/rejected': -134.34188842773438, 'logps/ref_chosen': -59.29100036621094, 'logps/ref_rejected': -83.59829711914062, 'logits/chosen': -0.6706698536872864, 'logits/rejected': -0.6243743896484375, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.402841567993164, 'margin_dpo/beta_margin_mean': 3.0402843952178955, 'margin_dpo/beta_margin_std': 2.556612014770508, 'margin_dpo/beta_margin_grad_mean': -0.1645456999540329, 'margin_dpo/beta_margin_grad_std': 0.20726469159126282, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▏ | 630/681 [46:43<02:17, 2.69s/it] 93%|████████████████████████████████████████████████████████████████████████▎ | 631/681 [46:45<02:11, 2.62s/it] {'loss': 0.7356, 'grad_norm': 94.32537078857422, 'learning_rate': 8.518543427732949e-09, 'margin_dpo/margin_mean': 28.286869049072266, 'margin_dpo/margin_std': 29.41876220703125, 'logps/chosen': -83.84978485107422, 'logps/rejected': -133.63461303710938, 'logps/ref_chosen': -59.45360565185547, 'logps/ref_rejected': -80.95157623291016, 'logits/chosen': -0.6151013374328613, 'logits/rejected': 
-0.5717021822929382, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.286867141723633, 'margin_dpo/beta_margin_mean': 2.8286869525909424, 'margin_dpo/beta_margin_std': 2.959045886993408, 'margin_dpo/beta_margin_grad_mean': -0.1976294070482254, 'margin_dpo/beta_margin_grad_std': 0.26410892605781555, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▎ | 631/681 [46:45<02:11, 2.62s/it] 93%|████████████████████████████████████████████████████████████████████████▍ | 632/681 [46:47<02:03, 2.53s/it] {'loss': 0.7093, 'grad_norm': 86.42517852783203, 'learning_rate': 8.189576185789637e-09, 'margin_dpo/margin_mean': 32.566551208496094, 'margin_dpo/margin_std': 29.249189376831055, 'logps/chosen': -85.71180725097656, 'logps/rejected': -143.08697509765625, 'logps/ref_chosen': -61.35155487060547, 'logps/ref_rejected': -86.16017150878906, 'logits/chosen': -0.619070291519165, 'logits/rejected': -0.5838553309440613, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.566551208496094, 'margin_dpo/beta_margin_mean': 3.256655216217041, 'margin_dpo/beta_margin_std': 3.002157211303711, 'margin_dpo/beta_margin_grad_mean': -0.16710862517356873, 'margin_dpo/beta_margin_grad_std': 0.26828470826148987, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▍ | 632/681 [46:47<02:03, 2.53s/it] 93%|████████████████████████████████████████████████████████████████████████▌ | 633/681 [46:50<02:01, 2.53s/it] {'loss': 0.5499, 'grad_norm': 60.10581970214844, 'learning_rate': 7.866980873399015e-09, 'margin_dpo/margin_mean': 27.419578552246094, 'margin_dpo/margin_std': 24.602121353149414, 'logps/chosen': -80.6368408203125, 'logps/rejected': -142.36219787597656, 'logps/ref_chosen': -57.278167724609375, 'logps/ref_rejected': -91.58395385742188, 'logits/chosen': -0.6361432075500488, 'logits/rejected': -0.6225095987319946, 'margin_dpo/beta': 0.10000000149011612, 
'margin_dpo/loss_margin_mean': 27.419578552246094, 'margin_dpo/beta_margin_mean': 2.741957902908325, 'margin_dpo/beta_margin_std': 2.554403066635132, 'margin_dpo/beta_margin_grad_mean': -0.19326050579547882, 'margin_dpo/beta_margin_grad_std': 0.22396619617938995, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▌ | 633/681 [46:50<02:01, 2.53s/it] 93%|████████████████████████████████████████████████████████████████████████▌ | 634/681 [46:53<02:00, 2.56s/it] {'loss': 0.6531, 'grad_norm': 73.96012878417969, 'learning_rate': 7.550765991247654e-09, 'margin_dpo/margin_mean': 27.910812377929688, 'margin_dpo/margin_std': 29.05972671508789, 'logps/chosen': -93.19425964355469, 'logps/rejected': -161.61175537109375, 'logps/ref_chosen': -66.61896514892578, 'logps/ref_rejected': -107.12565612792969, 'logits/chosen': -0.5560423135757446, 'logits/rejected': -0.538284420967102, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 27.910810470581055, 'margin_dpo/beta_margin_mean': 2.791081190109253, 'margin_dpo/beta_margin_std': 2.91212797164917, 'margin_dpo/beta_margin_grad_mean': -0.20842374861240387, 'margin_dpo/beta_margin_grad_std': 0.24925780296325684, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▌ | 634/681 [46:53<02:00, 2.56s/it] 93%|████████████████████████████████████████████████████████████████████████▋ | 635/681 [46:55<01:58, 2.58s/it] {'loss': 0.409, 'grad_norm': 50.861839294433594, 'learning_rate': 7.240939871891699e-09, 'margin_dpo/margin_mean': 28.63296127319336, 'margin_dpo/margin_std': 22.48883628845215, 'logps/chosen': -96.6619873046875, 'logps/rejected': -133.83990478515625, 'logps/ref_chosen': -73.95551300048828, 'logps/ref_rejected': -82.50045776367188, 'logits/chosen': -0.608803391456604, 'logits/rejected': -0.5592911243438721, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.63296127319336, 'margin_dpo/beta_margin_mean': 
2.8632962703704834, 'margin_dpo/beta_margin_std': 2.2558207511901855, 'margin_dpo/beta_margin_grad_mean': -0.15432217717170715, 'margin_dpo/beta_margin_grad_std': 0.18803834915161133, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▋ | 635/681 [46:55<01:58, 2.58s/it] 93%|████████████████████████████████████████████████████████████████████████▊ | 636/681 [46:58<01:59, 2.66s/it] {'loss': 0.4012, 'grad_norm': 47.65840530395508, 'learning_rate': 6.937510679537628e-09, 'margin_dpo/margin_mean': 33.004608154296875, 'margin_dpo/margin_std': 23.834693908691406, 'logps/chosen': -82.30425262451172, 'logps/rejected': -137.65878295898438, 'logps/ref_chosen': -59.628910064697266, 'logps/ref_rejected': -81.97883605957031, 'logits/chosen': -0.5629330277442932, 'logits/rejected': -0.5346908569335938, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.00461196899414, 'margin_dpo/beta_margin_mean': 3.3004610538482666, 'margin_dpo/beta_margin_std': 2.40425968170166, 'margin_dpo/beta_margin_grad_mean': -0.13953568041324615, 'margin_dpo/beta_margin_grad_std': 0.21565653383731842, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████▊ | 636/681 [46:58<01:59, 2.66s/it] 94%|████████████████████████████████████████████████████████████████████████▉ | 637/681 [47:01<01:59, 2.73s/it] {'loss': 0.3634, 'grad_norm': 53.40937042236328, 'learning_rate': 6.640486409826785e-09, 'margin_dpo/margin_mean': 32.936134338378906, 'margin_dpo/margin_std': 25.349170684814453, 'logps/chosen': -73.21141815185547, 'logps/rejected': -154.89999389648438, 'logps/ref_chosen': -49.652687072753906, 'logps/ref_rejected': -98.40513610839844, 'logits/chosen': -0.5897486209869385, 'logits/rejected': -0.5671026110649109, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.936134338378906, 'margin_dpo/beta_margin_mean': 3.2936134338378906, 'margin_dpo/beta_margin_std': 2.5701606273651123, 
'margin_dpo/beta_margin_grad_mean': -0.13438232243061066, 'margin_dpo/beta_margin_grad_std': 0.18966805934906006, 'epoch': 0.94} 94%|████████████████████████████████████████████████████████████████████████▉ | 637/681 [47:01<01:59, 2.73s/it] 94%|█████████████████████████████████████████████████████████████████████████ | 638/681 [47:04<01:58, 2.75s/it] {'loss': 0.3245, 'grad_norm': 41.96897888183594, 'learning_rate': 6.349874889624962e-09, 'margin_dpo/margin_mean': 37.08156967163086, 'margin_dpo/margin_std': 27.137168884277344, 'logps/chosen': -78.70539855957031, 'logps/rejected': -136.9318084716797, 'logps/ref_chosen': -58.156646728515625, 'logps/ref_rejected': -79.3014907836914, 'logits/chosen': -0.5449614524841309, 'logits/rejected': -0.49521952867507935, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.08156967163086, 'margin_dpo/beta_margin_mean': 3.7081568241119385, 'margin_dpo/beta_margin_std': 2.8503551483154297, 'margin_dpo/beta_margin_grad_mean': -0.12071166932582855, 'margin_dpo/beta_margin_grad_std': 0.17367184162139893, 'epoch': 0.94} 94%|█████████████████████████████████████████████████████████████████████████ | 638/681 [47:04<01:58, 2.75s/it] 94%|█████████████████████████████████████████████████████████████████████████▏ | 639/681 [47:06<01:53, 2.70s/it] {'loss': 0.4397, 'grad_norm': 57.53899383544922, 'learning_rate': 6.065683776815933e-09, 'margin_dpo/margin_mean': 31.09198760986328, 'margin_dpo/margin_std': 24.611787796020508, 'logps/chosen': -97.73635864257812, 'logps/rejected': -130.7800750732422, 'logps/ref_chosen': -72.32319641113281, 'logps/ref_rejected': -74.2749252319336, 'logits/chosen': -0.58185875415802, 'logits/rejected': -0.5182079672813416, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.09198570251465, 'margin_dpo/beta_margin_mean': 3.109198570251465, 'margin_dpo/beta_margin_std': 2.4722304344177246, 'margin_dpo/beta_margin_grad_mean': -0.14476662874221802, 
'margin_dpo/beta_margin_grad_std': 0.199687659740448, 'epoch': 0.94} 94%|█████████████████████████████████████████████████████████████████████████▏ | 639/681 [47:06<01:53, 2.70s/it] 94%|█████████████████████████████████████████████████████████████████████████▎ | 640/681 [47:09<01:50, 2.69s/it] {'loss': 0.3066, 'grad_norm': 44.61709213256836, 'learning_rate': 5.7879205600998296e-09, 'margin_dpo/margin_mean': 36.84351348876953, 'margin_dpo/margin_std': 29.767667770385742, 'logps/chosen': -78.43016815185547, 'logps/rejected': -167.73947143554688, 'logps/ref_chosen': -56.13436508178711, 'logps/ref_rejected': -108.60014343261719, 'logits/chosen': -0.5778528451919556, 'logits/rejected': -0.5412660241127014, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.84351348876953, 'margin_dpo/beta_margin_mean': 3.6843512058258057, 'margin_dpo/beta_margin_std': 3.017540454864502, 'margin_dpo/beta_margin_grad_mean': -0.12420199811458588, 'margin_dpo/beta_margin_grad_std': 0.15969912707805634, 'epoch': 0.94} 94%|█████████████████████████████████████████████████████████████████████████▎ | 640/681 [47:09<01:50, 2.69s/it] 94%|█████████████████████████████████████████████████████████████████████████▍ | 641/681 [47:12<01:46, 2.66s/it] {'loss': 0.3746, 'grad_norm': 51.515228271484375, 'learning_rate': 5.516592558795746e-09, 'margin_dpo/margin_mean': 32.17498779296875, 'margin_dpo/margin_std': 29.780851364135742, 'logps/chosen': -88.82362365722656, 'logps/rejected': -142.99404907226562, 'logps/ref_chosen': -64.99689483642578, 'logps/ref_rejected': -86.99232482910156, 'logits/chosen': -0.6603978872299194, 'logits/rejected': -0.6059365272521973, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.17498779296875, 'margin_dpo/beta_margin_mean': 3.217498779296875, 'margin_dpo/beta_margin_std': 3.0656890869140625, 'margin_dpo/beta_margin_grad_mean': -0.1466035395860672, 'margin_dpo/beta_margin_grad_std': 0.17220094799995422, 'epoch': 0.94} 
94%|█████████████████████████████████████████████████████████████████████████▍ | 641/681 [47:12<01:46, 2.66s/it] 94%|█████████████████████████████████████████████████████████████████████████▌ | 642/681 [47:14<01:44, 2.68s/it] {'loss': 0.4822, 'grad_norm': 78.2542724609375, 'learning_rate': 5.251706922648868e-09, 'margin_dpo/margin_mean': 35.434226989746094, 'margin_dpo/margin_std': 30.440698623657227, 'logps/chosen': -90.29745483398438, 'logps/rejected': -170.28448486328125, 'logps/ref_chosen': -65.68924713134766, 'logps/ref_rejected': -110.24205017089844, 'logits/chosen': -0.5912165641784668, 'logits/rejected': -0.5562861561775208, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.43423080444336, 'margin_dpo/beta_margin_mean': 3.5434229373931885, 'margin_dpo/beta_margin_std': 3.0842573642730713, 'margin_dpo/beta_margin_grad_mean': -0.15366876125335693, 'margin_dpo/beta_margin_grad_std': 0.22277072072029114, 'epoch': 0.94} 94%|█████████████████████████████████████████████████████████████████████████▌ | 642/681 [47:14<01:44, 2.68s/it] 94%|█████████████████████████████████████████████████████████████████████████▋ | 643/681 [47:17<01:42, 2.69s/it] {'loss': 0.4257, 'grad_norm': 51.46054458618164, 'learning_rate': 4.993270631642038e-09, 'margin_dpo/margin_mean': 30.795534133911133, 'margin_dpo/margin_std': 24.044445037841797, 'logps/chosen': -71.25507354736328, 'logps/rejected': -137.56893920898438, 'logps/ref_chosen': -51.94999694824219, 'logps/ref_rejected': -87.46833801269531, 'logits/chosen': -0.6483656764030457, 'logits/rejected': -0.62122642993927, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.795534133911133, 'margin_dpo/beta_margin_mean': 3.0795533657073975, 'margin_dpo/beta_margin_std': 2.446993350982666, 'margin_dpo/beta_margin_grad_mean': -0.14452366530895233, 'margin_dpo/beta_margin_grad_std': 0.19863076508045197, 'epoch': 0.94} 94%|█████████████████████████████████████████████████████████████████████████▋ 
| 643/681 [47:17<01:42, 2.69s/it] 95%|█████████████████████████████████████████████████████████████████████████▊ | 644/681 [47:20<01:38, 2.67s/it] {'loss': 0.5657, 'grad_norm': 75.44609069824219, 'learning_rate': 4.741290495811873e-09, 'margin_dpo/margin_mean': 30.231281280517578, 'margin_dpo/margin_std': 28.730857849121094, 'logps/chosen': -79.76002502441406, 'logps/rejected': -138.11033630371094, 'logps/ref_chosen': -59.017662048339844, 'logps/ref_rejected': -87.13668823242188, 'logits/chosen': -0.6009418964385986, 'logits/rejected': -0.57252037525177, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.231281280517578, 'margin_dpo/beta_margin_mean': 3.0231282711029053, 'margin_dpo/beta_margin_std': 2.8853414058685303, 'margin_dpo/beta_margin_grad_mean': -0.1847480684518814, 'margin_dpo/beta_margin_grad_std': 0.23598462343215942, 'epoch': 0.95} 95%|█████████████████████████████████████████████████████████████████████████▊ | 644/681 [47:20<01:38, 2.67s/it] 95%|█████████████████████████████████████████████████████████████████████████▉ | 645/681 [47:22<01:35, 2.66s/it] {'loss': 0.544, 'grad_norm': 70.22451782226562, 'learning_rate': 4.495773155069299e-09, 'margin_dpo/margin_mean': 28.99817657470703, 'margin_dpo/margin_std': 27.904760360717773, 'logps/chosen': -79.71002197265625, 'logps/rejected': -150.6129913330078, 'logps/ref_chosen': -55.87602233886719, 'logps/ref_rejected': -97.78080749511719, 'logits/chosen': -0.5856224298477173, 'logits/rejected': -0.5652365684509277, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.99817657470703, 'margin_dpo/beta_margin_mean': 2.899817705154419, 'margin_dpo/beta_margin_std': 2.835707187652588, 'margin_dpo/beta_margin_grad_mean': -0.19084730744361877, 'margin_dpo/beta_margin_grad_std': 0.22894078493118286, 'epoch': 0.95} 95%|█████████████████████████████████████████████████████████████████████████▉ | 645/681 [47:22<01:35, 2.66s/it] 
95%|█████████████████████████████████████████████████████████████████████████▉ | 646/681 [47:25<01:30, 2.59s/it] {'loss': 0.316, 'grad_norm': 51.758888244628906, 'learning_rate': 4.256725079024553e-09, 'margin_dpo/margin_mean': 33.072837829589844, 'margin_dpo/margin_std': 22.390499114990234, 'logps/chosen': -83.82559967041016, 'logps/rejected': -133.1284637451172, 'logps/ref_chosen': -61.275787353515625, 'logps/ref_rejected': -77.50580596923828, 'logits/chosen': -0.6095120906829834, 'logits/rejected': -0.5594819784164429, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.072837829589844, 'margin_dpo/beta_margin_mean': 3.307283878326416, 'margin_dpo/beta_margin_std': 2.259632110595703, 'margin_dpo/beta_margin_grad_mean': -0.11909312754869461, 'margin_dpo/beta_margin_grad_std': 0.17143264412879944, 'epoch': 0.95} 95%|█████████████████████████████████████████████████████████████████████████▉ | 646/681 [47:25<01:30, 2.59s/it] 95%|██████████████████████████████████████████████████████████████████████████ | 647/681 [47:28<01:31, 2.68s/it] {'loss': 0.5032, 'grad_norm': 81.30612182617188, 'learning_rate': 4.024152566816791e-09, 'margin_dpo/margin_mean': 32.761592864990234, 'margin_dpo/margin_std': 26.8262939453125, 'logps/chosen': -78.84927368164062, 'logps/rejected': -150.27786254882812, 'logps/ref_chosen': -54.852413177490234, 'logps/ref_rejected': -93.5194091796875, 'logits/chosen': -0.5496389865875244, 'logits/rejected': -0.5257160067558289, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.761592864990234, 'margin_dpo/beta_margin_mean': 3.2761595249176025, 'margin_dpo/beta_margin_std': 2.728703260421753, 'margin_dpo/beta_margin_grad_mean': -0.16137224435806274, 'margin_dpo/beta_margin_grad_std': 0.24134768545627594, 'epoch': 0.95} 95%|██████████████████████████████████████████████████████████████████████████ | 647/681 [47:28<01:31, 2.68s/it] 95%|██████████████████████████████████████████████████████████████████████████▏ 
| 648/681 [47:30<01:26, 2.61s/it] {'loss': 0.3728, 'grad_norm': 47.32956314086914, 'learning_rate': 3.798061746947995e-09, 'margin_dpo/margin_mean': 40.340911865234375, 'margin_dpo/margin_std': 34.24688720703125, 'logps/chosen': -73.89356231689453, 'logps/rejected': -158.77578735351562, 'logps/ref_chosen': -54.17146682739258, 'logps/ref_rejected': -98.71279907226562, 'logits/chosen': -0.6139056086540222, 'logits/rejected': -0.6051241159439087, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 40.340911865234375, 'margin_dpo/beta_margin_mean': 4.034091472625732, 'margin_dpo/beta_margin_std': 3.443060874938965, 'margin_dpo/beta_margin_grad_mean': -0.138786181807518, 'margin_dpo/beta_margin_grad_std': 0.19550208747386932, 'epoch': 0.95} 95%|██████████████████████████████████████████████████████████████████████████▏ | 648/681 [47:30<01:26, 2.61s/it] 95%|██████████████████████████████████████████████████████████████████████████▎ | 649/681 [47:33<01:25, 2.67s/it] {'loss': 0.536, 'grad_norm': 50.813629150390625, 'learning_rate': 3.5784585771215235e-09, 'margin_dpo/margin_mean': 28.49066925048828, 'margin_dpo/margin_std': 28.419557571411133, 'logps/chosen': -83.07283020019531, 'logps/rejected': -129.16033935546875, 'logps/ref_chosen': -62.4803466796875, 'logps/ref_rejected': -80.07717895507812, 'logits/chosen': -0.6515902876853943, 'logits/rejected': -0.6201357841491699, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 28.49066734313965, 'margin_dpo/beta_margin_mean': 2.849066734313965, 'margin_dpo/beta_margin_std': 2.8495798110961914, 'margin_dpo/beta_margin_grad_mean': -0.1962561458349228, 'margin_dpo/beta_margin_grad_std': 0.21189990639686584, 'epoch': 0.95} 95%|██████████████████████████████████████████████████████████████████████████▎ | 649/681 [47:33<01:25, 2.67s/it] 95%|██████████████████████████████████████████████████████████████████████████▍ | 650/681 [47:35<01:22, 2.65s/it] {'loss': 0.3545, 'grad_norm': 
59.442115783691406, 'learning_rate': 3.3653488440851253e-09, 'margin_dpo/margin_mean': 36.623985290527344, 'margin_dpo/margin_std': 28.712535858154297, 'logps/chosen': -80.34698486328125, 'logps/rejected': -159.14297485351562, 'logps/ref_chosen': -56.09281921386719, 'logps/ref_rejected': -98.26483917236328, 'logits/chosen': -0.5737979412078857, 'logits/rejected': -0.5637534260749817, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.62398147583008, 'margin_dpo/beta_margin_mean': 3.662398338317871, 'margin_dpo/beta_margin_std': 2.9485061168670654, 'margin_dpo/beta_margin_grad_mean': -0.13270466029644012, 'margin_dpo/beta_margin_grad_std': 0.1840543895959854, 'epoch': 0.95} 95%|██████████████████████████████████████████████████████████████████████████▍ | 650/681 [47:35<01:22, 2.65s/it] 96%|██████████████████████████████████████████████████████████████████████████▌ | 651/681 [47:38<01:18, 2.63s/it] {'loss': 0.3146, 'grad_norm': 38.10145950317383, 'learning_rate': 3.158738163478475e-09, 'margin_dpo/margin_mean': 35.643516540527344, 'margin_dpo/margin_std': 26.896413803100586, 'logps/chosen': -62.947837829589844, 'logps/rejected': -155.12380981445312, 'logps/ref_chosen': -43.42544937133789, 'logps/ref_rejected': -99.9579086303711, 'logits/chosen': -0.653481125831604, 'logits/rejected': -0.6552349328994751, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.64352035522461, 'margin_dpo/beta_margin_mean': 3.564352035522461, 'margin_dpo/beta_margin_std': 2.6986141204833984, 'margin_dpo/beta_margin_grad_mean': -0.12548606097698212, 'margin_dpo/beta_margin_grad_std': 0.1639980524778366, 'epoch': 0.96} 96%|██████████████████████████████████████████████████████████████████████████▌ | 651/681 [47:38<01:18, 2.63s/it] 96%|██████████████████████████████████████████████████████████████████████████▋ | 652/681 [47:41<01:15, 2.62s/it] {'loss': 0.3386, 'grad_norm': 39.05808639526367, 'learning_rate': 2.9586319796851555e-09, 
'margin_dpo/margin_mean': 35.44752502441406, 'margin_dpo/margin_std': 28.104263305664062, 'logps/chosen': -78.93205261230469, 'logps/rejected': -163.570556640625, 'logps/ref_chosen': -62.57680892944336, 'logps/ref_rejected': -111.76779174804688, 'logits/chosen': -0.6412711143493652, 'logits/rejected': -0.617784857749939, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.44752502441406, 'margin_dpo/beta_margin_mean': 3.544752597808838, 'margin_dpo/beta_margin_std': 2.8201441764831543, 'margin_dpo/beta_margin_grad_mean': -0.13532721996307373, 'margin_dpo/beta_margin_grad_std': 0.16726936399936676, 'epoch': 0.96} 96%|██████████████████████████████████████████████████████████████████████████▋ | 652/681 [47:41<01:15, 2.62s/it] 96%|██████████████████████████████████████████████████████████████████████████▊ | 653/681 [47:43<01:12, 2.60s/it] {'loss': 0.3249, 'grad_norm': 51.56984329223633, 'learning_rate': 2.7650355656892166e-09, 'margin_dpo/margin_mean': 35.66375732421875, 'margin_dpo/margin_std': 25.806888580322266, 'logps/chosen': -84.49002075195312, 'logps/rejected': -162.29043579101562, 'logps/ref_chosen': -61.11295700073242, 'logps/ref_rejected': -103.24960327148438, 'logits/chosen': -0.6192601919174194, 'logits/rejected': -0.5976792573928833, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.66375732421875, 'margin_dpo/beta_margin_mean': 3.566375970840454, 'margin_dpo/beta_margin_std': 2.590984344482422, 'margin_dpo/beta_margin_grad_mean': -0.12032375484704971, 'margin_dpo/beta_margin_grad_std': 0.18257243931293488, 'epoch': 0.96} 96%|██████████████████████████████████████████████████████████████████████████▊ | 653/681 [47:43<01:12, 2.60s/it] 96%|██████████████████████████████████████████████████████████████████████████▉ | 654/681 [47:46<01:10, 2.60s/it] {'loss': 0.5285, 'grad_norm': 72.21066284179688, 'learning_rate': 2.577954022936174e-09, 'margin_dpo/margin_mean': 29.483509063720703, 'margin_dpo/margin_std': 
28.753616333007812, 'logps/chosen': -86.98482513427734, 'logps/rejected': -153.51400756835938, 'logps/ref_chosen': -61.7281379699707, 'logps/ref_rejected': -98.7738037109375, 'logits/chosen': -0.6111325025558472, 'logits/rejected': -0.6062880754470825, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 29.48350715637207, 'margin_dpo/beta_margin_mean': 2.948350667953491, 'margin_dpo/beta_margin_std': 2.8891336917877197, 'margin_dpo/beta_margin_grad_mean': -0.17636682093143463, 'margin_dpo/beta_margin_grad_std': 0.2301536500453949, 'epoch': 0.96} 96%|██████████████████████████████████████████████████████████████████████████▉ | 654/681 [47:46<01:10, 2.60s/it] 96%|███████████████████████████████████████████████████████████████████████████ | 655/681 [47:49<01:10, 2.70s/it] {'loss': 0.5089, 'grad_norm': 72.21392059326172, 'learning_rate': 2.397392281198729e-09, 'margin_dpo/margin_mean': 30.534870147705078, 'margin_dpo/margin_std': 29.086572647094727, 'logps/chosen': -70.99528503417969, 'logps/rejected': -150.2451629638672, 'logps/ref_chosen': -49.576812744140625, 'logps/ref_rejected': -98.29183197021484, 'logits/chosen': -0.6073825359344482, 'logits/rejected': -0.6081333160400391, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.53487205505371, 'margin_dpo/beta_margin_mean': 3.0534873008728027, 'margin_dpo/beta_margin_std': 2.986149311065674, 'margin_dpo/beta_margin_grad_mean': -0.18340007960796356, 'margin_dpo/beta_margin_grad_std': 0.217272087931633, 'epoch': 0.96} 96%|███████████████████████████████████████████████████████████████████████████ | 655/681 [47:49<01:10, 2.70s/it] 96%|███████████████████████████████████████████████████████████████████████████▏ | 656/681 [47:51<01:07, 2.72s/it] {'loss': 0.2412, 'grad_norm': 40.71949768066406, 'learning_rate': 2.223355098446622e-09, 'margin_dpo/margin_mean': 41.93996047973633, 'margin_dpo/margin_std': 25.561412811279297, 'logps/chosen': -73.37840270996094, 'logps/rejected': 
-176.44357299804688, 'logps/ref_chosen': -52.54943084716797, 'logps/ref_rejected': -113.67464447021484, 'logits/chosen': -0.5212767124176025, 'logits/rejected': -0.5257933139801025, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 41.93996047973633, 'margin_dpo/beta_margin_mean': 4.193995952606201, 'margin_dpo/beta_margin_std': 2.617056369781494, 'margin_dpo/beta_margin_grad_mean': -0.0933179035782814, 'margin_dpo/beta_margin_grad_std': 0.15993143618106842, 'epoch': 0.96} 96%|███████████████████████████████████████████████████████████████████████████▏ | 656/681 [47:51<01:07, 2.72s/it] 96%|███████████████████████████████████████████████████████████████████████████▎ | 657/681 [47:54<01:01, 2.58s/it] {'loss': 0.3432, 'grad_norm': 45.717838287353516, 'learning_rate': 2.055847060721566e-09, 'margin_dpo/margin_mean': 37.42141342163086, 'margin_dpo/margin_std': 28.687862396240234, 'logps/chosen': -68.62776184082031, 'logps/rejected': -157.26351928710938, 'logps/ref_chosen': -46.700538635253906, 'logps/ref_rejected': -97.91487121582031, 'logits/chosen': -0.6373677849769592, 'logits/rejected': -0.6168010234832764, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.42141342163086, 'margin_dpo/beta_margin_mean': 3.7421414852142334, 'margin_dpo/beta_margin_std': 2.8833959102630615, 'margin_dpo/beta_margin_grad_mean': -0.11700256913900375, 'margin_dpo/beta_margin_grad_std': 0.1847466230392456, 'epoch': 0.96} 96%|███████████████████████████████████████████████████████████████████████████▎ | 657/681 [47:54<01:01, 2.58s/it] 97%|███████████████████████████████████████████████████████████████████████████▎ | 658/681 [47:56<00:58, 2.55s/it] {'loss': 0.4487, 'grad_norm': 59.321533203125, 'learning_rate': 1.8948725820160662e-09, 'margin_dpo/margin_mean': 35.119667053222656, 'margin_dpo/margin_std': 29.735076904296875, 'logps/chosen': -86.52423095703125, 'logps/rejected': -156.62518310546875, 'logps/ref_chosen': -60.958213806152344, 
'logps/ref_rejected': -95.93949127197266, 'logits/chosen': -0.6310451030731201, 'logits/rejected': -0.5929208993911743, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.11966323852539, 'margin_dpo/beta_margin_mean': 3.5119664669036865, 'margin_dpo/beta_margin_std': 3.0297350883483887, 'margin_dpo/beta_margin_grad_mean': -0.14949670433998108, 'margin_dpo/beta_margin_grad_std': 0.2181350290775299, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▎ | 658/681 [47:56<00:58, 2.55s/it] 97%|███████████████████████████████████████████████████████████████████████████▍ | 659/681 [47:59<00:57, 2.60s/it] {'loss': 0.5047, 'grad_norm': 57.14666748046875, 'learning_rate': 1.7404359041573723e-09, 'margin_dpo/margin_mean': 34.30597686767578, 'margin_dpo/margin_std': 29.283281326293945, 'logps/chosen': -96.09359741210938, 'logps/rejected': -141.1275634765625, 'logps/ref_chosen': -76.74298095703125, 'logps/ref_rejected': -87.4709701538086, 'logits/chosen': -0.6056843996047974, 'logits/rejected': -0.540166974067688, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.305973052978516, 'margin_dpo/beta_margin_mean': 3.4305975437164307, 'margin_dpo/beta_margin_std': 2.9324827194213867, 'margin_dpo/beta_margin_grad_mean': -0.16867277026176453, 'margin_dpo/beta_margin_grad_std': 0.23597054183483124, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▍ | 659/681 [47:59<00:57, 2.60s/it] 97%|███████████████████████████████████████████████████████████████████████████▌ | 660/681 [48:02<00:55, 2.62s/it] {'loss': 0.2917, 'grad_norm': 49.01050567626953, 'learning_rate': 1.592541096695571e-09, 'margin_dpo/margin_mean': 37.83965301513672, 'margin_dpo/margin_std': 27.737031936645508, 'logps/chosen': -80.31892395019531, 'logps/rejected': -135.07073974609375, 'logps/ref_chosen': -59.047882080078125, 'logps/ref_rejected': -75.96005249023438, 'logits/chosen': 
-0.6273288130760193, 'logits/rejected': -0.5808557271957397, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.83964920043945, 'margin_dpo/beta_margin_mean': 3.7839651107788086, 'margin_dpo/beta_margin_std': 2.7794392108917236, 'margin_dpo/beta_margin_grad_mean': -0.1133999153971672, 'margin_dpo/beta_margin_grad_std': 0.1706753671169281, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▌ | 660/681 [48:02<00:55, 2.62s/it] 97%|███████████████████████████████████████████████████████████████████████████▋ | 661/681 [48:04<00:49, 2.50s/it] {'loss': 0.4523, 'grad_norm': 64.96249389648438, 'learning_rate': 1.4511920567963908e-09, 'margin_dpo/margin_mean': 34.966941833496094, 'margin_dpo/margin_std': 29.39708709716797, 'logps/chosen': -71.31771850585938, 'logps/rejected': -141.6163787841797, 'logps/ref_chosen': -50.673973083496094, 'logps/ref_rejected': -86.00569152832031, 'logits/chosen': -0.6019885540008545, 'logits/rejected': -0.5567299127578735, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.966941833496094, 'margin_dpo/beta_margin_mean': 3.4966940879821777, 'margin_dpo/beta_margin_std': 3.0865559577941895, 'margin_dpo/beta_margin_grad_mean': -0.14642944931983948, 'margin_dpo/beta_margin_grad_std': 0.2149476855993271, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▋ | 661/681 [48:04<00:49, 2.50s/it] 97%|███████████████████████████████████████████████████████████████████████████▊ | 662/681 [48:07<00:49, 2.59s/it] {'loss': 0.378, 'grad_norm': 50.99778747558594, 'learning_rate': 1.3163925091384532e-09, 'margin_dpo/margin_mean': 30.80561065673828, 'margin_dpo/margin_std': 25.51202964782715, 'logps/chosen': -93.49765014648438, 'logps/rejected': -144.09814453125, 'logps/ref_chosen': -69.26106262207031, 'logps/ref_rejected': -89.05593872070312, 'logits/chosen': -0.6079974174499512, 'logits/rejected': -0.5556979775428772, 
'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 30.805612564086914, 'margin_dpo/beta_margin_mean': 3.080561399459839, 'margin_dpo/beta_margin_std': 2.571131944656372, 'margin_dpo/beta_margin_grad_mean': -0.14259321987628937, 'margin_dpo/beta_margin_grad_std': 0.17750491201877594, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▊ | 662/681 [48:07<00:49, 2.59s/it] 97%|███████████████████████████████████████████████████████████████████████████▉ | 663/681 [48:09<00:47, 2.64s/it] {'loss': 0.3262, 'grad_norm': 39.297733306884766, 'learning_rate': 1.1881460058152382e-09, 'margin_dpo/margin_mean': 33.046897888183594, 'margin_dpo/margin_std': 24.47772216796875, 'logps/chosen': -83.19400024414062, 'logps/rejected': -165.287353515625, 'logps/ref_chosen': -64.87891387939453, 'logps/ref_rejected': -113.92536926269531, 'logits/chosen': -0.6374907493591309, 'logits/rejected': -0.6157968044281006, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.04689407348633, 'margin_dpo/beta_margin_mean': 3.304689407348633, 'margin_dpo/beta_margin_std': 2.461198568344116, 'margin_dpo/beta_margin_grad_mean': -0.12396994978189468, 'margin_dpo/beta_margin_grad_std': 0.15943719446659088, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▉ | 663/681 [48:09<00:47, 2.64s/it] 98%|████████████████████████████████████████████████████████████████████████████ | 664/681 [48:12<00:44, 2.64s/it] {'loss': 0.4288, 'grad_norm': 69.85308074951172, 'learning_rate': 1.066455926241383e-09, 'margin_dpo/margin_mean': 37.14732360839844, 'margin_dpo/margin_std': 26.97930145263672, 'logps/chosen': -84.34225463867188, 'logps/rejected': -166.12283325195312, 'logps/ref_chosen': -60.88847351074219, 'logps/ref_rejected': -105.521728515625, 'logits/chosen': -0.5776158571243286, 'logits/rejected': -0.5483744144439697, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 
37.14732360839844, 'margin_dpo/beta_margin_mean': 3.7147321701049805, 'margin_dpo/beta_margin_std': 2.736865997314453, 'margin_dpo/beta_margin_grad_mean': -0.11845803260803223, 'margin_dpo/beta_margin_grad_std': 0.1965658962726593, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████ | 664/681 [48:12<00:44, 2.64s/it] 98%|████████████████████████████████████████████████████████████████████████████▏ | 665/681 [48:14<00:41, 2.57s/it] {'loss': 0.3524, 'grad_norm': 44.827796936035156, 'learning_rate': 9.513254770636137e-10, 'margin_dpo/margin_mean': 31.53219223022461, 'margin_dpo/margin_std': 23.21342658996582, 'logps/chosen': -81.45133972167969, 'logps/rejected': -137.22821044921875, 'logps/ref_chosen': -60.56413269042969, 'logps/ref_rejected': -84.8088150024414, 'logits/chosen': -0.6395413279533386, 'logits/rejected': -0.5962468385696411, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.53219223022461, 'margin_dpo/beta_margin_mean': 3.153219223022461, 'margin_dpo/beta_margin_std': 2.3784492015838623, 'margin_dpo/beta_margin_grad_mean': -0.13737604022026062, 'margin_dpo/beta_margin_grad_std': 0.17597481608390808, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▏ | 665/681 [48:14<00:41, 2.57s/it] 98%|████████████████████████████████████████████████████████████████████████████▎ | 666/681 [48:17<00:39, 2.64s/it] {'loss': 0.4262, 'grad_norm': 61.68048858642578, 'learning_rate': 8.427576920763956e-10, 'margin_dpo/margin_mean': 35.28595733642578, 'margin_dpo/margin_std': 26.031997680664062, 'logps/chosen': -88.06729125976562, 'logps/rejected': -154.82492065429688, 'logps/ref_chosen': -64.41996002197266, 'logps/ref_rejected': -95.89163208007812, 'logits/chosen': -0.6096721887588501, 'logits/rejected': -0.5720229148864746, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.28595733642578, 'margin_dpo/beta_margin_mean': 3.5285956859588623, 
'margin_dpo/beta_margin_std': 2.662677049636841, 'margin_dpo/beta_margin_grad_mean': -0.13251623511314392, 'margin_dpo/beta_margin_grad_std': 0.21360599994659424, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▎ | 666/681 [48:17<00:39, 2.64s/it] 98%|████████████████████████████████████████████████████████████████████████████▍ | 667/681 [48:20<00:38, 2.72s/it] {'loss': 0.3242, 'grad_norm': 58.16268539428711, 'learning_rate': 7.407554321417764e-10, 'margin_dpo/margin_mean': 34.412696838378906, 'margin_dpo/margin_std': 24.47201919555664, 'logps/chosen': -94.41732025146484, 'logps/rejected': -147.3884735107422, 'logps/ref_chosen': -69.27703094482422, 'logps/ref_rejected': -87.83549499511719, 'logits/chosen': -0.5887176990509033, 'logits/rejected': -0.536880612373352, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 34.412696838378906, 'margin_dpo/beta_margin_mean': 3.441269636154175, 'margin_dpo/beta_margin_std': 2.472762107849121, 'margin_dpo/beta_margin_grad_mean': -0.12313113361597061, 'margin_dpo/beta_margin_grad_std': 0.16970713436603546, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▍ | 667/681 [48:20<00:38, 2.72s/it] 98%|████████████████████████████████████████████████████████████████████████████▌ | 668/681 [48:23<00:35, 2.76s/it] {'loss': 0.4507, 'grad_norm': 69.86036682128906, 'learning_rate': 6.453213851142225e-10, 'margin_dpo/margin_mean': 33.146949768066406, 'margin_dpo/margin_std': 25.9494571685791, 'logps/chosen': -96.0662841796875, 'logps/rejected': -160.34828186035156, 'logps/ref_chosen': -72.60400390625, 'logps/ref_rejected': -103.73905181884766, 'logits/chosen': -0.6267153024673462, 'logits/rejected': -0.5883671641349792, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 33.146949768066406, 'margin_dpo/beta_margin_mean': 3.314695358276367, 'margin_dpo/beta_margin_std': 2.6596431732177734, 
'margin_dpo/beta_margin_grad_mean': -0.1533161848783493, 'margin_dpo/beta_margin_grad_std': 0.22066539525985718, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▌ | 668/681 [48:23<00:35, 2.76s/it] 98%|████████████████████████████████████████████████████████████████████████████▋ | 669/681 [48:26<00:33, 2.76s/it] {'loss': 0.5021, 'grad_norm': 68.40164947509766, 'learning_rate': 5.564580657695939e-10, 'margin_dpo/margin_mean': 38.299198150634766, 'margin_dpo/margin_std': 32.602203369140625, 'logps/chosen': -65.71624755859375, 'logps/rejected': -135.82337951660156, 'logps/ref_chosen': -46.116416931152344, 'logps/ref_rejected': -77.92434692382812, 'logits/chosen': -0.6119288802146912, 'logits/rejected': -0.5665886998176575, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 38.299198150634766, 'margin_dpo/beta_margin_mean': 3.8299198150634766, 'margin_dpo/beta_margin_std': 3.285043716430664, 'margin_dpo/beta_margin_grad_mean': -0.15446214377880096, 'margin_dpo/beta_margin_grad_std': 0.23781202733516693, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▋ | 669/681 [48:26<00:33, 2.76s/it] 98%|████████████████████████████████████████████████████████████████████████████▋ | 670/681 [48:28<00:29, 2.73s/it] {'loss': 0.2702, 'grad_norm': 44.809444427490234, 'learning_rate': 4.741678157389739e-10, 'margin_dpo/margin_mean': 39.02031707763672, 'margin_dpo/margin_std': 25.866138458251953, 'logps/chosen': -83.17808532714844, 'logps/rejected': -156.79319763183594, 'logps/ref_chosen': -62.34575653076172, 'logps/ref_rejected': -96.9405517578125, 'logits/chosen': -0.5694983005523682, 'logits/rejected': -0.5347045660018921, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 39.020320892333984, 'margin_dpo/beta_margin_mean': 3.9020321369171143, 'margin_dpo/beta_margin_std': 2.6109135150909424, 'margin_dpo/beta_margin_grad_mean': -0.10760509222745895, 
'margin_dpo/beta_margin_grad_std': 0.15676988661289215, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▋ | 670/681 [48:28<00:29, 2.73s/it] 99%|████████████████████████████████████████████████████████████████████████████▊ | 671/681 [48:31<00:26, 2.67s/it] {'loss': 0.3555, 'grad_norm': 48.393497467041016, 'learning_rate': 3.9845280344705245e-10, 'margin_dpo/margin_mean': 35.541099548339844, 'margin_dpo/margin_std': 28.457447052001953, 'logps/chosen': -72.3186264038086, 'logps/rejected': -143.67893981933594, 'logps/ref_chosen': -48.00010681152344, 'logps/ref_rejected': -83.81932067871094, 'logits/chosen': -0.5919187068939209, 'logits/rejected': -0.5590361952781677, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 35.54109573364258, 'margin_dpo/beta_margin_mean': 3.554109811782837, 'margin_dpo/beta_margin_std': 2.9027013778686523, 'margin_dpo/beta_margin_grad_mean': -0.13741746544837952, 'margin_dpo/beta_margin_grad_std': 0.1758362054824829, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████▊ | 671/681 [48:31<00:26, 2.67s/it] 99%|████████████████████████████████████████████████████████████████████████████▉ | 672/681 [48:33<00:23, 2.64s/it] {'loss': 0.4842, 'grad_norm': 66.19140625, 'learning_rate': 3.293150240547549e-10, 'margin_dpo/margin_mean': 32.394371032714844, 'margin_dpo/margin_std': 29.72500228881836, 'logps/chosen': -82.76466369628906, 'logps/rejected': -149.71588134765625, 'logps/ref_chosen': -58.583290100097656, 'logps/ref_rejected': -93.14014434814453, 'logits/chosen': -0.6310614347457886, 'logits/rejected': -0.5937498211860657, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 32.394371032714844, 'margin_dpo/beta_margin_mean': 3.2394371032714844, 'margin_dpo/beta_margin_std': 3.0078792572021484, 'margin_dpo/beta_margin_grad_mean': -0.1728401631116867, 'margin_dpo/beta_margin_grad_std': 0.21923863887786865, 'epoch': 0.99} 
99%|████████████████████████████████████████████████████████████████████████████▉ | 672/681 [48:33<00:23, 2.64s/it] 99%|█████████████████████████████████████████████████████████████████████████████ | 673/681 [48:36<00:20, 2.55s/it] {'loss': 0.3112, 'grad_norm': 43.1835823059082, 'learning_rate': 2.6675629940689504e-10, 'margin_dpo/margin_mean': 37.077301025390625, 'margin_dpo/margin_std': 27.354259490966797, 'logps/chosen': -67.85647583007812, 'logps/rejected': -143.50682067871094, 'logps/ref_chosen': -46.72320556640625, 'logps/ref_rejected': -85.29623413085938, 'logits/chosen': -0.6048033237457275, 'logits/rejected': -0.5747998952865601, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.07730484008789, 'margin_dpo/beta_margin_mean': 3.707730531692505, 'margin_dpo/beta_margin_std': 2.7531516551971436, 'margin_dpo/beta_margin_grad_mean': -0.1224966049194336, 'margin_dpo/beta_margin_grad_std': 0.16890129446983337, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████ | 673/681 [48:36<00:20, 2.55s/it] 99%|█████████████████████████████████████████████████████████████████████████████▏| 674/681 [48:39<00:18, 2.60s/it] {'loss': 0.2851, 'grad_norm': 36.11240005493164, 'learning_rate': 2.1077827798404725e-10, 'margin_dpo/margin_mean': 37.874568939208984, 'margin_dpo/margin_std': 28.31113052368164, 'logps/chosen': -67.47659301757812, 'logps/rejected': -129.95156860351562, 'logps/ref_chosen': -45.445526123046875, 'logps/ref_rejected': -70.04593658447266, 'logits/chosen': -0.5830689668655396, 'logits/rejected': -0.5558980703353882, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.874568939208984, 'margin_dpo/beta_margin_mean': 3.78745698928833, 'margin_dpo/beta_margin_std': 2.833228826522827, 'margin_dpo/beta_margin_grad_mean': -0.11657389253377914, 'margin_dpo/beta_margin_grad_std': 0.1537243127822876, 'epoch': 0.99} 
99%|█████████████████████████████████████████████████████████████████████████████▏| 674/681 [48:39<00:18, 2.60s/it] 99%|█████████████████████████████████████████████████████████████████████████████▎| 675/681 [48:41<00:15, 2.58s/it] {'loss': 0.3902, 'grad_norm': 66.7328109741211, 'learning_rate': 1.6138243485910863e-10, 'margin_dpo/margin_mean': 39.495147705078125, 'margin_dpo/margin_std': 27.606351852416992, 'logps/chosen': -64.9262924194336, 'logps/rejected': -134.33714294433594, 'logps/ref_chosen': -44.17628479003906, 'logps/ref_rejected': -74.09197998046875, 'logits/chosen': -0.5768786668777466, 'logits/rejected': -0.5500950813293457, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 39.495147705078125, 'margin_dpo/beta_margin_mean': 3.949514865875244, 'margin_dpo/beta_margin_std': 2.772658109664917, 'margin_dpo/beta_margin_grad_mean': -0.11795066297054291, 'margin_dpo/beta_margin_grad_std': 0.21005932986736298, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████▎| 675/681 [48:41<00:15, 2.58s/it] 99%|█████████████████████████████████████████████████████████████████████████████▍| 676/681 [48:44<00:13, 2.64s/it] {'loss': 0.4162, 'grad_norm': 79.0772933959961, 'learning_rate': 1.1857007165852472e-10, 'margin_dpo/margin_mean': 36.29708480834961, 'margin_dpo/margin_std': 28.651344299316406, 'logps/chosen': -96.71508026123047, 'logps/rejected': -149.972412109375, 'logps/ref_chosen': -71.39852142333984, 'logps/ref_rejected': -88.3587646484375, 'logits/chosen': -0.615682065486908, 'logits/rejected': -0.5794901847839355, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.29708480834961, 'margin_dpo/beta_margin_mean': 3.629708766937256, 'margin_dpo/beta_margin_std': 2.896389961242676, 'margin_dpo/beta_margin_grad_mean': -0.1375354677438736, 'margin_dpo/beta_margin_grad_std': 0.20353099703788757, 'epoch': 0.99} 
99%|█████████████████████████████████████████████████████████████████████████████▍| 676/681 [48:44<00:13, 2.64s/it] 99%|█████████████████████████████████████████████████████████████████████████████▌| 677/681 [48:46<00:10, 2.55s/it] {'loss': 0.4482, 'grad_norm': 65.4261245727539, 'learning_rate': 8.23423165278725e-11, 'margin_dpo/margin_mean': 37.449485778808594, 'margin_dpo/margin_std': 28.472801208496094, 'logps/chosen': -79.63191986083984, 'logps/rejected': -138.780517578125, 'logps/ref_chosen': -56.52743911743164, 'logps/ref_rejected': -78.22654724121094, 'logits/chosen': -0.5974393486976624, 'logits/rejected': -0.5463284254074097, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 37.449485778808594, 'margin_dpo/beta_margin_mean': 3.744948625564575, 'margin_dpo/beta_margin_std': 2.872178792953491, 'margin_dpo/beta_margin_grad_mean': -0.13777747750282288, 'margin_dpo/beta_margin_grad_std': 0.22497375309467316, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████▌| 677/681 [48:46<00:10, 2.55s/it] 100%|█████████████████████████████████████████████████████████████████████████████▋| 678/681 [48:49<00:07, 2.51s/it] {'loss': 0.4485, 'grad_norm': 50.95887756347656, 'learning_rate': 5.270012410216185e-11, 'margin_dpo/margin_mean': 36.70520782470703, 'margin_dpo/margin_std': 31.16322135925293, 'logps/chosen': -67.8372802734375, 'logps/rejected': -139.0126495361328, 'logps/ref_chosen': -46.13447570800781, 'logps/ref_rejected': -80.60462951660156, 'logits/chosen': -0.5905472040176392, 'logits/rejected': -0.5667222738265991, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.7052116394043, 'margin_dpo/beta_margin_mean': 3.6705210208892822, 'margin_dpo/beta_margin_std': 3.1234190464019775, 'margin_dpo/beta_margin_grad_mean': -0.16534043848514557, 'margin_dpo/beta_margin_grad_std': 0.21265582740306854, 'epoch': 1.0} 
100%|█████████████████████████████████████████████████████████████████████████████▋| 678/681 [48:49<00:07, 2.51s/it] 100%|█████████████████████████████████████████████████████████████████████████████▊| 679/681 [48:51<00:05, 2.60s/it] {'loss': 0.3274, 'grad_norm': 47.8213005065918, 'learning_rate': 2.9644275480772416e-11, 'margin_dpo/margin_mean': 36.85191345214844, 'margin_dpo/margin_std': 26.87795639038086, 'logps/chosen': -72.65241241455078, 'logps/rejected': -135.8075408935547, 'logps/ref_chosen': -50.294921875, 'logps/ref_rejected': -76.59813690185547, 'logits/chosen': -0.6013349294662476, 'logits/rejected': -0.5681812167167664, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.85191345214844, 'margin_dpo/beta_margin_mean': 3.6851911544799805, 'margin_dpo/beta_margin_std': 2.695528745651245, 'margin_dpo/beta_margin_grad_mean': -0.11131599545478821, 'margin_dpo/beta_margin_grad_std': 0.1771748960018158, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████▊| 679/681 [48:51<00:05, 2.60s/it] 100%|█████████████████████████████████████████████████████████████████████████████▉| 680/681 [48:55<00:02, 2.75s/it] {'loss': 0.381, 'grad_norm': 57.492881774902344, 'learning_rate': 1.31753782067201e-11, 'margin_dpo/margin_mean': 36.17931365966797, 'margin_dpo/margin_std': 29.298704147338867, 'logps/chosen': -99.56130981445312, 'logps/rejected': -171.20968627929688, 'logps/ref_chosen': -76.91569519042969, 'logps/ref_rejected': -112.384765625, 'logits/chosen': -0.6063967347145081, 'logits/rejected': -0.5727298259735107, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 36.1793098449707, 'margin_dpo/beta_margin_mean': 3.6179311275482178, 'margin_dpo/beta_margin_std': 2.948613405227661, 'margin_dpo/beta_margin_grad_mean': -0.1346154808998108, 'margin_dpo/beta_margin_grad_std': 0.1998624950647354, 'epoch': 1.0} 
100%|█████████████████████████████████████████████████████████████████████████████▉| 680/681 [48:55<00:02, 2.75s/it] 100%|██████████████████████████████████████████████████████████████████████████████| 681/681 [48:57<00:00, 2.70s/it] {'loss': 0.4583, 'grad_norm': 52.16978073120117, 'learning_rate': 3.2938662507808745e-12, 'margin_dpo/margin_mean': 31.793434143066406, 'margin_dpo/margin_std': 28.037933349609375, 'logps/chosen': -84.20474243164062, 'logps/rejected': -143.598876953125, 'logps/ref_chosen': -60.957279205322266, 'logps/ref_rejected': -88.5579833984375, 'logits/chosen': -0.6496413946151733, 'logits/rejected': -0.6223350167274475, 'margin_dpo/beta': 0.10000000149011612, 'margin_dpo/loss_margin_mean': 31.793434143066406, 'margin_dpo/beta_margin_mean': 3.1793434619903564, 'margin_dpo/beta_margin_std': 2.862551212310791, 'margin_dpo/beta_margin_grad_mean': -0.16029776632785797, 'margin_dpo/beta_margin_grad_std': 0.20890314877033234, 'epoch': 1.0} 100%|██████████████████████████████████████████████████████████████████████████████| 681/681 [48:57<00:00, 2.70s/it][INFO|trainer.py:3984] 2026-04-17 22:15:44,156 >> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-681 [INFO|configuration_utils.py:419] 2026-04-17 22:15:44,173 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-681/config.json [INFO|configuration_utils.py:911] 2026-04-17 22:15:44,189 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-681/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-17 22:16:50,024 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. 
You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-681/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-17 22:16:50,066 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-681/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-17 22:16:50,129 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-681/special_tokens_map.json [INFO|trainer.py:4083] 2026-04-17 22:20:52,552 >> Deleting older checkpoint [/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/checkpoint-400] due to args.save_total_limit [INFO|trainer.py:2681] 2026-04-17 22:20:55,506 >> Training completed. 
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 3273.0613, 'train_samples_per_second': 13.32, 'train_steps_per_second': 0.208, 'train_loss': 0.5730435011495403, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 681/681 [54:25<00:00, 2.70s/it]
100%|██████████████████████████████████████████████████████████████████████████████| 681/681 [54:25<00:00, 4.79s/it]
***** train metrics *****
  epoch                    = 1.0
  total_flos               = 0GF
  train_loss               = 0.573
  train_runtime            = 0:54:33.06
  train_samples            = 43598
  train_samples_per_second = 13.32
  train_steps_per_second   = 0.208
2026-04-17 22:20:55 - INFO - __main__ - *** Training complete ***
2026-04-17 22:20:55 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-17 22:21:14,696 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/config.json
[INFO|configuration_utils.py:911] 2026-04-17 22:21:14,701 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-17 22:22:17,207 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards.
You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-17 22:22:17,351 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-17 22:22:17,546 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/special_tokens_map.json
2026-04-17 22:22:17 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312
[INFO|modelcard.py:450] 2026-04-17 22:22:18,393 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-17 22:22:18,543 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-margin-dpo-hh-helpful-4xh200-batch-64-20260417-212312/config.json
2026-04-17 22:22:18 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-17 22:22:18,546 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-17 22:22:18,546 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-17 22:22:18,546 >> Batch size = 8
0%| | 0/73 [00:00