2026-04-18 00:19:44 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8')
2026-04-18 00:19:44 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/feng.yulu/dynamic-dpo-v4/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, preprocessing_log_samples=0, preprocessing_log_dir=None)
2026-04-18 00:19:44 - INFO - __main__ - Training/evaluation parameters EpsilonDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.1, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_dropout=True,
disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, epsilon=0.01, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=100, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=FDivergenceType.REVERSE_KL, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=2, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200/runs/Apr18_00-19-44_d4054, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=1, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, 
no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, remove_unused_columns=False, report_to=['wandb'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=SaveStrategy.STEPS, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/scratch/feng.yulu/dynamic-dpo-v4/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=epsilon_dpo, truncation_mode=keep_end, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, wandb_project=ood-run-4xh200, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, )
2026-04-18 00:19:44 - INFO - __main__ - W&B project: ood-run-4xh200
2026-04-18 00:19:44 - INFO - __main__ - Epsilon-DPO
parameters: beta=0.1, epsilon=0.01, gradient_accumulation_steps=2
2026-04-18 00:19:44 - INFO - __main__ - Using persistent HF datasets cache at /scratch/feng.yulu/dynamic-dpo-v4/hf/datasets
2026-04-18 00:19:47 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.).
Normalizing raw HH preferences (train): 0%| | 0/43598 [00:00> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2026-04-18 00:19:52,283 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2058] 2026-04-18 00:19:52,284 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2026-04-18 00:19:52,284 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2026-04-18 00:19:52,284 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2026-04-18 00:19:52,284 >> loading file chat_template.jinja
Normalizing raw HH preferences (test): 100%|████████████████████████████| 2339/2339 [00:00<00:00, 8144.38 examples/s]
[INFO|tokenization_utils_base.py:2323] 2026-04-18 00:19:52,697 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2026-04-18 00:19:53 - INFO - __main__ - Processed train sample 41905:
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|> What is the best way to clean my refrigerator?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|eot_id|><|start_header_id|>user<|end_header_id|> What are your ideas?<|eot_id|>
Chosen: <|start_header_id|>assistant<|end_header_id|> Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|eot_id|>
Rejected: <|start_header_id|>assistant<|end_header_id|> I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. I was also thinking that you could try using different tools, like a toothbrush, to clean the inside. Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|eot_id|>
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
warnings.warn(
[INFO|configuration_utils.py:691] 2026-04-18 00:19:53,050 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/config.json
[INFO|configuration_utils.py:765] 2026-04-18 00:19:53,057 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 }
[INFO|modeling_utils.py:1121] 2026-04-18 00:19:53,459 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2026-04-18 00:19:53,460 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[WARNING|logging.py:328] 2026-04-18 00:19:53,464 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1142] 2026-04-18 00:19:53,465 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false }
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 7/7 [00:00<00:00, 168.24it/s]
[WARNING|trainer.py:821] 2026-04-18 00:19:53,646 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 7/7 [00:12<00:00, 1.79s/it]
[INFO|modeling_utils.py:4926] 2026-04-18 00:20:06,053 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4934] 2026-04-18 00:20:06,053 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-18 00:20:06,055 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-18 00:20:06,055 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 }
[INFO|configuration_utils.py:691] 2026-04-18 00:20:06,057 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/config.json
[INFO|configuration_utils.py:765] 2026-04-18 00:20:06,058 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "vocab_size": 128256 }
[INFO|modeling_utils.py:1121] 2026-04-18 00:20:06,059 >> loading weights file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/model.safetensors.index.json
[INFO|modeling_utils.py:2167] 2026-04-18 00:20:06,061 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1142] 2026-04-18 00:20:06,064 >> Generate config GenerationConfig { "bos_token_id": 128000, "eos_token_id": 128001, "use_cache": false }
Loading checkpoint shards: 0%| | 0/7 [00:00> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4934] 2026-04-18 00:20:16,355 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2026-04-18 00:20:16,360 >> loading configuration file /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-sft-hh-helpful-4xh200-batch-64-20260416-162101/generation_config.json
[INFO|configuration_utils.py:1142] 2026-04-18 00:20:16,360 >> Generate config GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": 128001, "max_length": 4096, "temperature": 0.6, "top_p": 0.9 }
[WARNING|trainer.py:821] 2026-04-18 00:20:16,361 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead.
[WARNING|trainer.py:816] 2026-04-18 00:20:16,363 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Tokenizing train (num_proc=12): 0%| | 0/43598 [00:00>
Saving the dataset (0/2 shards): 0%| | 0/43598 [00:00>
Tokenizing test (num_proc=12): 0%| | 0/2339 [00:00>
Saving the dataset (0/1 shards): 0%| | 0/2339 [00:00>
/home/feng.yulu/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:521: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `EpsilonDPOTrainer.__init__`. Use `processing_class` instead.
super().__init__(
[INFO|trainer.py:748] 2026-04-18 00:37:56,168 >> Using auto half precision backend
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight.
warnings.warn(
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in LlamaDecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight.
warnings.warn(
/home/feng.yulu/.conda/envs/dpo_venv/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
[INFO|trainer.py:2414] 2026-04-18 00:38:00,463 >> ***** Running training *****
[INFO|trainer.py:2415] 2026-04-18 00:38:00,463 >> Num examples = 43,598
[INFO|trainer.py:2416] 2026-04-18 00:38:00,463 >> Num Epochs = 1
[INFO|trainer.py:2417] 2026-04-18 00:38:00,463 >> Instantaneous batch size per device = 8
[INFO|trainer.py:2420] 2026-04-18 00:38:00,463 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2421] 2026-04-18 00:38:00,463 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2422] 2026-04-18 00:38:00,463 >> Total optimization steps = 681
[INFO|trainer.py:2423] 2026-04-18 00:38:00,464 >> Number of trainable parameters = 2,007,565,312
[INFO|integration_utils.py:831] 2026-04-18 00:38:00,465 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: can-not-fand (can-not-fand-northeastern-university).
Use `wandb login --relogin` to force relogin
wandb: wandb version 0.26.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /scratch/feng.yulu/dynamic-dpo-v4/wandb/wandb/run-20260418_003804-1eq3cqos
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920
wandb: ⭐️ View project at https://wandb.ai/can-not-fand-northeastern-university/ood-run-4xh200
wandb: 🚀 View run at https://wandb.ai/can-not-fand-northeastern-university/ood-run-4xh200/runs/1eq3cqos
[WARNING|modeling_utils.py:1713] 2026-04-18 00:38:10,827 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
0%| | 1/681 [00:02<32:06, 2.83s/it] {'loss': 1.3894, 'grad_norm': 83.56657409667969, 'learning_rate': 0.0, 'rewards/chosen': 0.00041806945228017867, 'rewards/rejected': 0.003031575120985508, 'rewards/accuracies': 0.53125, 'rewards/margins': -0.0026135058142244816, 'logps/chosen': -50.1435661315918, 'logps/rejected': -74.09991455078125, 'logps/ref_chosen': -50.14883804321289, 'logps/ref_rejected': -74.1280517578125, 'logits/chosen': -0.6899334788322449, 'logits/rejected': -0.37887901067733765, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'epsilon_dpo/beta': 0.0999474972486496, 'epsilon_dpo/loss_margin_mean': -0.02287048101425171, 'epsilon_dpo/beta_margin_mean': -0.0026135474909096956, 'epsilon_dpo/beta_margin_std': 0.04210928454995155, 'epsilon_dpo/beta_margin_grad_mean': -0.5006521940231323, 'epsilon_dpo/beta_margin_grad_std': 0.010521039366722107, 'kl/beta': 0.10000000149011612, 'kl/avg_steps': 0.0625, 'epoch': 0.0}
0%|▏ | 2/681 [00:05<33:46, 2.98s/it] {'loss': 1.3935, 'grad_norm': 72.1585464477539, 'learning_rate': 7.246376811594203e-09, 'rewards/chosen': -0.0036358418874442577, 'rewards/rejected': 0.003211432136595249, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.00684727355837822, 'logps/chosen': -52.65569305419922, 'logps/rejected': -75.27340698242188, 'logps/ref_chosen': -52.620704650878906, 'logps/ref_rejected': -75.30413818359375, 'logits/chosen': -0.6022520065307617, 'logits/rejected': -0.36671221256256104, 'kl/p_epsilon_steps': 0.4375, 'kl/n_epsilon_steps': 0.5625, 'epsilon_dpo/beta': 0.1000724583864212, 'epsilon_dpo/loss_margin_mean': -0.06572240591049194, 'epsilon_dpo/beta_margin_mean': -0.006847253534942865, 'epsilon_dpo/beta_margin_std': 0.03527917340397835, 'epsilon_dpo/beta_margin_grad_mean': -0.5017112493515015, 'epsilon_dpo/beta_margin_grad_std': 0.0088164322078228, 'kl/beta': 0.09993753582239151, 'kl/avg_steps': -0.125, 'epoch': 0.0}
0%|▎ | 3/681 [00:08<34:02, 3.01s/it] {'loss': 1.3832, 'grad_norm': 70.64185333251953, 'learning_rate': 1.4492753623188406e-08, 'rewards/chosen': 0.0015193297294899821, 'rewards/rejected': -0.0019117268966510892, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.0034310566261410713, 'logps/chosen': -60.96543502807617, 'logps/rejected': -68.69351196289062, 'logps/ref_chosen': -60.98159408569336, 'logps/ref_rejected': -68.67259216308594, 'logits/chosen': -0.5908145308494568, 'logits/rejected': -0.4275705814361572, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.421875, 'epsilon_dpo/beta': 0.0999162569642067, 'epsilon_dpo/loss_margin_mean': 0.037074267864227295, 'epsilon_dpo/beta_margin_mean': 0.00343105080537498, 'epsilon_dpo/beta_margin_std': 0.0342276468873024, 'epsilon_dpo/beta_margin_grad_mean': -0.4991426467895508, 'epsilon_dpo/beta_margin_grad_std': 0.008554468862712383, 'kl/beta': 0.10006261616945267, 'kl/avg_steps': 0.15625, 'epoch': 0.0}
1%|▍ | 4/681 [00:12<34:08, 3.03s/it] {'loss': 1.396, 'grad_norm': 72.27873229980469, 'learning_rate': 2.1739130434782606e-08, 'rewards/chosen': 0.0008886073483154178, 'rewards/rejected': 0.010191047564148903, 'rewards/accuracies': 0.390625, 'rewards/margins': -0.009302439168095589, 'logps/chosen': -56.75792694091797, 'logps/rejected': -86.54693603515625, 'logps/ref_chosen': -56.76771545410156, 'logps/ref_rejected': -86.64710998535156, 'logits/chosen': -0.6021588444709778, 'logits/rejected': -0.42872855067253113, 'kl/p_epsilon_steps': 0.390625, 'kl/n_epsilon_steps': 0.609375, 'epsilon_dpo/beta': 0.10013506561517715, 'epsilon_dpo/loss_margin_mean': -0.09038430452346802, 'epsilon_dpo/beta_margin_mean': -0.009302487596869469, 'epsilon_dpo/beta_margin_std': 0.038918618112802505, 'epsilon_dpo/beta_margin_grad_mean': -0.5023244619369507, 'epsilon_dpo/beta_margin_grad_std': 0.009719975292682648, 'kl/beta': 0.09990651160478592, 'kl/avg_steps': -0.21875, 'epoch': 0.01}
1%|▌ | 5/681 [00:15<34:12, 3.04s/it] {'loss': 1.3847, 'grad_norm': 89.54268646240234, 'learning_rate': 2.898550724637681e-08, 'rewards/chosen': 0.00417137099429965, 'rewards/rejected': 0.002358372090384364, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.0018129991367459297, 'logps/chosen': -53.81658935546875, 'logps/rejected': -84.12696838378906, 'logps/ref_chosen': -53.859375, 'logps/ref_rejected': -84.14918518066406, 'logits/chosen': -0.743955135345459, 'logits/rejected': -0.4869227111339569, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'epsilon_dpo/beta': 0.10004167258739471, 'epsilon_dpo/loss_margin_mean': 0.02056872844696045, 'epsilon_dpo/beta_margin_mean': 0.001813046750612557, 'epsilon_dpo/beta_margin_std': 0.03248732164502144, 'epsilon_dpo/beta_margin_grad_mean': -0.49954700469970703, 'epsilon_dpo/beta_margin_grad_std': 0.008118817582726479, 'kl/beta': 0.10012553632259369, 'kl/avg_steps': 0.09375, 'epoch': 0.01}
1%|▋ | 6/681 [00:17<32:32, 2.89s/it] {'loss': 1.383, 'grad_norm': 91.50358581542969, 'learning_rate': 3.6231884057971014e-08, 'rewards/chosen': 0.0029447507113218307, 'rewards/rejected': -0.0006924858316779137, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.0036372365429997444, 'logps/chosen': -62.976524353027344, 'logps/rejected': -92.65360260009766, 'logps/ref_chosen': -63.007484436035156, 'logps/ref_rejected': -92.64534759521484, 'logits/chosen': -0.6965059638023376, 'logits/rejected': -0.5487236976623535, 'kl/p_epsilon_steps': 0.625, 'kl/n_epsilon_steps': 0.375, 'epsilon_dpo/beta': 0.09979165345430374, 'epsilon_dpo/loss_margin_mean': 0.039216578006744385, 'epsilon_dpo/beta_margin_mean': 0.0036372877657413483, 'epsilon_dpo/beta_margin_std': 0.036579351872205734, 'epsilon_dpo/beta_margin_grad_mean': -0.4990909695625305, 'epsilon_dpo/beta_margin_grad_std': 0.00914138276129961, 'kl/beta': 0.10003175586462021, 'kl/avg_steps': 0.25, 'epoch': 0.01}
1%|▊ | 7/681 [00:20<31:53, 2.84s/it] {'loss': 1.3882, 'grad_norm': 82.27672576904297, 'learning_rate': 4.347826086956521e-08, 'rewards/chosen': 0.0001622685813345015, 'rewards/rejected': 0.0014827274717390537, 'rewards/accuracies': 0.53125, 'rewards/margins': -0.0013204591814428568, 'logps/chosen': -57.77162551879883, 'logps/rejected': -103.90780639648438, 'logps/ref_chosen': -57.774818420410156, 'logps/ref_rejected': -103.92059326171875, 'logits/chosen': -0.6239166259765625, 'logits/rejected': -0.4245404899120331, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'epsilon_dpo/beta': 0.09972991049289703, 'epsilon_dpo/loss_margin_mean': -0.009589701890945435, 'epsilon_dpo/beta_margin_mean': -0.0013204828137531877, 'epsilon_dpo/beta_margin_std': 0.046759072691202164, 'epsilon_dpo/beta_margin_grad_mean': -0.5003304481506348, 'epsilon_dpo/beta_margin_grad_std': 0.01168179139494896, 'kl/beta': 0.0997823029756546, 'kl/avg_steps': 0.0625, 'epoch': 0.01}
1%|▉ | 8/681 [00:22<30:51, 2.75s/it] {'loss': 1.3872, 'grad_norm': 78.22746276855469, 'learning_rate': 5.0724637681159424e-08, 'rewards/chosen': 0.0023498879745602608, 'rewards/rejected': 0.002822375390678644, 'rewards/accuracies': 0.546875, 'rewards/margins': -0.0004724874161183834, 'logps/chosen': -58.69141387939453, 'logps/rejected': -79.284912109375, 'logps/ref_chosen': -58.716033935546875, 'logps/ref_rejected': -79.3114242553711, 'logits/chosen': -0.5672747492790222, 'logits/rejected': -0.4737367630004883, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'epsilon_dpo/beta': 0.09966761618852615, 'epsilon_dpo/loss_margin_mean': -0.0018860399723052979, 'epsilon_dpo/beta_margin_mean': -0.0004724572936538607, 'epsilon_dpo/beta_margin_std': 0.040366336703300476, 'epsilon_dpo/beta_margin_grad_mean': -0.5001169443130493, 'epsilon_dpo/beta_margin_grad_std': 0.010085917077958584, 'kl/beta': 0.09971997886896133, 'kl/avg_steps': 0.0625, 'epoch': 0.01}
1%|█ | 9/681 [00:25<30:58, 2.77s/it] {'loss': 1.3862, 'grad_norm': 85.07678985595703, 'learning_rate': 5.797101449275362e-08, 'rewards/chosen': 0.002859012922272086, 'rewards/rejected': 0.0023539173416793346, 'rewards/accuracies': 0.484375, 'rewards/margins': 0.0005050955805927515, 'logps/chosen': -69.83699035644531, 'logps/rejected': -99.580810546875, 'logps/ref_chosen': -69.8668441772461, 'logps/ref_rejected': -99.6026611328125, 'logits/chosen': -0.669657826423645, 'logits/rejected': -0.44206157326698303, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'epsilon_dpo/beta': 0.099636510014534, 'epsilon_dpo/loss_margin_mean': 0.008003711700439453, 'epsilon_dpo/beta_margin_mean': 0.0005050363834016025, 'epsilon_dpo/beta_margin_std': 0.04067489877343178, 'epsilon_dpo/beta_margin_grad_mean': -0.4998748302459717, 'epsilon_dpo/beta_margin_grad_std': 0.01016149390488863, 'kl/beta': 0.09965769201517105, 'kl/avg_steps': 0.03125, 'epoch': 0.01}
1%|█▏ | 10/681 [00:28<31:18, 2.80s/it] {'loss': 1.3879, 'grad_norm': 70.06171417236328, 'learning_rate': 6.521739130434782e-08, 'rewards/chosen': -0.0016418680315837264, 'rewards/rejected': -0.000346622196957469, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0012952459510415792, 'logps/chosen': -48.37286376953125, 'logps/rejected': -80.37699890136719, 'logps/ref_chosen': -48.35768508911133, 'logps/ref_rejected': -80.37206268310547, 'logits/chosen': -0.6811233162879944, 'logits/rejected': -0.44765201210975647, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'epsilon_dpo/beta': 0.0995742455124855, 'epsilon_dpo/loss_margin_mean': -0.010241597890853882, 'epsilon_dpo/beta_margin_mean': -0.0012951751705259085, 'epsilon_dpo/beta_margin_std': 0.03592904284596443, 'epsilon_dpo/beta_margin_grad_mean': -0.5003238916397095, 'epsilon_dpo/beta_margin_grad_std': 0.008978527970612049, 'kl/beta': 0.0996265560388565, 'kl/avg_steps': 0.0625, 'epoch': 0.01}
2%|█▎ | 11/681 [00:31<31:41, 2.84s/it] {'loss': 1.384, 'grad_norm': 67.98162078857422, 'learning_rate': 7.246376811594203e-08, 'rewards/chosen': -0.00022831570822745562, 'rewards/rejected': -0.0028259989339858294, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.0025976833421736956, 'logps/chosen': -53.018104553222656, 'logps/rejected': -87.81060791015625, 'logps/ref_chosen': -53.01685333251953, 'logps/ref_rejected': -87.78038024902344, 'logits/chosen': -0.5316141843795776, 'logits/rejected': -0.3694092035293579, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'epsilon_dpo/beta': 0.09954316914081573, 'epsilon_dpo/loss_margin_mean': 0.028973519802093506, 'epsilon_dpo/beta_margin_mean': 0.0025976356118917465, 'epsilon_dpo/beta_margin_std': 0.03580310195684433, 'epsilon_dpo/beta_margin_grad_mean': -0.49935105443000793, 'epsilon_dpo/beta_margin_grad_std': 0.008948074653744698, 'kl/beta': 0.099564328789711, 'kl/avg_steps': 0.03125, 'epoch': 0.02}
2%|█▍ | 12/681 [00:34<31:25, 2.82s/it] {'loss': 1.3859, 'grad_norm': 89.55339813232422, 'learning_rate': 7.971014492753623e-08, 'rewards/chosen': -0.0030919623095542192, 'rewards/rejected': -0.0039820256642997265, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.0008900631219148636, 'logps/chosen': -61.834747314453125, 'logps/rejected': -104.89984893798828, 'logps/ref_chosen': -61.80543518066406, 'logps/ref_rejected': -104.85826873779297, 'logits/chosen': -0.6296329498291016, 'logits/rejected': -0.4155291020870209, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'epsilon_dpo/beta': 0.09948095679283142, 'epsilon_dpo/loss_margin_mean': 0.012266382575035095, 'epsilon_dpo/beta_margin_mean': 0.0008899245294742286, 'epsilon_dpo/beta_margin_std': 0.04637879133224487, 'epsilon_dpo/beta_margin_grad_mean': -0.4997762441635132, 'epsilon_dpo/beta_margin_grad_std': 0.01158287562429905, 'kl/beta': 0.09953322261571884, 'kl/avg_steps': 0.0625, 'epoch': 0.02}
2%|█▌ | 13/681 [00:37<31:50, 2.86s/it] {'loss': 1.381, 'grad_norm': 79.01435852050781, 'learning_rate': 8.695652173913042e-08, 'rewards/chosen': 0.005090477876365185, 'rewards/rejected': -0.0007765874033793807, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.005867065396159887, 'logps/chosen': -64.20757293701172, 'logps/rejected': -87.21273803710938, 'logps/ref_chosen': -64.26036071777344, 'logps/ref_rejected': -87.20307922363281, 'logits/chosen': -0.6465901136398315, 'logits/rejected': -0.5206432342529297, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375,
'epsilon_dpo/beta': 0.09944991022348404, 'epsilon_dpo/loss_margin_mean': 0.06244337558746338, 'epsilon_dpo/beta_margin_mean': 0.005867047235369682, 'epsilon_dpo/beta_margin_std': 0.04732148349285126, 'epsilon_dpo/beta_margin_grad_mean': -0.4985347092151642, 'epsilon_dpo/beta_margin_grad_std': 0.011819672770798206, 'kl/beta': 0.09947105497121811, 'kl/avg_steps': 0.03125, 'epoch': 0.02} 2%|█▌ | 13/681 [00:37<31:50, 2.86s/it] 2%|█▌ | 14/681 [00:40<31:17, 2.81s/it] {'loss': 1.3915, 'grad_norm': 85.693603515625, 'learning_rate': 9.420289855072464e-08, 'rewards/chosen': -0.004551096353679895, 'rewards/rejected': 0.00026224181056022644, 'rewards/accuracies': 0.46875, 'rewards/margins': -0.0048133376985788345, 'logps/chosen': -58.15471649169922, 'logps/rejected': -104.04637145996094, 'logps/ref_chosen': -58.11021423339844, 'logps/ref_rejected': -104.04708099365234, 'logits/chosen': -0.6981677412986755, 'logits/rejected': -0.4689730107784271, 'kl/p_epsilon_steps': 0.453125, 'kl/n_epsilon_steps': 0.546875, 'epsilon_dpo/beta': 0.09954315423965454, 'epsilon_dpo/loss_margin_mean': -0.045211225748062134, 'epsilon_dpo/beta_margin_mean': -0.004813310690224171, 'epsilon_dpo/beta_margin_std': 0.04126282408833504, 'epsilon_dpo/beta_margin_grad_mean': -0.5012027025222778, 'epsilon_dpo/beta_margin_grad_std': 0.010310296900570393, 'kl/beta': 0.09943997859954834, 'kl/avg_steps': -0.09375, 'epoch': 0.02} 2%|█▌ | 14/681 [00:40<31:17, 2.81s/it] 2%|█▋ | 15/681 [00:42<31:10, 2.81s/it] {'loss': 1.3812, 'grad_norm': 63.776878356933594, 'learning_rate': 1.0144927536231885e-07, 'rewards/chosen': 0.0004750732332468033, 'rewards/rejected': -0.004991788417100906, 'rewards/accuracies': 0.609375, 'rewards/margins': 0.005466861184686422, 'logps/chosen': -56.960899353027344, 'logps/rejected': -80.86045837402344, 'logps/ref_chosen': -56.96691131591797, 'logps/ref_rejected': -80.80863952636719, 'logits/chosen': -0.4637385308742523, 'logits/rejected': -0.35175687074661255, 'kl/p_epsilon_steps': 0.625, 
'kl/n_epsilon_steps': 0.375, 'epsilon_dpo/beta': 0.09929438680410385, 'epsilon_dpo/loss_margin_mean': 0.05783188343048096, 'epsilon_dpo/beta_margin_mean': 0.005466893315315247, 'epsilon_dpo/beta_margin_std': 0.03858000040054321, 'epsilon_dpo/beta_margin_grad_mean': -0.498632550239563, 'epsilon_dpo/beta_margin_grad_std': 0.009638694114983082, 'kl/beta': 0.09953329712152481, 'kl/avg_steps': 0.25, 'epoch': 0.02} 2%|█▋ | 15/681 [00:42<31:10, 2.81s/it] 2%|█▊ | 16/681 [00:45<30:35, 2.76s/it] {'loss': 1.385, 'grad_norm': 83.94454956054688, 'learning_rate': 1.0869565217391303e-07, 'rewards/chosen': -0.002612006152048707, 'rewards/rejected': -0.004172259010374546, 'rewards/accuracies': 0.484375, 'rewards/margins': 0.0015602526254951954, 'logps/chosen': -61.76502990722656, 'logps/rejected': -84.41305541992188, 'logps/ref_chosen': -61.739891052246094, 'logps/ref_rejected': -84.36947631835938, 'logits/chosen': -0.6619457006454468, 'logits/rejected': -0.5016107559204102, 'kl/p_epsilon_steps': 0.46875, 'kl/n_epsilon_steps': 0.53125, 'epsilon_dpo/beta': 0.09935706853866577, 'epsilon_dpo/loss_margin_mean': 0.018438905477523804, 'epsilon_dpo/beta_margin_mean': 0.0015602321363985538, 'epsilon_dpo/beta_margin_std': 0.034774623811244965, 'epsilon_dpo/beta_margin_grad_mean': -0.49960967898368835, 'epsilon_dpo/beta_margin_grad_std': 0.008690658025443554, 'kl/beta': 0.0992850810289383, 'kl/avg_steps': -0.0625, 'epoch': 0.02} 2%|█▊ | 16/681 [00:45<30:35, 2.76s/it] 2%|█▉ | 17/681 [00:48<30:09, 2.73s/it] {'loss': 1.3804, 'grad_norm': 77.86913299560547, 'learning_rate': 1.1594202898550725e-07, 'rewards/chosen': 0.003507263958454132, 'rewards/rejected': -0.0027497014962136745, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.006256964989006519, 'logps/chosen': -67.6737060546875, 'logps/rejected': -85.40813446044922, 'logps/ref_chosen': -67.71033477783203, 'logps/ref_rejected': -85.37865447998047, 'logits/chosen': -0.6108927130699158, 'logits/rejected': -0.4828973710536957, 
'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'epsilon_dpo/beta': 0.09907767176628113, 'epsilon_dpo/loss_margin_mean': 0.0661078691482544, 'epsilon_dpo/beta_margin_mean': 0.00625709630548954, 'epsilon_dpo/beta_margin_std': 0.03781072795391083, 'epsilon_dpo/beta_margin_grad_mean': -0.4984363615512848, 'epsilon_dpo/beta_margin_grad_std': 0.0094489436596632, 'kl/beta': 0.09934717416763306, 'kl/avg_steps': 0.28125, 'epoch': 0.02} 2%|█▉ | 17/681 [00:48<30:09, 2.73s/it] 3%|██ | 18/681 [00:50<29:59, 2.71s/it] {'loss': 1.3814, 'grad_norm': 81.35855865478516, 'learning_rate': 1.2318840579710146e-07, 'rewards/chosen': 0.0019111181609332561, 'rewards/rejected': -0.003359419060871005, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.005270537454634905, 'logps/chosen': -47.718833923339844, 'logps/rejected': -75.50761413574219, 'logps/ref_chosen': -47.7394905090332, 'logps/ref_rejected': -75.4722900390625, 'logits/chosen': -0.7670595645904541, 'logits/rejected': -0.5500361323356628, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'epsilon_dpo/beta': 0.09879979491233826, 'epsilon_dpo/loss_margin_mean': 0.05598863959312439, 'epsilon_dpo/beta_margin_mean': 0.005270513240247965, 'epsilon_dpo/beta_margin_std': 0.03585405647754669, 'epsilon_dpo/beta_margin_grad_mean': -0.4986821115016937, 'epsilon_dpo/beta_margin_grad_std': 0.008960261940956116, 'kl/beta': 0.0990685448050499, 'kl/avg_steps': 0.28125, 'epoch': 0.03} 3%|██ | 18/681 [00:50<29:59, 2.71s/it] 3%|██▏ | 19/681 [00:53<30:00, 2.72s/it] {'loss': 1.3815, 'grad_norm': 73.26107788085938, 'learning_rate': 1.3043478260869563e-07, 'rewards/chosen': -0.0023102399427443743, 'rewards/rejected': -0.007382555864751339, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.005072316154837608, 'logps/chosen': -70.22738647460938, 'logps/rejected': -89.83357238769531, 'logps/ref_chosen': -70.20535278320312, 'logps/ref_rejected': -89.75758361816406, 'logits/chosen': -0.6282287836074829, 'logits/rejected': 
-0.41298654675483704, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'epsilon_dpo/beta': 0.09870795160531998, 'epsilon_dpo/loss_margin_mean': 0.053946733474731445, 'epsilon_dpo/beta_margin_mean': 0.005072311032563448, 'epsilon_dpo/beta_margin_std': 0.03523925691843033, 'epsilon_dpo/beta_margin_grad_mean': -0.49873286485671997, 'epsilon_dpo/beta_margin_grad_std': 0.008805947378277779, 'kl/beta': 0.09879069775342941, 'kl/avg_steps': 0.09375, 'epoch': 0.03} 3%|██▏ | 19/681 [00:53<30:00, 2.72s/it] 3%|██▎ | 20/681 [00:56<30:01, 2.73s/it] {'loss': 1.3793, 'grad_norm': 72.92608642578125, 'learning_rate': 1.3768115942028986e-07, 'rewards/chosen': -0.00037431088276207447, 'rewards/rejected': -0.007767794653773308, 'rewards/accuracies': 0.484375, 'rewards/margins': 0.007393484003841877, 'logps/chosen': -50.805747985839844, 'logps/rejected': -78.90374755859375, 'logps/ref_chosen': -50.80324172973633, 'logps/ref_rejected': -78.8233413696289, 'logits/chosen': -0.7782789468765259, 'logits/rejected': -0.538977324962616, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'epsilon_dpo/beta': 0.09867718070745468, 'epsilon_dpo/loss_margin_mean': 0.07789051532745361, 'epsilon_dpo/beta_margin_mean': 0.007393495179712772, 'epsilon_dpo/beta_margin_std': 0.040283456444740295, 'epsilon_dpo/beta_margin_grad_mean': -0.4981527328491211, 'epsilon_dpo/beta_margin_grad_std': 0.010064210742712021, 'kl/beta': 0.09869816154241562, 'kl/avg_steps': 0.03125, 'epoch': 0.03} 3%|██▎ | 20/681 [00:56<30:01, 2.73s/it] 3%|██▍ | 21/681 [00:58<29:56, 2.72s/it] {'loss': 1.3739, 'grad_norm': 75.66818237304688, 'learning_rate': 1.4492753623188405e-07, 'rewards/chosen': 0.0011464322451502085, 'rewards/rejected': -0.01170315407216549, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.012849586084485054, 'logps/chosen': -50.05017852783203, 'logps/rejected': -77.98896789550781, 'logps/ref_chosen': -50.063018798828125, 'logps/ref_rejected': -77.86878967285156, 'logits/chosen': 
-0.6122225522994995, 'logits/rejected': -0.5136980414390564, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.40625, 'epsilon_dpo/beta': 0.09850744158029556, 'epsilon_dpo/loss_margin_mean': 0.13301609456539154, 'epsilon_dpo/beta_margin_mean': 0.01284959726035595, 'epsilon_dpo/beta_margin_std': 0.038444884121418, 'epsilon_dpo/beta_margin_grad_mean': -0.49679034948349, 'epsilon_dpo/beta_margin_grad_std': 0.009602558799088001, 'kl/beta': 0.09866733103990555, 'kl/avg_steps': 0.171875, 'epoch': 0.03} 3%|██▍ | 21/681 [00:59<29:56, 2.72s/it] 3%|██▌ | 22/681 [01:01<30:16, 2.76s/it] {'loss': 1.3688, 'grad_norm': 82.43994903564453, 'learning_rate': 1.5217391304347825e-07, 'rewards/chosen': 0.0030678450129926205, 'rewards/rejected': -0.014989741146564484, 'rewards/accuracies': 0.671875, 'rewards/margins': 0.018057584762573242, 'logps/chosen': -59.02484893798828, 'logps/rejected': -97.65882873535156, 'logps/ref_chosen': -59.05763626098633, 'logps/ref_rejected': -97.50466918945312, 'logits/chosen': -0.6706408262252808, 'logits/rejected': -0.4660327434539795, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'epsilon_dpo/beta': 0.09816926717758179, 'epsilon_dpo/loss_margin_mean': 0.18695014715194702, 'epsilon_dpo/beta_margin_mean': 0.018057547509670258, 'epsilon_dpo/beta_margin_std': 0.04294878616929054, 'epsilon_dpo/beta_margin_grad_mean': -0.4954878091812134, 'epsilon_dpo/beta_margin_grad_std': 0.010731114074587822, 'kl/beta': 0.09849803894758224, 'kl/avg_steps': 0.34375, 'epoch': 0.03} 3%|██▌ | 22/681 [01:01<30:16, 2.76s/it] 3%|██▋ | 23/681 [01:04<31:17, 2.85s/it] {'loss': 1.367, 'grad_norm': 78.68445587158203, 'learning_rate': 1.5942028985507245e-07, 'rewards/chosen': 0.005194256082177162, 'rewards/rejected': -0.014648174867033958, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.01984243094921112, 'logps/chosen': -60.02301788330078, 'logps/rejected': -81.29054260253906, 'logps/ref_chosen': -60.07769775390625, 'logps/ref_rejected': -81.1395492553711, 
'logits/chosen': -0.612334132194519, 'logits/rejected': -0.5152074098587036, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.0978022888302803, 'epsilon_dpo/loss_margin_mean': 0.20566779375076294, 'epsilon_dpo/beta_margin_mean': 0.019842475652694702, 'epsilon_dpo/beta_margin_std': 0.04295789822936058, 'epsilon_dpo/beta_margin_grad_mean': -0.495043009519577, 'epsilon_dpo/beta_margin_grad_std': 0.010730421170592308, 'kl/beta': 0.09816060960292816, 'kl/avg_steps': 0.375, 'epoch': 0.03} 3%|██▋ | 23/681 [01:04<31:17, 2.85s/it] 4%|██▊ | 24/681 [01:07<31:08, 2.84s/it] {'loss': 1.3666, 'grad_norm': 88.37494659423828, 'learning_rate': 1.6666666666666665e-07, 'rewards/chosen': 0.003375026863068342, 'rewards/rejected': -0.016782548278570175, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.020157575607299805, 'logps/chosen': -44.25568771362305, 'logps/rejected': -99.29896545410156, 'logps/ref_chosen': -44.29103469848633, 'logps/ref_rejected': -99.12521362304688, 'logits/chosen': -0.6284483075141907, 'logits/rejected': -0.49340900778770447, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.09728408604860306, 'epsilon_dpo/loss_margin_mean': 0.20909583568572998, 'epsilon_dpo/beta_margin_mean': 0.020157571882009506, 'epsilon_dpo/beta_margin_std': 0.03706182911992073, 'epsilon_dpo/beta_margin_grad_mean': -0.4949635863304138, 'epsilon_dpo/beta_margin_grad_std': 0.009255305863916874, 'kl/beta': 0.09779388457536697, 'kl/avg_steps': 0.53125, 'epoch': 0.04} 4%|██▊ | 24/681 [01:07<31:08, 2.84s/it] 4%|██▉ | 25/681 [01:10<31:10, 2.85s/it] {'loss': 1.3675, 'grad_norm': 71.66119384765625, 'learning_rate': 1.7391304347826085e-07, 'rewards/chosen': 0.004545444622635841, 'rewards/rejected': -0.01495951134711504, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.019504955038428307, 'logps/chosen': -52.48844528198242, 'logps/rejected': -89.49798583984375, 'logps/ref_chosen': -52.537052154541016, 'logps/ref_rejected': 
-89.34219360351562, 'logits/chosen': -0.6306965351104736, 'logits/rejected': -0.4757160544395447, 'kl/p_epsilon_steps': 0.65625, 'kl/n_epsilon_steps': 0.34375, 'epsilon_dpo/beta': 0.09698280692100525, 'epsilon_dpo/loss_margin_mean': 0.20440703630447388, 'epsilon_dpo/beta_margin_mean': 0.019504927098751068, 'epsilon_dpo/beta_margin_std': 0.047829385846853256, 'epsilon_dpo/beta_margin_grad_mean': -0.4951268136501312, 'epsilon_dpo/beta_margin_grad_std': 0.011948227882385254, 'kl/beta': 0.09727709740400314, 'kl/avg_steps': 0.3125, 'epoch': 0.04} 4%|██▉ | 25/681 [01:10<31:10, 2.85s/it] 4%|███ | 26/681 [01:12<29:34, 2.71s/it] {'loss': 1.349, 'grad_norm': 84.22888946533203, 'learning_rate': 1.8115942028985507e-07, 'rewards/chosen': 0.009511815384030342, 'rewards/rejected': -0.028926612809300423, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.038438428193330765, 'logps/chosen': -53.822509765625, 'logps/rejected': -103.66108703613281, 'logps/ref_chosen': -53.92280578613281, 'logps/ref_rejected': -103.35971069335938, 'logits/chosen': -0.6835383176803589, 'logits/rejected': -0.5212547779083252, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.09646852314472198, 'epsilon_dpo/loss_margin_mean': 0.40167027711868286, 'epsilon_dpo/beta_margin_mean': 0.03843845799565315, 'epsilon_dpo/beta_margin_std': 0.05624152719974518, 'epsilon_dpo/beta_margin_grad_mean': -0.49040067195892334, 'epsilon_dpo/beta_margin_grad_std': 0.014042048715054989, 'kl/beta': 0.09697405248880386, 'kl/avg_steps': 0.53125, 'epoch': 0.04} 4%|███ | 26/681 [01:13<29:34, 2.71s/it] 4%|███▏ | 27/681 [01:15<29:11, 2.68s/it] {'loss': 1.3414, 'grad_norm': 89.76054382324219, 'learning_rate': 1.8840579710144927e-07, 'rewards/chosen': 0.008736366406083107, 'rewards/rejected': -0.0375773087143898, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.04631367325782776, 'logps/chosen': -42.80644989013672, 'logps/rejected': -99.11763000488281, 'logps/ref_chosen': -42.898529052734375, 
'logps/ref_rejected': -98.72420501708984, 'logits/chosen': -0.691501796245575, 'logits/rejected': -0.48829805850982666, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0959133729338646, 'epsilon_dpo/loss_margin_mean': 0.48550090193748474, 'epsilon_dpo/beta_margin_mean': 0.04631367698311806, 'epsilon_dpo/beta_margin_std': 0.05942818522453308, 'epsilon_dpo/beta_margin_grad_mean': -0.48844072222709656, 'epsilon_dpo/beta_margin_grad_std': 0.014806153252720833, 'kl/beta': 0.09646160155534744, 'kl/avg_steps': 0.578125, 'epoch': 0.04} 4%|███▏ | 27/681 [01:15<29:11, 2.68s/it] 4%|███▏ | 28/681 [01:18<29:26, 2.70s/it] {'loss': 1.3491, 'grad_norm': 71.43858337402344, 'learning_rate': 1.9565217391304347e-07, 'rewards/chosen': 0.005918354727327824, 'rewards/rejected': -0.03241448104381561, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.03833283483982086, 'logps/chosen': -60.492679595947266, 'logps/rejected': -91.74253845214844, 'logps/ref_chosen': -60.55650329589844, 'logps/ref_rejected': -91.40111541748047, 'logits/chosen': -0.7132616639137268, 'logits/rejected': -0.48149383068084717, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.09543713927268982, 'epsilon_dpo/loss_margin_mean': 0.4052448570728302, 'epsilon_dpo/beta_margin_mean': 0.038332872092723846, 'epsilon_dpo/beta_margin_std': 0.054671693593263626, 'epsilon_dpo/beta_margin_grad_mean': -0.4904250204563141, 'epsilon_dpo/beta_margin_grad_std': 0.013653564266860485, 'kl/beta': 0.09590713679790497, 'kl/avg_steps': 0.5, 'epoch': 0.04} 4%|███▏ | 28/681 [01:18<29:26, 2.70s/it] 4%|███▎ | 29/681 [01:20<28:14, 2.60s/it] {'loss': 1.3333, 'grad_norm': 86.3988037109375, 'learning_rate': 2.028985507246377e-07, 'rewards/chosen': 0.011695785447955132, 'rewards/rejected': -0.04282692074775696, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.05452270805835724, 'logps/chosen': -57.68357849121094, 'logps/rejected': -97.84791564941406, 'logps/ref_chosen': 
-57.80778503417969, 'logps/ref_rejected': -97.39434814453125, 'logits/chosen': -0.7569383382797241, 'logits/rejected': -0.5672882795333862, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.09478338062763214, 'epsilon_dpo/loss_margin_mean': 0.5777689218521118, 'epsilon_dpo/beta_margin_mean': 0.05452274531126022, 'epsilon_dpo/beta_margin_std': 0.05518447607755661, 'epsilon_dpo/beta_margin_grad_mean': -0.48638561367988586, 'epsilon_dpo/beta_margin_grad_std': 0.013761184178292751, 'kl/beta': 0.09542998671531677, 'kl/avg_steps': 0.6875, 'epoch': 0.04} 4%|███▎ | 29/681 [01:20<28:14, 2.60s/it] 4%|███▍ | 30/681 [01:23<28:57, 2.67s/it] {'loss': 1.3283, 'grad_norm': 82.63673400878906, 'learning_rate': 2.1014492753623187e-07, 'rewards/chosen': 0.012802567332983017, 'rewards/rejected': -0.04688173532485962, 'rewards/accuracies': 0.875, 'rewards/margins': 0.05968429893255234, 'logps/chosen': -52.4403076171875, 'logps/rejected': -98.98881530761719, 'logps/ref_chosen': -52.57737350463867, 'logps/ref_rejected': -98.48921203613281, 'logits/chosen': -0.6795220375061035, 'logits/rejected': -0.5398270487785339, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0940769612789154, 'epsilon_dpo/loss_margin_mean': 0.636677622795105, 'epsilon_dpo/beta_margin_mean': 0.059684351086616516, 'epsilon_dpo/beta_margin_std': 0.05674216151237488, 'epsilon_dpo/beta_margin_grad_mean': -0.4850979447364807, 'epsilon_dpo/beta_margin_grad_std': 0.014148331247270107, 'kl/beta': 0.0947783887386322, 'kl/avg_steps': 0.75, 'epoch': 0.04} 4%|███▍ | 30/681 [01:23<28:57, 2.67s/it] 5%|███▌ | 31/681 [01:26<29:28, 2.72s/it] {'loss': 1.3445, 'grad_norm': 63.605506896972656, 'learning_rate': 2.1739130434782607e-07, 'rewards/chosen': 0.006982723250985146, 'rewards/rejected': -0.036466218531131744, 'rewards/accuracies': 0.75, 'rewards/margins': 0.04344893991947174, 'logps/chosen': -63.730369567871094, 'logps/rejected': -73.28575897216797, 
'logps/ref_chosen': -63.806922912597656, 'logps/ref_rejected': -72.89400482177734, 'logits/chosen': -0.700573742389679, 'logits/rejected': -0.4872978627681732, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.09358243644237518, 'epsilon_dpo/loss_margin_mean': 0.468301922082901, 'epsilon_dpo/beta_margin_mean': 0.04344891756772995, 'epsilon_dpo/beta_margin_std': 0.0687476322054863, 'epsilon_dpo/beta_margin_grad_mean': -0.4891579747200012, 'epsilon_dpo/beta_margin_grad_std': 0.0171290785074234, 'kl/beta': 0.0940728411078453, 'kl/avg_steps': 0.53125, 'epoch': 0.05} 5%|███▌ | 31/681 [01:26<29:28, 2.72s/it] 5%|███▋ | 32/681 [01:29<30:04, 2.78s/it] {'loss': 1.3208, 'grad_norm': 76.32506561279297, 'learning_rate': 2.2463768115942027e-07, 'rewards/chosen': 0.019259147346019745, 'rewards/rejected': -0.0491609200835228, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.06842006742954254, 'logps/chosen': -62.531192779541016, 'logps/rejected': -89.84844207763672, 'logps/ref_chosen': -62.739524841308594, 'logps/ref_rejected': -89.3175048828125, 'logits/chosen': -0.6239430904388428, 'logits/rejected': -0.42263031005859375, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.09297093003988266, 'epsilon_dpo/loss_margin_mean': 0.7392706274986267, 'epsilon_dpo/beta_margin_mean': 0.06842009723186493, 'epsilon_dpo/beta_margin_std': 0.08292694389820099, 'epsilon_dpo/beta_margin_grad_mean': -0.4829469323158264, 'epsilon_dpo/beta_margin_grad_std': 0.020606767386198044, 'kl/beta': 0.09357572346925735, 'kl/avg_steps': 0.65625, 'epoch': 0.05} 5%|███▋ | 32/681 [01:29<30:04, 2.78s/it] 5%|███▊ | 33/681 [01:31<29:42, 2.75s/it] {'loss': 1.3327, 'grad_norm': 67.19760131835938, 'learning_rate': 2.318840579710145e-07, 'rewards/chosen': 0.00890004076063633, 'rewards/rejected': -0.04635683447122574, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.055256880819797516, 'logps/chosen': -53.162803649902344, 'logps/rejected': 
-88.3884506225586, 'logps/ref_chosen': -53.26097106933594, 'logps/ref_rejected': -87.8851318359375, 'logits/chosen': -0.5926761627197266, 'logits/rejected': -0.4174070358276367, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.09245194494724274, 'epsilon_dpo/loss_margin_mean': 0.6014795303344727, 'epsilon_dpo/beta_margin_mean': 0.05525689572095871, 'epsilon_dpo/beta_margin_std': 0.058518461883068085, 'epsilon_dpo/beta_margin_grad_mean': -0.4862017035484314, 'epsilon_dpo/beta_margin_grad_std': 0.014605310745537281, 'kl/beta': 0.09296563267707825, 'kl/avg_steps': 0.5625, 'epoch': 0.05} 5%|███▊ | 33/681 [01:31<29:42, 2.75s/it] 5%|███▉ | 34/681 [01:34<29:35, 2.74s/it] {'loss': 1.3206, 'grad_norm': 71.88612365722656, 'learning_rate': 2.391304347826087e-07, 'rewards/chosen': 0.005410528276115656, 'rewards/rejected': -0.06295688450336456, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.06836740672588348, 'logps/chosen': -50.7568359375, 'logps/rejected': -102.609375, 'logps/ref_chosen': -50.81732940673828, 'logps/ref_rejected': -101.92184448242188, 'logits/chosen': -0.5952507257461548, 'logits/rejected': -0.52412348985672, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0917903482913971, 'epsilon_dpo/loss_margin_mean': 0.7480142116546631, 'epsilon_dpo/beta_margin_mean': 0.06836734712123871, 'epsilon_dpo/beta_margin_std': 0.07685627788305283, 'epsilon_dpo/beta_margin_grad_mean': -0.48294445872306824, 'epsilon_dpo/beta_margin_grad_std': 0.019153540953993797, 'kl/beta': 0.09244562685489655, 'kl/avg_steps': 0.71875, 'epoch': 0.05} 5%|███▉ | 34/681 [01:34<29:35, 2.74s/it] 5%|████ | 35/681 [01:37<29:53, 2.78s/it] {'loss': 1.2792, 'grad_norm': 75.97843933105469, 'learning_rate': 2.463768115942029e-07, 'rewards/chosen': 0.014054520055651665, 'rewards/rejected': -0.09903251379728317, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.11308702826499939, 'logps/chosen': -50.86943054199219, 
'logps/rejected': -107.9139175415039, 'logps/ref_chosen': -51.02449035644531, 'logps/ref_rejected': -106.82443237304688, 'logits/chosen': -0.7257874608039856, 'logits/rejected': -0.461169570684433, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.09099189192056656, 'epsilon_dpo/loss_margin_mean': 1.2445440292358398, 'epsilon_dpo/beta_margin_mean': 0.1130870133638382, 'epsilon_dpo/beta_margin_std': 0.10551401227712631, 'epsilon_dpo/beta_margin_grad_mean': -0.4718893766403198, 'epsilon_dpo/beta_margin_grad_std': 0.02585284784436226, 'kl/beta': 0.09178591519594193, 'kl/avg_steps': 0.875, 'epoch': 0.05} 5%|████ | 35/681 [01:37<29:53, 2.78s/it] 5%|████▏ | 36/681 [01:40<29:50, 2.78s/it] {'loss': 1.2871, 'grad_norm': 66.66797637939453, 'learning_rate': 2.536231884057971e-07, 'rewards/chosen': 0.004147795960307121, 'rewards/rejected': -0.10065056383609772, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.10479836165904999, 'logps/chosen': -51.944091796875, 'logps/rejected': -87.15771484375, 'logps/ref_chosen': -51.991493225097656, 'logps/ref_rejected': -86.04061889648438, 'logits/chosen': -0.735052227973938, 'logits/rejected': -0.5852710604667664, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.09040167182683945, 'epsilon_dpo/loss_margin_mean': 1.164489507675171, 'epsilon_dpo/beta_margin_mean': 0.1047983318567276, 'epsilon_dpo/beta_margin_std': 0.10706175863742828, 'epsilon_dpo/beta_margin_grad_mean': -0.47390487790107727, 'epsilon_dpo/beta_margin_grad_std': 0.026591215282678604, 'kl/beta': 0.09098975360393524, 'kl/avg_steps': 0.65625, 'epoch': 0.05} 5%|████▏ | 36/681 [01:40<29:50, 2.78s/it] 5%|████▎ | 37/681 [01:43<29:33, 2.75s/it] {'loss': 1.2955, 'grad_norm': 56.18361282348633, 'learning_rate': 2.6086956521739126e-07, 'rewards/chosen': 0.004034676589071751, 'rewards/rejected': -0.09331650286912918, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.09735117852687836, 'logps/chosen': 
-62.758453369140625, 'logps/rejected': -78.9359130859375, 'logps/ref_chosen': -62.807106018066406, 'logps/ref_rejected': -77.89507293701172, 'logits/chosen': -0.7193362712860107, 'logits/rejected': -0.4781792163848877, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.08986878395080566, 'epsilon_dpo/loss_margin_mean': 1.0894930362701416, 'epsilon_dpo/beta_margin_mean': 0.09735116362571716, 'epsilon_dpo/beta_margin_std': 0.12976804375648499, 'epsilon_dpo/beta_margin_grad_mean': -0.47582224011421204, 'epsilon_dpo/beta_margin_grad_std': 0.0320659838616848, 'kl/beta': 0.09039653092622757, 'kl/avg_steps': 0.59375, 'epoch': 0.05} 5%|████▎ | 37/681 [01:43<29:33, 2.75s/it] 6%|████▍ | 38/681 [01:45<28:15, 2.64s/it] {'loss': 1.2741, 'grad_norm': 63.268428802490234, 'learning_rate': 2.681159420289855e-07, 'rewards/chosen': 0.012151028029620647, 'rewards/rejected': -0.10885559767484665, 'rewards/accuracies': 0.875, 'rewards/margins': 0.12100662291049957, 'logps/chosen': -48.251590728759766, 'logps/rejected': -99.13468170166016, 'logps/ref_chosen': -48.39051818847656, 'logps/ref_rejected': -97.91244506835938, 'logits/chosen': -0.6519103646278381, 'logits/rejected': -0.4997590184211731, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.08922599256038666, 'epsilon_dpo/loss_margin_mean': 1.361170768737793, 'epsilon_dpo/beta_margin_mean': 0.12100663781166077, 'epsilon_dpo/beta_margin_std': 0.14498375356197357, 'epsilon_dpo/beta_margin_grad_mean': -0.4700261652469635, 'epsilon_dpo/beta_margin_grad_std': 0.035584937781095505, 'kl/beta': 0.08986296504735947, 'kl/avg_steps': 0.71875, 'epoch': 0.06} 6%|████▍ | 38/681 [01:45<28:15, 2.64s/it] 6%|████▌ | 39/681 [01:47<28:09, 2.63s/it] {'loss': 1.2513, 'grad_norm': 65.90448760986328, 'learning_rate': 2.753623188405797e-07, 'rewards/chosen': 0.007626615464687347, 'rewards/rejected': -0.13655325770378113, 'rewards/accuracies': 0.921875, 'rewards/margins': 
0.14417986571788788, 'logps/chosen': -50.66172409057617, 'logps/rejected': -80.11408996582031, 'logps/ref_chosen': -50.75046920776367, 'logps/ref_rejected': -78.56951141357422, 'logits/chosen': -0.7812178134918213, 'logits/rejected': -0.5999255776405334, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.088477723300457, 'epsilon_dpo/loss_margin_mean': 1.6333224773406982, 'epsilon_dpo/beta_margin_mean': 0.14417992532253265, 'epsilon_dpo/beta_margin_std': 0.12662391364574432, 'epsilon_dpo/beta_margin_grad_mean': -0.46417590975761414, 'epsilon_dpo/beta_margin_grad_std': 0.03130076080560684, 'kl/beta': 0.08922168612480164, 'kl/avg_steps': 0.84375, 'epoch': 0.06} 6%|████▌ | 39/681 [01:48<28:09, 2.63s/it] 6%|████▋ | 40/681 [01:50<28:48, 2.70s/it] {'loss': 1.2612, 'grad_norm': 53.64518356323242, 'learning_rate': 2.8260869565217386e-07, 'rewards/chosen': 0.017830543220043182, 'rewards/rejected': -0.1178567111492157, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.13568724691867828, 'logps/chosen': -57.77972412109375, 'logps/rejected': -75.64521789550781, 'logps/ref_chosen': -57.985069274902344, 'logps/ref_rejected': -74.30007934570312, 'logits/chosen': -0.6166332364082336, 'logits/rejected': -0.45730146765708923, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.08793099224567413, 'epsilon_dpo/loss_margin_mean': 1.5504910945892334, 'epsilon_dpo/beta_margin_mean': 0.13568727672100067, 'epsilon_dpo/beta_margin_std': 0.15566600859165192, 'epsilon_dpo/beta_margin_grad_mean': -0.4664282202720642, 'epsilon_dpo/beta_margin_grad_std': 0.038107842206954956, 'kl/beta': 0.08847517520189285, 'kl/avg_steps': 0.625, 'epoch': 0.06} 6%|████▋ | 40/681 [01:50<28:48, 2.70s/it] 6%|████▊ | 41/681 [01:53<28:37, 2.68s/it] {'loss': 1.2339, 'grad_norm': 60.75354766845703, 'learning_rate': 2.898550724637681e-07, 'rewards/chosen': 0.0033399879466742277, 'rewards/rejected': -0.16410601139068604, 'rewards/accuracies': 0.875, 
'rewards/margins': 0.16744601726531982, 'logps/chosen': -62.65257263183594, 'logps/rejected': -98.90574645996094, 'logps/ref_chosen': -62.69581604003906, 'logps/ref_rejected': -97.02352905273438, 'logits/chosen': -0.7145446538925171, 'logits/rejected': -0.5281996130943298, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.087274931371212, 'epsilon_dpo/loss_margin_mean': 1.9254682064056396, 'epsilon_dpo/beta_margin_mean': 0.16744600236415863, 'epsilon_dpo/beta_margin_std': 0.18128535151481628, 'epsilon_dpo/beta_margin_grad_mean': -0.4586746394634247, 'epsilon_dpo/beta_margin_grad_std': 0.0441967137157917, 'kl/beta': 0.08792564272880554, 'kl/avg_steps': 0.75, 'epoch': 0.06} 6%|████▊ | 41/681 [01:53<28:37, 2.68s/it] 6%|████▊ | 42/681 [01:56<28:23, 2.67s/it] {'loss': 1.1821, 'grad_norm': 70.50254821777344, 'learning_rate': 2.971014492753623e-07, 'rewards/chosen': 0.022369947284460068, 'rewards/rejected': -0.2060697227716446, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.22843967378139496, 'logps/chosen': -58.705352783203125, 'logps/rejected': -112.29192352294922, 'logps/ref_chosen': -58.96642303466797, 'logps/ref_rejected': -109.90837097167969, 'logits/chosen': -0.748748779296875, 'logits/rejected': -0.471982479095459, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.08657068759202957, 'epsilon_dpo/loss_margin_mean': 2.6446244716644287, 'epsilon_dpo/beta_margin_mean': 0.22843961417675018, 'epsilon_dpo/beta_margin_std': 0.2150142788887024, 'epsilon_dpo/beta_margin_grad_mean': -0.4439954161643982, 'epsilon_dpo/beta_margin_grad_std': 0.051451511681079865, 'kl/beta': 0.08727110922336578, 'kl/avg_steps': 0.8125, 'epoch': 0.06} 6%|████▊ | 42/681 [01:56<28:23, 2.67s/it] 6%|████▉ | 43/681 [01:58<28:29, 2.68s/it] {'loss': 1.1901, 'grad_norm': 63.363746643066406, 'learning_rate': 3.043478260869565e-07, 'rewards/chosen': 0.047858819365501404, 'rewards/rejected': -0.1670626699924469, 'rewards/accuracies': 
0.9375, 'rewards/margins': 0.2149214893579483, 'logps/chosen': -53.596961975097656, 'logps/rejected': -98.42918395996094, 'logps/ref_chosen': -54.15599822998047, 'logps/ref_rejected': -96.48019409179688, 'logits/chosen': -0.6911792755126953, 'logits/rejected': -0.5561075210571289, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.08581885695457458, 'epsilon_dpo/loss_margin_mean': 2.508019208908081, 'epsilon_dpo/beta_margin_mean': 0.21492145955562592, 'epsilon_dpo/beta_margin_std': 0.17198152840137482, 'epsilon_dpo/beta_margin_grad_mean': -0.44696906208992004, 'epsilon_dpo/beta_margin_grad_std': 0.04155328497290611, 'kl/beta': 0.08656774461269379, 'kl/avg_steps': 0.875, 'epoch': 0.06} 6%|████▉ | 43/681 [01:58<28:29, 2.68s/it] 6%|█████ | 44/681 [02:01<28:45, 2.71s/it] {'loss': 1.1698, 'grad_norm': 69.75694274902344, 'learning_rate': 3.115942028985507e-07, 'rewards/chosen': 0.019421285018324852, 'rewards/rejected': -0.21991033852100372, 'rewards/accuracies': 0.953125, 'rewards/margins': 0.23933161795139313, 'logps/chosen': -49.84899139404297, 'logps/rejected': -111.37159729003906, 'logps/ref_chosen': -50.07849884033203, 'logps/ref_rejected': -108.78376007080078, 'logits/chosen': -0.7419127225875854, 'logits/rejected': -0.561978280544281, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0850476324558258, 'epsilon_dpo/loss_margin_mean': 2.8173367977142334, 'epsilon_dpo/beta_margin_mean': 0.23933164775371552, 'epsilon_dpo/beta_margin_std': 0.18726012110710144, 'epsilon_dpo/beta_margin_grad_mean': -0.44107890129089355, 'epsilon_dpo/beta_margin_grad_std': 0.04519936442375183, 'kl/beta': 0.08581684529781342, 'kl/avg_steps': 0.90625, 'epoch': 0.06} 6%|█████ | 44/681 [02:01<28:45, 2.71s/it] 7%|█████▏ | 45/681 [02:04<28:46, 2.72s/it] {'loss': 1.2179, 'grad_norm': 54.971275329589844, 'learning_rate': 3.188405797101449e-07, 'rewards/chosen': 0.009439542889595032, 'rewards/rejected': -0.1789076179265976, 
'rewards/accuracies': 0.875, 'rewards/margins': 0.18834716081619263, 'logps/chosen': -48.300140380859375, 'logps/rejected': -80.06005859375, 'logps/ref_chosen': -48.41493225097656, 'logps/ref_rejected': -77.93643188476562, 'logits/chosen': -0.5838513970375061, 'logits/rejected': -0.5171458721160889, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.08444329351186752, 'epsilon_dpo/loss_margin_mean': 2.238414764404297, 'epsilon_dpo/beta_margin_mean': 0.18834719061851501, 'epsilon_dpo/beta_margin_std': 0.21307416260242462, 'epsilon_dpo/beta_margin_grad_mean': -0.45379340648651123, 'epsilon_dpo/beta_margin_grad_std': 0.05130607634782791, 'kl/beta': 0.08504611998796463, 'kl/avg_steps': 0.71875, 'epoch': 0.07} 7%|█████▏ | 45/681 [02:04<28:46, 2.72s/it] 7%|█████▎ | 46/681 [02:07<29:05, 2.75s/it] {'loss': 1.1696, 'grad_norm': 61.659027099609375, 'learning_rate': 3.260869565217391e-07, 'rewards/chosen': 0.020097916945815086, 'rewards/rejected': -0.2292776107788086, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.24937552213668823, 'logps/chosen': -55.753944396972656, 'logps/rejected': -98.39190673828125, 'logps/ref_chosen': -55.999427795410156, 'logps/ref_rejected': -95.652587890625, 'logits/chosen': -0.7949700355529785, 'logits/rejected': -0.5555962920188904, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.08384069055318832, 'epsilon_dpo/loss_margin_mean': 2.984797954559326, 'epsilon_dpo/beta_margin_mean': 0.24937555193901062, 'epsilon_dpo/beta_margin_std': 0.26771342754364014, 'epsilon_dpo/beta_margin_grad_mean': -0.4394093155860901, 'epsilon_dpo/beta_margin_grad_std': 0.0630989596247673, 'kl/beta': 0.08443921059370041, 'kl/avg_steps': 0.71875, 'epoch': 0.07} 7%|█████▎ | 46/681 [02:07<29:05, 2.75s/it] 7%|█████▍ | 47/681 [02:09<29:01, 2.75s/it] {'loss': 1.1642, 'grad_norm': 56.881492614746094, 'learning_rate': 3.333333333333333e-07, 'rewards/chosen': 0.03409276157617569, 'rewards/rejected': 
-0.21639610826969147, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.25048884749412537, 'logps/chosen': -57.51045608520508, 'logps/rejected': -97.28176879882812, 'logps/ref_chosen': -57.92607879638672, 'logps/ref_rejected': -94.67920684814453, 'logits/chosen': -0.7932885885238647, 'logits/rejected': -0.51214599609375, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.08320295065641403, 'epsilon_dpo/loss_margin_mean': 3.0181827545166016, 'epsilon_dpo/beta_margin_mean': 0.25048893690109253, 'epsilon_dpo/beta_margin_std': 0.22870446741580963, 'epsilon_dpo/beta_margin_grad_mean': -0.4386330842971802, 'epsilon_dpo/beta_margin_grad_std': 0.055102963000535965, 'kl/beta': 0.0838366374373436, 'kl/avg_steps': 0.765625, 'epoch': 0.07} 7%|█████▍ | 47/681 [02:09<29:01, 2.75s/it] 7%|█████▌ | 48/681 [02:12<29:25, 2.79s/it] {'loss': 1.1586, 'grad_norm': 64.7890396118164, 'learning_rate': 3.4057971014492755e-07, 'rewards/chosen': 0.005281925667077303, 'rewards/rejected': -0.2570808529853821, 'rewards/accuracies': 0.875, 'rewards/margins': 0.2623627781867981, 'logps/chosen': -57.117889404296875, 'logps/rejected': -91.13453674316406, 'logps/ref_chosen': -57.188072204589844, 'logps/ref_rejected': -88.0166015625, 'logits/chosen': -0.8307114839553833, 'logits/rejected': -0.577639102935791, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.08260989934206009, 'epsilon_dpo/loss_margin_mean': 3.188123941421509, 'epsilon_dpo/beta_margin_mean': 0.2623628079891205, 'epsilon_dpo/beta_margin_std': 0.26902657747268677, 'epsilon_dpo/beta_margin_grad_mean': -0.43596893548965454, 'epsilon_dpo/beta_margin_grad_std': 0.06429051607847214, 'kl/beta': 0.0831996351480484, 'kl/avg_steps': 0.71875, 'epoch': 0.07} 7%|█████▌ | 48/681 [02:12<29:25, 2.79s/it] 7%|█████▋ | 49/681 [02:15<28:55, 2.75s/it] {'loss': 1.125, 'grad_norm': 55.275516510009766, 'learning_rate': 3.478260869565217e-07, 'rewards/chosen': 0.02468142658472061, 
'rewards/rejected': -0.285413533449173, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.310094952583313, 'logps/chosen': -61.376583099365234, 'logps/rejected': -87.25360107421875, 'logps/ref_chosen': -61.685264587402344, 'logps/ref_rejected': -83.76747131347656, 'logits/chosen': -0.7270078659057617, 'logits/rejected': -0.5109246969223022, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0819687470793724, 'epsilon_dpo/loss_margin_mean': 3.7948062419891357, 'epsilon_dpo/beta_margin_mean': 0.3100949227809906, 'epsilon_dpo/beta_margin_std': 0.3251339793205261, 'epsilon_dpo/beta_margin_grad_mean': -0.42563095688819885, 'epsilon_dpo/beta_margin_grad_std': 0.07447288185358047, 'kl/beta': 0.08260590583086014, 'kl/avg_steps': 0.78125, 'epoch': 0.07} 7%|█████▋ | 49/681 [02:15<28:55, 2.75s/it] 7%|█████▊ | 50/681 [02:18<28:57, 2.75s/it] {'loss': 1.1024, 'grad_norm': 54.424171447753906, 'learning_rate': 3.5507246376811595e-07, 'rewards/chosen': -0.014449729584157467, 'rewards/rejected': -0.35226207971572876, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.3378123641014099, 'logps/chosen': -58.89253234863281, 'logps/rejected': -100.69210815429688, 'logps/ref_chosen': -58.72413635253906, 'logps/ref_rejected': -96.35814666748047, 'logits/chosen': -0.7568400502204895, 'logits/rejected': -0.5175820589065552, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.08143579214811325, 'epsilon_dpo/loss_margin_mean': 4.165571689605713, 'epsilon_dpo/beta_margin_mean': 0.3378123939037323, 'epsilon_dpo/beta_margin_std': 0.32804766297340393, 'epsilon_dpo/beta_margin_grad_mean': -0.418803870677948, 'epsilon_dpo/beta_margin_grad_std': 0.07662991434335709, 'kl/beta': 0.08196555078029633, 'kl/avg_steps': 0.65625, 'epoch': 0.07} 7%|█████▊ | 50/681 [02:18<28:57, 2.75s/it] 7%|█████▉ | 51/681 [02:20<28:59, 2.76s/it] {'loss': 1.1207, 'grad_norm': 46.214134216308594, 'learning_rate': 3.6231884057971015e-07, 'rewards/chosen': 
-0.02197762206196785, 'rewards/rejected': -0.35597023367881775, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.333992600440979, 'logps/chosen': -61.633811950683594, 'logps/rejected': -80.40904235839844, 'logps/ref_chosen': -61.3736686706543, 'logps/ref_rejected': -76.00199890136719, 'logits/chosen': -0.8009949922561646, 'logits/rejected': -0.6197192072868347, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.08103210479021072, 'epsilon_dpo/loss_margin_mean': 4.146895885467529, 'epsilon_dpo/beta_margin_mean': 0.33399254083633423, 'epsilon_dpo/beta_margin_std': 0.41953185200691223, 'epsilon_dpo/beta_margin_grad_mean': -0.42181292176246643, 'epsilon_dpo/beta_margin_grad_std': 0.09374556690454483, 'kl/beta': 0.08143115788698196, 'kl/avg_steps': 0.5, 'epoch': 0.07} 7%|█████▉ | 51/681 [02:21<28:59, 2.76s/it] 8%|██████ | 52/681 [02:23<28:21, 2.71s/it] {'loss': 0.9909, 'grad_norm': 53.337032318115234, 'learning_rate': 3.695652173913043e-07, 'rewards/chosen': 0.025913957506418228, 'rewards/rejected': -0.46740520000457764, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.49331915378570557, 'logps/chosen': -52.00531768798828, 'logps/rejected': -85.79180908203125, 'logps/ref_chosen': -52.33735656738281, 'logps/ref_rejected': -79.97391510009766, 'logits/chosen': -0.8910727500915527, 'logits/rejected': -0.6129493713378906, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.08042638003826141, 'epsilon_dpo/loss_margin_mean': 6.149931907653809, 'epsilon_dpo/beta_margin_mean': 0.49331915378570557, 'epsilon_dpo/beta_margin_std': 0.40902501344680786, 'epsilon_dpo/beta_margin_grad_mean': -0.3843227028846741, 'epsilon_dpo/beta_margin_grad_std': 0.08970463275909424, 'kl/beta': 0.08102603256702423, 'kl/avg_steps': 0.75, 'epoch': 0.08} 8%|██████ | 52/681 [02:23<28:21, 2.71s/it] 8%|██████▏ | 53/681 [02:26<28:14, 2.70s/it] {'loss': 1.0123, 'grad_norm': 51.91477966308594, 'learning_rate': 3.7681159420289855e-07, 
'rewards/chosen': -0.017171800136566162, 'rewards/rejected': -0.513559103012085, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.4963873326778412, 'logps/chosen': -53.52302551269531, 'logps/rejected': -98.22883605957031, 'logps/ref_chosen': -53.31465530395508, 'logps/ref_rejected': -91.7835922241211, 'logits/chosen': -0.6966301202774048, 'logits/rejected': -0.5892688632011414, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.07980253547430038, 'epsilon_dpo/loss_margin_mean': 6.23687219619751, 'epsilon_dpo/beta_margin_mean': 0.4963873624801636, 'epsilon_dpo/beta_margin_std': 0.5382514595985413, 'epsilon_dpo/beta_margin_grad_mean': -0.3882746994495392, 'epsilon_dpo/beta_margin_grad_std': 0.1083177998661995, 'kl/beta': 0.08042285591363907, 'kl/avg_steps': 0.78125, 'epoch': 0.08} 8%|██████▏ | 53/681 [02:26<28:14, 2.70s/it] 8%|██████▎ | 54/681 [02:28<27:29, 2.63s/it] {'loss': 1.0396, 'grad_norm': 47.81783676147461, 'learning_rate': 3.8405797101449274e-07, 'rewards/chosen': -0.03115382045507431, 'rewards/rejected': -0.461276113986969, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.4301223158836365, 'logps/chosen': -51.0803337097168, 'logps/rejected': -97.55059051513672, 'logps/ref_chosen': -50.68865966796875, 'logps/ref_rejected': -91.71539306640625, 'logits/chosen': -0.9264237880706787, 'logits/rejected': -0.6701672077178955, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0791962519288063, 'epsilon_dpo/loss_margin_mean': 5.4435248374938965, 'epsilon_dpo/beta_margin_mean': 0.4301222860813141, 'epsilon_dpo/beta_margin_std': 0.4083307385444641, 'epsilon_dpo/beta_margin_grad_mean': -0.3992304503917694, 'epsilon_dpo/beta_margin_grad_std': 0.0889367163181305, 'kl/beta': 0.07979942858219147, 'kl/avg_steps': 0.765625, 'epoch': 0.08} 8%|██████▎ | 54/681 [02:28<27:29, 2.63s/it] 8%|██████▍ | 55/681 [02:31<26:28, 2.54s/it] {'loss': 1.014, 'grad_norm': 46.457801818847656, 'learning_rate': 
3.9130434782608694e-07, 'rewards/chosen': -0.07498433440923691, 'rewards/rejected': -0.5891005992889404, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.5141162872314453, 'logps/chosen': -63.550445556640625, 'logps/rejected': -96.49528503417969, 'logps/ref_chosen': -62.615234375, 'logps/ref_rejected': -88.99349975585938, 'logits/chosen': -0.9492754340171814, 'logits/rejected': -0.7138346433639526, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.07875551283359528, 'epsilon_dpo/loss_margin_mean': 6.566575527191162, 'epsilon_dpo/beta_margin_mean': 0.5141162872314453, 'epsilon_dpo/beta_margin_std': 0.5916620492935181, 'epsilon_dpo/beta_margin_grad_mean': -0.38479316234588623, 'epsilon_dpo/beta_margin_grad_std': 0.12403902411460876, 'kl/beta': 0.0791931003332138, 'kl/avg_steps': 0.5625, 'epoch': 0.08} 8%|██████▍ | 55/681 [02:31<26:28, 2.54s/it] 8%|██████▍ | 56/681 [02:33<27:28, 2.64s/it] {'loss': 1.037, 'grad_norm': 42.523895263671875, 'learning_rate': 3.9855072463768114e-07, 'rewards/chosen': -0.059790801256895065, 'rewards/rejected': -0.543065071105957, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.4832742214202881, 'logps/chosen': -58.679595947265625, 'logps/rejected': -101.12921142578125, 'logps/ref_chosen': -57.93273162841797, 'logps/ref_rejected': -94.1744384765625, 'logits/chosen': -0.8505500555038452, 'logits/rejected': -0.6876404881477356, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0783396065235138, 'epsilon_dpo/loss_margin_mean': 6.207910060882568, 'epsilon_dpo/beta_margin_mean': 0.48327428102493286, 'epsilon_dpo/beta_margin_std': 0.5860716104507446, 'epsilon_dpo/beta_margin_grad_mean': -0.3912721276283264, 'epsilon_dpo/beta_margin_grad_std': 0.1250195950269699, 'kl/beta': 0.0787501335144043, 'kl/avg_steps': 0.53125, 'epoch': 0.08} 8%|██████▍ | 56/681 [02:33<27:28, 2.64s/it] 8%|██████▌ | 57/681 [02:36<27:13, 2.62s/it] {'loss': 0.9777, 'grad_norm': 47.585750579833984, 
'learning_rate': 4.057971014492754e-07, 'rewards/chosen': -0.06252375990152359, 'rewards/rejected': -0.6061526536941528, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.5436288714408875, 'logps/chosen': -71.28856658935547, 'logps/rejected': -103.37579345703125, 'logps/ref_chosen': -70.49528503417969, 'logps/ref_rejected': -95.56546020507812, 'logits/chosen': -0.7789304852485657, 'logits/rejected': -0.6877784729003906, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.07772976905107498, 'epsilon_dpo/loss_margin_mean': 7.017059803009033, 'epsilon_dpo/beta_margin_mean': 0.5436288714408875, 'epsilon_dpo/beta_margin_std': 0.5275634527206421, 'epsilon_dpo/beta_margin_grad_mean': -0.37548625469207764, 'epsilon_dpo/beta_margin_grad_std': 0.11424616724252701, 'kl/beta': 0.07833398133516312, 'kl/avg_steps': 0.78125, 'epoch': 0.08} 8%|██████▌ | 57/681 [02:36<27:13, 2.62s/it] 9%|██████▋ | 58/681 [02:39<27:38, 2.66s/it] {'loss': 0.9617, 'grad_norm': 50.20505142211914, 'learning_rate': 4.1304347826086954e-07, 'rewards/chosen': -0.08548370748758316, 'rewards/rejected': -0.6857439875602722, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6002602577209473, 'logps/chosen': -63.216793060302734, 'logps/rejected': -93.5071792602539, 'logps/ref_chosen': -62.13294219970703, 'logps/ref_rejected': -84.61729431152344, 'logits/chosen': -0.9927153587341309, 'logits/rejected': -0.7306280732154846, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.07718782871961594, 'epsilon_dpo/loss_margin_mean': 7.806028366088867, 'epsilon_dpo/beta_margin_mean': 0.600260317325592, 'epsilon_dpo/beta_margin_std': 0.6401125192642212, 'epsilon_dpo/beta_margin_grad_mean': -0.36823755502700806, 'epsilon_dpo/beta_margin_grad_std': 0.1302955597639084, 'kl/beta': 0.07772674411535263, 'kl/avg_steps': 0.703125, 'epoch': 0.09} 9%|██████▋ | 58/681 [02:39<27:38, 2.66s/it] 9%|██████▊ | 59/681 [02:41<27:39, 2.67s/it] {'loss': 0.9341, 
'grad_norm': 48.971317291259766, 'learning_rate': 4.2028985507246374e-07, 'rewards/chosen': -0.1215255856513977, 'rewards/rejected': -0.7741378545761108, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6526123285293579, 'logps/chosen': -53.49982452392578, 'logps/rejected': -98.99909973144531, 'logps/ref_chosen': -51.932525634765625, 'logps/ref_rejected': -88.88520050048828, 'logits/chosen': -0.9326772689819336, 'logits/rejected': -0.7697768211364746, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.07668519020080566, 'epsilon_dpo/loss_margin_mean': 8.546601295471191, 'epsilon_dpo/beta_margin_mean': 0.6526122689247131, 'epsilon_dpo/beta_margin_std': 0.6807378530502319, 'epsilon_dpo/beta_margin_grad_mean': -0.35859009623527527, 'epsilon_dpo/beta_margin_grad_std': 0.13512957096099854, 'kl/beta': 0.0771840438246727, 'kl/avg_steps': 0.65625, 'epoch': 0.09} 9%|██████▊ | 59/681 [02:41<27:39, 2.67s/it] 9%|██████▉ | 60/681 [02:44<27:27, 2.65s/it] {'loss': 1.0221, 'grad_norm': 53.07661819458008, 'learning_rate': 4.2753623188405794e-07, 'rewards/chosen': -0.21674984693527222, 'rewards/rejected': -0.7288058996200562, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.5120559930801392, 'logps/chosen': -63.75890350341797, 'logps/rejected': -94.97213745117188, 'logps/ref_chosen': -60.94218444824219, 'logps/ref_rejected': -85.39340209960938, 'logits/chosen': -0.9355441927909851, 'logits/rejected': -0.6807034611701965, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0762091875076294, 'epsilon_dpo/loss_margin_mean': 6.76201868057251, 'epsilon_dpo/beta_margin_mean': 0.5120560526847839, 'epsilon_dpo/beta_margin_std': 0.606110155582428, 'epsilon_dpo/beta_margin_grad_mean': -0.38297078013420105, 'epsilon_dpo/beta_margin_grad_std': 0.13203725218772888, 'kl/beta': 0.07668082416057587, 'kl/avg_steps': 0.625, 'epoch': 0.09} 9%|██████▉ | 60/681 [02:44<27:27, 2.65s/it] 9%|███████ | 61/681 [02:47<27:40, 2.68s/it] 
{'loss': 0.9906, 'grad_norm': 45.54325485229492, 'learning_rate': 4.3478260869565214e-07, 'rewards/chosen': -0.12385329604148865, 'rewards/rejected': -0.761780858039856, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.6379275321960449, 'logps/chosen': -62.2430419921875, 'logps/rejected': -99.93009948730469, 'logps/ref_chosen': -60.633522033691406, 'logps/ref_rejected': -89.85249328613281, 'logits/chosen': -0.865566611289978, 'logits/rejected': -0.6939293146133423, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.07587873935699463, 'epsilon_dpo/loss_margin_mean': 8.468091011047363, 'epsilon_dpo/beta_margin_mean': 0.6379275918006897, 'epsilon_dpo/beta_margin_std': 0.8875248432159424, 'epsilon_dpo/beta_margin_grad_mean': -0.37099823355674744, 'epsilon_dpo/beta_margin_grad_std': 0.15391331911087036, 'kl/beta': 0.07620454579591751, 'kl/avg_steps': 0.4375, 'epoch': 0.09} 9%|███████ | 61/681 [02:47<27:40, 2.68s/it] 9%|███████▏ | 62/681 [02:50<28:09, 2.73s/it] {'loss': 1.058, 'grad_norm': 45.17091751098633, 'learning_rate': 4.420289855072464e-07, 'rewards/chosen': -0.13497035205364227, 'rewards/rejected': -0.6089706420898438, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.4740002751350403, 'logps/chosen': -57.91468811035156, 'logps/rejected': -83.65602111816406, 'logps/ref_chosen': -56.15077209472656, 'logps/ref_rejected': -75.56619262695312, 'logits/chosen': -0.8267738223075867, 'logits/rejected': -0.6903345584869385, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.07542965561151505, 'epsilon_dpo/loss_margin_mean': 6.3259053230285645, 'epsilon_dpo/beta_margin_mean': 0.47400033473968506, 'epsilon_dpo/beta_margin_std': 0.6434755921363831, 'epsilon_dpo/beta_margin_grad_mean': -0.3952219784259796, 'epsilon_dpo/beta_margin_grad_std': 0.1335650235414505, 'kl/beta': 0.07587260752916336, 'kl/avg_steps': 0.59375, 'epoch': 0.09} 9%|███████▏ | 62/681 [02:50<28:09, 2.73s/it] 9%|███████▎ | 63/681 
[02:52<27:58, 2.72s/it] {'loss': 0.9518, 'grad_norm': 47.9777717590332, 'learning_rate': 4.4927536231884053e-07, 'rewards/chosen': -0.21734055876731873, 'rewards/rejected': -0.8549933433532715, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6376527547836304, 'logps/chosen': -76.01518249511719, 'logps/rejected': -109.02705383300781, 'logps/ref_chosen': -73.14739227294922, 'logps/ref_rejected': -97.61006164550781, 'logits/chosen': -0.9301478862762451, 'logits/rejected': -0.719444751739502, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.07498443126678467, 'epsilon_dpo/loss_margin_mean': 8.549195289611816, 'epsilon_dpo/beta_margin_mean': 0.6376527547836304, 'epsilon_dpo/beta_margin_std': 0.6982914805412292, 'epsilon_dpo/beta_margin_grad_mean': -0.3600352704524994, 'epsilon_dpo/beta_margin_grad_std': 0.1420409381389618, 'kl/beta': 0.07542476803064346, 'kl/avg_steps': 0.59375, 'epoch': 0.09} 9%|███████▎ | 63/681 [02:52<27:58, 2.72s/it] 9%|███████▍ | 64/681 [02:55<27:22, 2.66s/it] {'loss': 0.9192, 'grad_norm': 43.91283416748047, 'learning_rate': 4.5652173913043473e-07, 'rewards/chosen': -0.08719173073768616, 'rewards/rejected': -0.8378149271011353, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7506231665611267, 'logps/chosen': -55.14314651489258, 'logps/rejected': -104.79847717285156, 'logps/ref_chosen': -53.99859619140625, 'logps/ref_rejected': -93.53020477294922, 'logits/chosen': -0.8505758047103882, 'logits/rejected': -0.6992577314376831, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.07462374866008759, 'epsilon_dpo/loss_margin_mean': 10.123734474182129, 'epsilon_dpo/beta_margin_mean': 0.7506232857704163, 'epsilon_dpo/beta_margin_std': 0.8795703053474426, 'epsilon_dpo/beta_margin_grad_mean': -0.34572863578796387, 'epsilon_dpo/beta_margin_grad_std': 0.15755517780780792, 'kl/beta': 0.07497958093881607, 'kl/avg_steps': 0.484375, 'epoch': 0.09} 9%|███████▍ | 64/681 [02:55<27:22, 2.66s/it] 
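The `rewards/chosen`, `rewards/rejected`, and `rewards/margins` fields above follow the standard DPO convention of implicit rewards, i.e. beta times the log-prob shift of the policy against the reference model, and the run config shows `loss_type=sigmoid`. A minimal sketch of that textbook sigmoid objective is below, fed with the step-40 `logps/*` values from this log. Note this is illustrative only: the epsilon-DPO variant logged here adapts `beta` per step (see `epsilon_dpo/beta`), so the computed loss will not reproduce the logged `loss` values exactly.

```python
import math

def dpo_sigmoid_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """Textbook DPO sigmoid loss for one preference pair.

    logp_* are summed log-probs of the chosen/rejected completions under
    the policy; ref_logp_* are the same under the frozen reference model.
    Returns (loss, reward margin).
    """
    reward_c = beta * (logp_c - ref_logp_c)   # implicit reward, chosen
    reward_r = beta * (logp_r - ref_logp_r)   # implicit reward, rejected
    margin = reward_c - reward_r              # logged as rewards/margins
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return loss, margin

# Step-40 logps from the log above (beta=0.1 is the static config value):
loss, margin = dpo_sigmoid_loss(-57.7797, -75.6452, -57.9851, -74.3001, beta=0.1)
```

With these inputs the margin comes out near the logged `epsilon_dpo/loss_margin_mean` for that step scaled by beta, which is consistent with that field being the beta-free log-prob margin.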
10%|███████▌ | 65/681 [02:58<27:26, 2.67s/it] {'loss': 0.9228, 'grad_norm': 45.28920364379883, 'learning_rate': 4.63768115942029e-07, 'rewards/chosen': -0.2566946744918823, 'rewards/rejected': -1.0142968893051147, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7576022744178772, 'logps/chosen': -68.27507019042969, 'logps/rejected': -123.65458679199219, 'logps/ref_chosen': -64.83599853515625, 'logps/ref_rejected': -109.94645690917969, 'logits/chosen': -0.9071321487426758, 'logits/rejected': -0.8409342169761658, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.07420583814382553, 'epsilon_dpo/loss_margin_mean': 10.269062042236328, 'epsilon_dpo/beta_margin_mean': 0.7576022744178772, 'epsilon_dpo/beta_margin_std': 0.8702723383903503, 'epsilon_dpo/beta_margin_grad_mean': -0.3416244089603424, 'epsilon_dpo/beta_margin_grad_std': 0.1670026183128357, 'kl/beta': 0.0746181458234787, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|███████▌ | 65/681 [02:58<27:26, 2.67s/it] 10%|███████▋ | 66/681 [03:00<27:35, 2.69s/it] {'loss': 0.953, 'grad_norm': 42.59511184692383, 'learning_rate': 4.7101449275362313e-07, 'rewards/chosen': -0.23922501504421234, 'rewards/rejected': -0.9230127334594727, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6837877035140991, 'logps/chosen': -54.66441345214844, 'logps/rejected': -88.17784118652344, 'logps/ref_chosen': -51.44352722167969, 'logps/ref_rejected': -75.63629150390625, 'logits/chosen': -0.9653792381286621, 'logits/rejected': -0.8392778635025024, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0737907662987709, 'epsilon_dpo/loss_margin_mean': 9.32065200805664, 'epsilon_dpo/beta_margin_mean': 0.6837877035140991, 'epsilon_dpo/beta_margin_std': 0.8476912975311279, 'epsilon_dpo/beta_margin_grad_mean': -0.3596351146697998, 'epsilon_dpo/beta_margin_grad_std': 0.15223053097724915, 'kl/beta': 0.07420077174901962, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|███████▋ | 66/681 
[03:00<27:35, 2.69s/it] 10%|███████▊ | 67/681 [03:03<26:29, 2.59s/it] {'loss': 0.9462, 'grad_norm': 43.226829528808594, 'learning_rate': 4.782608695652174e-07, 'rewards/chosen': -0.21227487921714783, 'rewards/rejected': -0.9117189049720764, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.699444055557251, 'logps/chosen': -62.196495056152344, 'logps/rejected': -85.2276840209961, 'logps/ref_chosen': -59.34080505371094, 'logps/ref_rejected': -72.78729248046875, 'logits/chosen': -0.9449235200881958, 'logits/rejected': -0.8122668266296387, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.07342413067817688, 'epsilon_dpo/loss_margin_mean': 9.584704399108887, 'epsilon_dpo/beta_margin_mean': 0.6994439959526062, 'epsilon_dpo/beta_margin_std': 0.856719434261322, 'epsilon_dpo/beta_margin_grad_mean': -0.35666364431381226, 'epsilon_dpo/beta_margin_grad_std': 0.15520258247852325, 'kl/beta': 0.07378572225570679, 'kl/avg_steps': 0.5, 'epoch': 0.1} 10%|███████▊ | 67/681 [03:03<26:29, 2.59s/it] 10%|███████▉ | 68/681 [03:05<26:36, 2.60s/it] {'loss': 0.9309, 'grad_norm': 43.32976150512695, 'learning_rate': 4.855072463768116e-07, 'rewards/chosen': -0.24640579521656036, 'rewards/rejected': -0.9089112877845764, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6625055074691772, 'logps/chosen': -68.5649642944336, 'logps/rejected': -89.68373107910156, 'logps/ref_chosen': -65.2058334350586, 'logps/ref_rejected': -77.20724487304688, 'logits/chosen': -0.9087271094322205, 'logits/rejected': -0.74156653881073, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.07299000769853592, 'epsilon_dpo/loss_margin_mean': 9.117342948913574, 'epsilon_dpo/beta_margin_mean': 0.6625055074691772, 'epsilon_dpo/beta_margin_std': 0.7009024024009705, 'epsilon_dpo/beta_margin_grad_mean': -0.3574289083480835, 'epsilon_dpo/beta_margin_grad_std': 0.1356179416179657, 'kl/beta': 0.07341863214969635, 'kl/avg_steps': 0.59375, 'epoch': 0.1} 10%|███████▉ | 
68/681 [03:05<26:36, 2.60s/it] 10%|████████ | 69/681 [03:08<27:24, 2.69s/it] {'loss': 0.8579, 'grad_norm': 44.75086212158203, 'learning_rate': 4.927536231884058e-07, 'rewards/chosen': -0.28851306438446045, 'rewards/rejected': -1.1091153621673584, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8206021785736084, 'logps/chosen': -63.78904342651367, 'logps/rejected': -118.71281433105469, 'logps/ref_chosen': -59.81924057006836, 'logps/ref_rejected': -103.38886260986328, 'logits/chosen': -0.9024134874343872, 'logits/rejected': -0.7688239216804504, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.07258199155330658, 'epsilon_dpo/loss_margin_mean': 11.354151725769043, 'epsilon_dpo/beta_margin_mean': 0.8206022381782532, 'epsilon_dpo/beta_margin_std': 0.8429078459739685, 'epsilon_dpo/beta_margin_grad_mean': -0.3325771987438202, 'epsilon_dpo/beta_margin_grad_std': 0.14603158831596375, 'kl/beta': 0.07298527657985687, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|████████ | 69/681 [03:08<27:24, 2.69s/it] 10%|████████ | 70/681 [03:11<26:52, 2.64s/it] {'loss': 0.8686, 'grad_norm': 48.5513801574707, 'learning_rate': 5e-07, 'rewards/chosen': -0.41746020317077637, 'rewards/rejected': -1.298567533493042, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8811073303222656, 'logps/chosen': -67.6795654296875, 'logps/rejected': -109.08511352539062, 'logps/ref_chosen': -61.930641174316406, 'logps/ref_rejected': -91.060791015625, 'logits/chosen': -0.9441779851913452, 'logits/rejected': -0.8562849760055542, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0721760094165802, 'epsilon_dpo/loss_margin_mean': 12.275406837463379, 'epsilon_dpo/beta_margin_mean': 0.8811073899269104, 'epsilon_dpo/beta_margin_std': 0.9766340851783752, 'epsilon_dpo/beta_margin_grad_mean': -0.32791411876678467, 'epsilon_dpo/beta_margin_grad_std': 0.17222338914871216, 'kl/beta': 0.07257703691720963, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|████████ | 
70/681 [03:11<26:52, 2.64s/it] 10%|████████▏ | 71/681 [03:13<26:51, 2.64s/it] {'loss': 0.7838, 'grad_norm': 43.49538040161133, 'learning_rate': 4.999967061337492e-07, 'rewards/chosen': -0.3936518430709839, 'rewards/rejected': -1.431180715560913, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0375288724899292, 'logps/chosen': -67.21176147460938, 'logps/rejected': -117.31230163574219, 'logps/ref_chosen': -61.750343322753906, 'logps/ref_rejected': -97.33662414550781, 'logits/chosen': -0.9862484931945801, 'logits/rejected': -0.8802157044410706, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.07177228480577469, 'epsilon_dpo/loss_margin_mean': 14.51425838470459, 'epsilon_dpo/beta_margin_mean': 1.0375287532806396, 'epsilon_dpo/beta_margin_std': 1.0877079963684082, 'epsilon_dpo/beta_margin_grad_mean': -0.30481892824172974, 'epsilon_dpo/beta_margin_grad_std': 0.15653713047504425, 'kl/beta': 0.07217106968164444, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|████████▏ | 71/681 [03:13<26:51, 2.64s/it] 11%|████████▎ | 72/681 [03:16<27:20, 2.69s/it] {'loss': 0.8218, 'grad_norm': 55.14749526977539, 'learning_rate': 4.999868246217933e-07, 'rewards/chosen': -0.4470614492893219, 'rewards/rejected': -1.5171492099761963, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.0700877904891968, 'logps/chosen': -72.2786636352539, 'logps/rejected': -116.5944595336914, 'logps/ref_chosen': -66.05341339111328, 'logps/ref_rejected': -95.2869873046875, 'logits/chosen': -0.9780253171920776, 'logits/rejected': -0.8922737836837769, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.07130353897809982, 'epsilon_dpo/loss_margin_mean': 15.082220077514648, 'epsilon_dpo/beta_margin_mean': 1.0700877904891968, 'epsilon_dpo/beta_margin_std': 1.190470814704895, 'epsilon_dpo/beta_margin_grad_mean': -0.30151036381721497, 'epsilon_dpo/beta_margin_grad_std': 0.1891818344593048, 'kl/beta': 0.07176738232374191, 'kl/avg_steps': 0.65625, 
'epoch': 0.11} 11%|████████▎ | 72/681 [03:16<27:20, 2.69s/it] 11%|████████▍ | 73/681 [03:19<27:47, 2.74s/it] {'loss': 1.0033, 'grad_norm': 63.31758499145508, 'learning_rate': 4.999703557245192e-07, 'rewards/chosen': -0.5787585973739624, 'rewards/rejected': -1.6389822959899902, 'rewards/accuracies': 0.765625, 'rewards/margins': 1.0602235794067383, 'logps/chosen': -74.35950469970703, 'logps/rejected': -113.63124084472656, 'logps/ref_chosen': -66.25627136230469, 'logps/ref_rejected': -90.45613861083984, 'logits/chosen': -1.0396358966827393, 'logits/rejected': -0.9342153072357178, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.07097236067056656, 'epsilon_dpo/loss_margin_mean': 15.071878433227539, 'epsilon_dpo/beta_margin_mean': 1.0602235794067383, 'epsilon_dpo/beta_margin_std': 1.6266452074050903, 'epsilon_dpo/beta_margin_grad_mean': -0.3319794535636902, 'epsilon_dpo/beta_margin_grad_std': 0.23944905400276184, 'kl/beta': 0.0712994784116745, 'kl/avg_steps': 0.46875, 'epoch': 0.11} 11%|████████▍ | 73/681 [03:19<27:47, 2.74s/it] 11%|████████▌ | 74/681 [03:22<27:23, 2.71s/it] {'loss': 0.9142, 'grad_norm': 58.661258697509766, 'learning_rate': 4.999472998758977e-07, 'rewards/chosen': -0.5950509309768677, 'rewards/rejected': -1.7296350002288818, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.1345840692520142, 'logps/chosen': -61.79954528808594, 'logps/rejected': -120.50003051757812, 'logps/ref_chosen': -53.42488098144531, 'logps/ref_rejected': -95.94693756103516, 'logits/chosen': -1.0253856182098389, 'logits/rejected': -0.9532393217086792, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.07061904668807983, 'epsilon_dpo/loss_margin_mean': 16.178430557250977, 'epsilon_dpo/beta_margin_mean': 1.1345840692520142, 'epsilon_dpo/beta_margin_std': 1.767999291419983, 'epsilon_dpo/beta_margin_grad_mean': -0.3144068717956543, 'epsilon_dpo/beta_margin_grad_std': 0.20813730359077454, 'kl/beta': 0.07096681743860245, 
'kl/avg_steps': 0.5, 'epoch': 0.11} 11%|████████▌ | 74/681 [03:22<27:23, 2.71s/it] 11%|████████▋ | 75/681 [03:24<27:38, 2.74s/it] {'loss': 0.6817, 'grad_norm': 43.96208190917969, 'learning_rate': 4.999176576834721e-07, 'rewards/chosen': -0.5542978048324585, 'rewards/rejected': -2.123166084289551, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.5688681602478027, 'logps/chosen': -59.737876892089844, 'logps/rejected': -141.58197021484375, 'logps/ref_chosen': -51.861663818359375, 'logps/ref_rejected': -111.25397491455078, 'logits/chosen': -1.0390228033065796, 'logits/rejected': -0.9516497254371643, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.07017943263053894, 'epsilon_dpo/loss_margin_mean': 22.451784133911133, 'epsilon_dpo/beta_margin_mean': 1.5688680410385132, 'epsilon_dpo/beta_margin_std': 1.6174330711364746, 'epsilon_dpo/beta_margin_grad_mean': -0.2569746673107147, 'epsilon_dpo/beta_margin_grad_std': 0.1957414597272873, 'kl/beta': 0.07061374932527542, 'kl/avg_steps': 0.625, 'epoch': 0.11} 11%|████████▋ | 75/681 [03:24<27:38, 2.74s/it] 11%|████████▊ | 76/681 [03:27<27:19, 2.71s/it] {'loss': 0.8908, 'grad_norm': 55.30027389526367, 'learning_rate': 4.998814299283415e-07, 'rewards/chosen': -0.6522326469421387, 'rewards/rejected': -1.6348612308502197, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.9826285243034363, 'logps/chosen': -62.58606719970703, 'logps/rejected': -101.7059326171875, 'logps/ref_chosen': -53.26604080200195, 'logps/ref_rejected': -78.21662139892578, 'logits/chosen': -1.0337252616882324, 'logits/rejected': -0.9552336931228638, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.06976546347141266, 'epsilon_dpo/loss_margin_mean': 14.169283866882324, 'epsilon_dpo/beta_margin_mean': 0.982628583908081, 'epsilon_dpo/beta_margin_std': 1.223758339881897, 'epsilon_dpo/beta_margin_grad_mean': -0.311787486076355, 'epsilon_dpo/beta_margin_grad_std': 0.19355669617652893, 'kl/beta': 
0.0701751559972763, 'kl/avg_steps': 0.59375, 'epoch': 0.11} 11%|████████▊ | 76/681 [03:27<27:19, 2.71s/it] 11%|████████▉ | 77/681 [03:29<26:06, 2.59s/it] {'loss': 0.7442, 'grad_norm': 63.71240997314453, 'learning_rate': 4.998386175651409e-07, 'rewards/chosen': -0.5564785003662109, 'rewards/rejected': -2.115208148956299, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.558729648590088, 'logps/chosen': -66.05867004394531, 'logps/rejected': -124.33473205566406, 'logps/ref_chosen': -58.0966796875, 'logps/ref_rejected': -93.77361297607422, 'logits/chosen': -1.067899465560913, 'logits/rejected': -1.0597002506256104, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.06928827613592148, 'epsilon_dpo/loss_margin_mean': 22.599130630493164, 'epsilon_dpo/beta_margin_mean': 1.558729648590088, 'epsilon_dpo/beta_margin_std': 1.6443172693252563, 'epsilon_dpo/beta_margin_grad_mean': -0.24933093786239624, 'epsilon_dpo/beta_margin_grad_std': 0.21627967059612274, 'kl/beta': 0.06976094841957092, 'kl/avg_steps': 0.6875, 'epoch': 0.11} 11%|████████▉ | 77/681 [03:29<26:06, 2.59s/it] 11%|█████████ | 78/681 [03:32<26:29, 2.64s/it] {'loss': 0.8215, 'grad_norm': 54.934146881103516, 'learning_rate': 4.997892217220159e-07, 'rewards/chosen': -0.5324300527572632, 'rewards/rejected': -1.688539981842041, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.1561098098754883, 'logps/chosen': -63.288551330566406, 'logps/rejected': -109.48651123046875, 'logps/ref_chosen': -55.61378479003906, 'logps/ref_rejected': -84.93436431884766, 'logits/chosen': -1.0190231800079346, 'logits/rejected': -0.9725791215896606, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.06890178471803665, 'epsilon_dpo/loss_margin_mean': 16.87738037109375, 'epsilon_dpo/beta_margin_mean': 1.1561099290847778, 'epsilon_dpo/beta_margin_std': 1.3384202718734741, 'epsilon_dpo/beta_margin_grad_mean': -0.2992877960205078, 'epsilon_dpo/beta_margin_grad_std': 
0.2032037228345871, 'kl/beta': 0.06928461790084839, 'kl/avg_steps': 0.5625, 'epoch': 0.11} 11%|█████████ | 78/681 [03:32<26:29, 2.64s/it] 12%|█████████▏ | 79/681 [03:35<26:44, 2.67s/it] {'loss': 0.8577, 'grad_norm': 49.639915466308594, 'learning_rate': 4.997332437005931e-07, 'rewards/chosen': -0.5230532288551331, 'rewards/rejected': -1.7710298299789429, 'rewards/accuracies': 0.71875, 'rewards/margins': 1.2479766607284546, 'logps/chosen': -63.02421188354492, 'logps/rejected': -113.54330444335938, 'logps/ref_chosen': -55.45048522949219, 'logps/ref_rejected': -87.64756774902344, 'logits/chosen': -1.024808645248413, 'logits/rejected': -0.9812244176864624, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.06860251724720001, 'epsilon_dpo/loss_margin_mean': 18.32201385498047, 'epsilon_dpo/beta_margin_mean': 1.2479766607284546, 'epsilon_dpo/beta_margin_std': 1.5779584646224976, 'epsilon_dpo/beta_margin_grad_mean': -0.30763792991638184, 'epsilon_dpo/beta_margin_grad_std': 0.2211284041404724, 'kl/beta': 0.0688970759510994, 'kl/avg_steps': 0.4375, 'epoch': 0.12} 12%|█████████▏ | 79/681 [03:35<26:44, 2.67s/it] 12%|█████████▎ | 80/681 [03:38<26:41, 2.66s/it] {'loss': 0.9049, 'grad_norm': 54.372802734375, 'learning_rate': 4.996706849759452e-07, 'rewards/chosen': -0.6747243404388428, 'rewards/rejected': -1.7966513633728027, 'rewards/accuracies': 0.78125, 'rewards/margins': 1.12192702293396, 'logps/chosen': -68.34993743896484, 'logps/rejected': -113.91645050048828, 'logps/ref_chosen': -58.519290924072266, 'logps/ref_rejected': -87.54750061035156, 'logits/chosen': -1.0545909404754639, 'logits/rejected': -0.9393061399459839, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.06835716217756271, 'epsilon_dpo/loss_margin_mean': 16.538299560546875, 'epsilon_dpo/beta_margin_mean': 1.12192702293396, 'epsilon_dpo/beta_margin_std': 1.5621590614318848, 'epsilon_dpo/beta_margin_grad_mean': -0.3214108943939209, 
'epsilon_dpo/beta_margin_grad_std': 0.21686489880084991, 'kl/beta': 0.06859695911407471, 'kl/avg_steps': 0.359375, 'epoch': 0.12} 12%|█████████▎ | 80/681 [03:38<26:41, 2.66s/it] 12%|█████████▍ | 81/681 [03:40<27:28, 2.75s/it] {'loss': 0.7737, 'grad_norm': 58.842952728271484, 'learning_rate': 4.996015471965529e-07, 'rewards/chosen': -0.5745280981063843, 'rewards/rejected': -2.063444137573242, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.4889161586761475, 'logps/chosen': -74.84884643554688, 'logps/rejected': -160.08575439453125, 'logps/ref_chosen': -66.44886779785156, 'logps/ref_rejected': -129.66270446777344, 'logits/chosen': -1.1194102764129639, 'logits/rejected': -1.0304535627365112, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.06799500435590744, 'epsilon_dpo/loss_margin_mean': 22.02306365966797, 'epsilon_dpo/beta_margin_mean': 1.488916277885437, 'epsilon_dpo/beta_margin_std': 1.7577481269836426, 'epsilon_dpo/beta_margin_grad_mean': -0.2737717628479004, 'epsilon_dpo/beta_margin_grad_std': 0.22165818512439728, 'kl/beta': 0.06835132092237473, 'kl/avg_steps': 0.53125, 'epoch': 0.12} 12%|█████████▍ | 81/681 [03:41<27:28, 2.75s/it] 12%|█████████▌ | 82/681 [03:43<26:46, 2.68s/it] {'loss': 1.0089, 'grad_norm': 72.6407699584961, 'learning_rate': 4.995258321842611e-07, 'rewards/chosen': -0.6823013424873352, 'rewards/rejected': -1.8587411642074585, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.176439881324768, 'logps/chosen': -62.255680084228516, 'logps/rejected': -118.30809020996094, 'logps/ref_chosen': -52.232383728027344, 'logps/ref_rejected': -90.74325561523438, 'logits/chosen': -1.0402679443359375, 'logits/rejected': -1.000870943069458, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.06761444360017776, 'epsilon_dpo/loss_margin_mean': 17.541545867919922, 'epsilon_dpo/beta_margin_mean': 1.176439881324768, 'epsilon_dpo/beta_margin_std': 1.8416661024093628, 
'epsilon_dpo/beta_margin_grad_mean': -0.31272122263908386, 'epsilon_dpo/beta_margin_grad_std': 0.24187932908535004, 'kl/beta': 0.06799012422561646, 'kl/avg_steps': 0.5625, 'epoch': 0.12} 12%|█████████▌ | 82/681 [03:43<26:46, 2.68s/it] 12%|█████████▋ | 83/681 [03:46<26:08, 2.62s/it] {'loss': 0.8203, 'grad_norm': 59.325870513916016, 'learning_rate': 4.994435419342304e-07, 'rewards/chosen': -0.662756085395813, 'rewards/rejected': -1.9837496280670166, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.3209935426712036, 'logps/chosen': -65.63611602783203, 'logps/rejected': -133.28402709960938, 'logps/ref_chosen': -55.82738494873047, 'logps/ref_rejected': -103.71590423583984, 'logits/chosen': -1.1070338487625122, 'logits/rejected': -0.9864081144332886, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.06732076406478882, 'epsilon_dpo/loss_margin_mean': 19.759384155273438, 'epsilon_dpo/beta_margin_mean': 1.3209935426712036, 'epsilon_dpo/beta_margin_std': 1.548567295074463, 'epsilon_dpo/beta_margin_grad_mean': -0.29003429412841797, 'epsilon_dpo/beta_margin_grad_std': 0.22237923741340637, 'kl/beta': 0.06760982424020767, 'kl/avg_steps': 0.4375, 'epoch': 0.12} 12%|█████████▋ | 83/681 [03:46<26:08, 2.62s/it] 12%|█████████▋ | 84/681 [03:48<26:53, 2.70s/it] {'loss': 0.7733, 'grad_norm': 47.83616256713867, 'learning_rate': 4.993546786148857e-07, 'rewards/chosen': -0.5357580184936523, 'rewards/rejected': -1.6826353073120117, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.1468771696090698, 'logps/chosen': -75.1614990234375, 'logps/rejected': -112.50204467773438, 'logps/ref_chosen': -67.1761703491211, 'logps/ref_rejected': -87.29859924316406, 'logits/chosen': -1.0478136539459229, 'logits/rejected': -1.0096745491027832, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.06692232191562653, 'epsilon_dpo/loss_margin_mean': 17.218107223510742, 'epsilon_dpo/beta_margin_mean': 1.1468771696090698, 
'epsilon_dpo/beta_margin_std': 1.1552484035491943, 'epsilon_dpo/beta_margin_grad_mean': -0.2845773994922638, 'epsilon_dpo/beta_margin_grad_std': 0.18569572269916534, 'kl/beta': 0.06731531769037247, 'kl/avg_steps': 0.59375, 'epoch': 0.12} 12%|█████████▋ | 84/681 [03:48<26:53, 2.70s/it] 12%|█████████▊ | 85/681 [03:51<26:56, 2.71s/it] {'loss': 0.8465, 'grad_norm': 52.211517333984375, 'learning_rate': 4.992592445678582e-07, 'rewards/chosen': -0.5604950189590454, 'rewards/rejected': -1.6872422695159912, 'rewards/accuracies': 0.734375, 'rewards/margins': 1.1267473697662354, 'logps/chosen': -66.77091979980469, 'logps/rejected': -104.02989196777344, 'logps/ref_chosen': -58.406620025634766, 'logps/ref_rejected': -78.63880157470703, 'logits/chosen': -1.0085999965667725, 'logits/rejected': -1.0254071950912476, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'epsilon_dpo/beta': 0.06669463217258453, 'epsilon_dpo/loss_margin_mean': 17.026790618896484, 'epsilon_dpo/beta_margin_mean': 1.1267473697662354, 'epsilon_dpo/beta_margin_std': 1.3287572860717773, 'epsilon_dpo/beta_margin_grad_mean': -0.30865395069122314, 'epsilon_dpo/beta_margin_grad_std': 0.20696672797203064, 'kl/beta': 0.06691799312829971, 'kl/avg_steps': 0.34375, 'epoch': 0.12} 12%|█████████▊ | 85/681 [03:51<26:56, 2.71s/it] 13%|█████████▉ | 86/681 [03:54<27:40, 2.79s/it] {'loss': 0.9947, 'grad_norm': 66.92620849609375, 'learning_rate': 4.991572423079235e-07, 'rewards/chosen': -0.6760549545288086, 'rewards/rejected': -1.835242748260498, 'rewards/accuracies': 0.78125, 'rewards/margins': 1.1591877937316895, 'logps/chosen': -66.24519348144531, 'logps/rejected': -115.83348083496094, 'logps/ref_chosen': -56.13746643066406, 'logps/ref_rejected': -88.12165069580078, 'logits/chosen': -1.037870168685913, 'logits/rejected': -1.011461853981018, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'epsilon_dpo/beta': 0.06642446666955948, 'epsilon_dpo/loss_margin_mean': 17.604108810424805, 
'epsilon_dpo/beta_margin_mean': 1.1591877937316895, 'epsilon_dpo/beta_margin_std': 1.8596081733703613, 'epsilon_dpo/beta_margin_grad_mean': -0.32995861768722534, 'epsilon_dpo/beta_margin_grad_std': 0.23720116913318634, 'kl/beta': 0.06668874621391296, 'kl/avg_steps': 0.40625, 'epoch': 0.13} 13%|█████████▉ | 86/681 [03:54<27:40, 2.79s/it] 13%|██████████ | 87/681 [03:57<27:15, 2.75s/it] {'loss': 0.8572, 'grad_norm': 52.268272399902344, 'learning_rate': 4.990486745229364e-07, 'rewards/chosen': -0.641059398651123, 'rewards/rejected': -1.9223012924194336, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.2812418937683105, 'logps/chosen': -65.29768371582031, 'logps/rejected': -124.65135192871094, 'logps/ref_chosen': -55.63609313964844, 'logps/ref_rejected': -95.46757507324219, 'logits/chosen': -1.0352756977081299, 'logits/rejected': -0.9126079082489014, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.06601040810346603, 'epsilon_dpo/loss_margin_mean': 19.522178649902344, 'epsilon_dpo/beta_margin_mean': 1.2812418937683105, 'epsilon_dpo/beta_margin_std': 1.6002838611602783, 'epsilon_dpo/beta_margin_grad_mean': -0.2889256775379181, 'epsilon_dpo/beta_margin_grad_std': 0.22318826615810394, 'kl/beta': 0.06641892343759537, 'kl/avg_steps': 0.625, 'epoch': 0.13} 13%|██████████ | 87/681 [03:57<27:15, 2.75s/it] 13%|██████████▏ | 88/681 [04:00<27:10, 2.75s/it] {'loss': 0.9858, 'grad_norm': 58.52445602416992, 'learning_rate': 4.989335440737586e-07, 'rewards/chosen': -0.7829633951187134, 'rewards/rejected': -1.7454442977905273, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.9624808430671692, 'logps/chosen': -85.53303527832031, 'logps/rejected': -133.33975219726562, 'logps/ref_chosen': -73.67115020751953, 'logps/ref_rejected': -106.70849609375, 'logits/chosen': -1.0081560611724854, 'logits/rejected': -0.9678352475166321, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.06572417914867401, 
'epsilon_dpo/loss_margin_mean': 14.769360542297363, 'epsilon_dpo/beta_margin_mean': 0.9624807834625244, 'epsilon_dpo/beta_margin_std': 1.440002202987671, 'epsilon_dpo/beta_margin_grad_mean': -0.33352962136268616, 'epsilon_dpo/beta_margin_grad_std': 0.22125916182994843, 'kl/beta': 0.0660063847899437, 'kl/avg_steps': 0.4375, 'epoch': 0.13} 13%|██████████▏ | 88/681 [04:00<27:10, 2.75s/it] 13%|██████████▎ | 89/681 [04:02<26:44, 2.71s/it] {'loss': 0.8338, 'grad_norm': 42.91395950317383, 'learning_rate': 4.988118539941847e-07, 'rewards/chosen': -0.4233604073524475, 'rewards/rejected': -1.4193130731582642, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9959526658058167, 'logps/chosen': -67.0821304321289, 'logps/rejected': -103.86257934570312, 'logps/ref_chosen': -60.624916076660156, 'logps/ref_rejected': -82.08354949951172, 'logits/chosen': -0.9808931946754456, 'logits/rejected': -0.8977110385894775, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.06527356803417206, 'epsilon_dpo/loss_margin_mean': 15.32180404663086, 'epsilon_dpo/beta_margin_mean': 0.9959526658058167, 'epsilon_dpo/beta_margin_std': 1.146094799041748, 'epsilon_dpo/beta_margin_grad_mean': -0.31276053190231323, 'epsilon_dpo/beta_margin_grad_std': 0.17064036428928375, 'kl/beta': 0.06571885943412781, 'kl/avg_steps': 0.6875, 'epoch': 0.13} 13%|██████████▎ | 89/681 [04:02<26:44, 2.71s/it] 13%|██████████▍ | 90/681 [04:05<26:09, 2.66s/it] {'loss': 0.8947, 'grad_norm': 52.33935546875, 'learning_rate': 4.986836074908615e-07, 'rewards/chosen': -0.565422773361206, 'rewards/rejected': -1.8093502521514893, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.2439274787902832, 'logps/chosen': -61.970252990722656, 'logps/rejected': -139.51487731933594, 'logps/ref_chosen': -53.285308837890625, 'logps/ref_rejected': -111.54470825195312, 'logits/chosen': -1.033602237701416, 'logits/rejected': -0.973494291305542, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 
'epsilon_dpo/beta': 0.0648890808224678, 'epsilon_dpo/loss_margin_mean': 19.285219192504883, 'epsilon_dpo/beta_margin_mean': 1.2439275979995728, 'epsilon_dpo/beta_margin_std': 1.7182693481445312, 'epsilon_dpo/beta_margin_grad_mean': -0.31154337525367737, 'epsilon_dpo/beta_margin_grad_std': 0.21933160722255707, 'kl/beta': 0.06527013331651688, 'kl/avg_steps': 0.59375, 'epoch': 0.13} 13%|██████████▍ | 90/681 [04:05<26:09, 2.66s/it] 13%|██████████▌ | 91/681 [04:07<26:13, 2.67s/it] {'loss': 0.8525, 'grad_norm': 51.4193229675293, 'learning_rate': 4.985488079432037e-07, 'rewards/chosen': -0.5083088874816895, 'rewards/rejected': -1.681951880455017, 'rewards/accuracies': 0.734375, 'rewards/margins': 1.1736429929733276, 'logps/chosen': -69.62163543701172, 'logps/rejected': -113.9945068359375, 'logps/ref_chosen': -61.80295944213867, 'logps/ref_rejected': -87.87395477294922, 'logits/chosen': -1.039869785308838, 'logits/rejected': -1.0203289985656738, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'epsilon_dpo/beta': 0.0646277442574501, 'epsilon_dpo/loss_margin_mean': 18.3018741607666, 'epsilon_dpo/beta_margin_mean': 1.1736429929733276, 'epsilon_dpo/beta_margin_std': 1.418158769607544, 'epsilon_dpo/beta_margin_grad_mean': -0.3069167733192444, 'epsilon_dpo/beta_margin_grad_std': 0.21806450188159943, 'kl/beta': 0.06488487869501114, 'kl/avg_steps': 0.40625, 'epoch': 0.13} 13%|██████████▌ | 91/681 [04:07<26:13, 2.67s/it] 14%|██████████▋ | 92/681 [04:10<25:57, 2.64s/it] {'loss': 0.8756, 'grad_norm': 45.67806625366211, 'learning_rate': 4.984074589033043e-07, 'rewards/chosen': -0.469423770904541, 'rewards/rejected': -1.575516700744629, 'rewards/accuracies': 0.75, 'rewards/margins': 1.106092929840088, 'logps/chosen': -58.884674072265625, 'logps/rejected': -102.44317626953125, 'logps/ref_chosen': -51.640769958496094, 'logps/ref_rejected': -77.88117980957031, 'logits/chosen': -1.0301735401153564, 'logits/rejected': -1.0114562511444092, 'kl/p_epsilon_steps': 0.6875, 
'kl/n_epsilon_steps': 0.296875, 'epsilon_dpo/beta': 0.06437624990940094, 'epsilon_dpo/loss_margin_mean': 17.31808853149414, 'epsilon_dpo/beta_margin_mean': 1.106092929840088, 'epsilon_dpo/beta_margin_std': 1.3934167623519897, 'epsilon_dpo/beta_margin_grad_mean': -0.31448835134506226, 'epsilon_dpo/beta_margin_grad_std': 0.21340087056159973, 'kl/beta': 0.06462235003709793, 'kl/avg_steps': 0.390625, 'epoch': 0.14} 14%|██████████▋ | 92/681 [04:10<25:57, 2.64s/it] 14%|██████████▊ | 93/681 [04:12<24:38, 2.51s/it] {'loss': 0.7855, 'grad_norm': 39.11857986450195, 'learning_rate': 4.982595640958425e-07, 'rewards/chosen': -0.5029890537261963, 'rewards/rejected': -1.6084156036376953, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.105426549911499, 'logps/chosen': -60.38475036621094, 'logps/rejected': -102.37294006347656, 'logps/ref_chosen': -52.529239654541016, 'logps/ref_rejected': -77.1607437133789, 'logits/chosen': -1.0493080615997314, 'logits/rejected': -0.9445855617523193, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.06395485997200012, 'epsilon_dpo/loss_margin_mean': 17.356679916381836, 'epsilon_dpo/beta_margin_mean': 1.105426549911499, 'epsilon_dpo/beta_margin_std': 1.2089306116104126, 'epsilon_dpo/beta_margin_grad_mean': -0.29885995388031006, 'epsilon_dpo/beta_margin_grad_std': 0.17560887336730957, 'kl/beta': 0.06437090039253235, 'kl/avg_steps': 0.65625, 'epoch': 0.14} 14%|██████████▊ | 93/681 [04:12<24:38, 2.51s/it] 14%|██████████▉ | 94/681 [04:15<25:53, 2.65s/it] {'loss': 0.7487, 'grad_norm': 41.127235412597656, 'learning_rate': 4.98105127417984e-07, 'rewards/chosen': -0.5378746390342712, 'rewards/rejected': -1.7199702262878418, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.1820956468582153, 'logps/chosen': -69.6494140625, 'logps/rejected': -126.70462036132812, 'logps/ref_chosen': -61.22261047363281, 'logps/ref_rejected': -99.59902954101562, 'logits/chosen': -1.0464091300964355, 'logits/rejected': 
-1.0004725456237793, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.06359785050153732, 'epsilon_dpo/loss_margin_mean': 18.67878532409668, 'epsilon_dpo/beta_margin_mean': 1.1820956468582153, 'epsilon_dpo/beta_margin_std': 1.1670469045639038, 'epsilon_dpo/beta_margin_grad_mean': -0.2872365713119507, 'epsilon_dpo/beta_margin_grad_std': 0.18094860017299652, 'kl/beta': 0.06395121663808823, 'kl/avg_steps': 0.5625, 'epoch': 0.14} 14%|██████████▉ | 94/681 [04:15<25:53, 2.65s/it] 14%|███████████ | 95/681 [04:18<25:32, 2.61s/it] {'loss': 0.8374, 'grad_norm': 40.1151237487793, 'learning_rate': 4.979441529392784e-07, 'rewards/chosen': -0.4412747621536255, 'rewards/rejected': -1.386451244354248, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9451765418052673, 'logps/chosen': -59.46489715576172, 'logps/rejected': -97.84718322753906, 'logps/ref_chosen': -52.52364730834961, 'logps/ref_rejected': -75.88035583496094, 'logits/chosen': -1.0391969680786133, 'logits/rejected': -0.9397677779197693, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.06318248808383942, 'epsilon_dpo/loss_margin_mean': 15.025580406188965, 'epsilon_dpo/beta_margin_mean': 0.9451765418052673, 'epsilon_dpo/beta_margin_std': 1.0015146732330322, 'epsilon_dpo/beta_margin_grad_mean': -0.31443729996681213, 'epsilon_dpo/beta_margin_grad_std': 0.17064118385314941, 'kl/beta': 0.06359350681304932, 'kl/avg_steps': 0.65625, 'epoch': 0.14} 14%|███████████ | 95/681 [04:18<25:32, 2.61s/it] 14%|███████████▏ | 96/681 [04:20<25:37, 2.63s/it] {'loss': 0.727, 'grad_norm': 40.65923309326172, 'learning_rate': 4.977766449015534e-07, 'rewards/chosen': -0.3794512152671814, 'rewards/rejected': -1.6398093700408936, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.2603580951690674, 'logps/chosen': -68.17340850830078, 'logps/rejected': -122.7648696899414, 'logps/ref_chosen': -62.15697479248047, 'logps/ref_rejected': -96.59601593017578, 'logits/chosen': 
-1.0081816911697388, 'logits/rejected': -0.9590755701065063, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.06277056038379669, 'epsilon_dpo/loss_margin_mean': 20.152423858642578, 'epsilon_dpo/beta_margin_mean': 1.2603580951690674, 'epsilon_dpo/beta_margin_std': 1.311813235282898, 'epsilon_dpo/beta_margin_grad_mean': -0.27842971682548523, 'epsilon_dpo/beta_margin_grad_std': 0.17272476851940155, 'kl/beta': 0.0631788969039917, 'kl/avg_steps': 0.65625, 'epoch': 0.14} 14%|███████████▏ | 96/681 [04:20<25:37, 2.63s/it] 14%|███████████▎ | 97/681 [04:23<26:15, 2.70s/it] {'loss': 0.7976, 'grad_norm': 42.303070068359375, 'learning_rate': 4.976026077188012e-07, 'rewards/chosen': -0.43561291694641113, 'rewards/rejected': -1.4346325397491455, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9990196228027344, 'logps/chosen': -61.59731674194336, 'logps/rejected': -99.99954986572266, 'logps/ref_chosen': -54.64636993408203, 'logps/ref_rejected': -76.96475219726562, 'logits/chosen': -1.06300687789917, 'logits/rejected': -0.8783408403396606, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.062439776957035065, 'epsilon_dpo/loss_margin_mean': 16.083852767944336, 'epsilon_dpo/beta_margin_mean': 0.9990195631980896, 'epsilon_dpo/beta_margin_std': 0.9852281808853149, 'epsilon_dpo/beta_margin_grad_mean': -0.30438145995140076, 'epsilon_dpo/beta_margin_grad_std': 0.16797401010990143, 'kl/beta': 0.06276698410511017, 'kl/avg_steps': 0.53125, 'epoch': 0.14} 14%|███████████▎ | 97/681 [04:23<26:15, 2.70s/it] 14%|███████████▎ | 98/681 [04:26<25:37, 2.64s/it] {'loss': 0.8297, 'grad_norm': 43.674556732177734, 'learning_rate': 4.974220459770639e-07, 'rewards/chosen': -0.5198632478713989, 'rewards/rejected': -1.6088919639587402, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.0890288352966309, 'logps/chosen': -73.58518981933594, 'logps/rejected': -122.49948120117188, 'logps/ref_chosen': -65.25862884521484, 
'logps/ref_rejected': -96.5274887084961, 'logits/chosen': -1.0256155729293823, 'logits/rejected': -0.9630335569381714, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.062109820544719696, 'epsilon_dpo/loss_margin_mean': 17.645435333251953, 'epsilon_dpo/beta_margin_mean': 1.0890288352966309, 'epsilon_dpo/beta_margin_std': 1.186418890953064, 'epsilon_dpo/beta_margin_grad_mean': -0.2962830364704132, 'epsilon_dpo/beta_margin_grad_std': 0.19924990832805634, 'kl/beta': 0.062435299158096313, 'kl/avg_steps': 0.53125, 'epoch': 0.14} 14%|███████████▎ | 98/681 [04:26<25:37, 2.64s/it] 15%|███████████▍ | 99/681 [04:28<24:37, 2.54s/it] {'loss': 0.7496, 'grad_norm': 39.50008010864258, 'learning_rate': 4.972349644343108e-07, 'rewards/chosen': -0.4397137761116028, 'rewards/rejected': -1.6406198740005493, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.2009060382843018, 'logps/chosen': -52.740440368652344, 'logps/rejected': -113.07182312011719, 'logps/ref_chosen': -45.63848114013672, 'logps/ref_rejected': -86.43792724609375, 'logits/chosen': -1.0253294706344604, 'logits/rejected': -0.9613098502159119, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.06172337755560875, 'epsilon_dpo/loss_margin_mean': 19.53193473815918, 'epsilon_dpo/beta_margin_mean': 1.2009061574935913, 'epsilon_dpo/beta_margin_std': 1.299239993095398, 'epsilon_dpo/beta_margin_grad_mean': -0.29098957777023315, 'epsilon_dpo/beta_margin_grad_std': 0.1694328337907791, 'kl/beta': 0.062105365097522736, 'kl/avg_steps': 0.625, 'epoch': 0.15} 15%|███████████▍ | 99/681 [04:28<24:37, 2.54s/it] 15%|███████████▍ | 100/681 [04:31<25:05, 2.59s/it] {'loss': 0.9809, 'grad_norm': 48.6550178527832, 'learning_rate': 4.970413680203148e-07, 'rewards/chosen': -0.4583064317703247, 'rewards/rejected': -1.28363835811615, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.82533198595047, 'logps/chosen': -65.0089111328125, 'logps/rejected': -95.01786804199219, 
'logps/ref_chosen': -57.5939826965332, 'logps/ref_rejected': -74.06021118164062, 'logits/chosen': -0.9946659207344055, 'logits/rejected': -0.8828315734863281, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.061397869139909744, 'epsilon_dpo/loss_margin_mean': 13.542726516723633, 'epsilon_dpo/beta_margin_mean': 0.8253319263458252, 'epsilon_dpo/beta_margin_std': 1.1934319734573364, 'epsilon_dpo/beta_margin_grad_mean': -0.34565305709838867, 'epsilon_dpo/beta_margin_grad_std': 0.20056740939617157, 'kl/beta': 0.0617196150124073, 'kl/avg_steps': 0.53125, 'epoch': 0.15} 15%|███████████▍ | 100/681 [04:31<25:05, 2.59s/it][INFO|trainer.py:4307] 2026-04-18 00:42:40,415 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 00:42:40,415 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 00:42:40,415 >> Batch size = 8 0%| | 0/73 [00:00<?, ?it/s] ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-18 00:47:48,787 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-18 00:47:48,787 >> Batch size = 8 0%| | 0/73 [00:00<?, ?it/s] Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200 [INFO|configuration_utils.py:419] 2026-04-18 00:48:46,984 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200/config.json [INFO|configuration_utils.py:911] 2026-04-18 00:48:47,108 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-18 00:49:43,315 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. 
You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-18 00:49:43,450 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-18 00:49:43,556 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200/special_tokens_map.json 30%|██████████████████████▏ | 201/681 [15:06<13:18:44, 99.84s/it] {'loss': 0.6978, 'grad_norm': 32.52497482299805, 'learning_rate': 4.455721242469372e-07, 'rewards/chosen': -0.7529622316360474, 'rewards/rejected': -2.2497823238372803, 'rewards/accuracies': 0.875, 'rewards/margins': 1.4968202114105225, 'logps/chosen': -98.13409423828125, 'logps/rejected': -183.04058837890625, 'logps/ref_chosen': -75.4022216796875, 'logps/ref_rejected': -114.80821990966797, 'logits/chosen': -0.9245244264602661, 'logits/rejected': -0.8867864608764648, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.03303124010562897, 'epsilon_dpo/loss_margin_mean': 45.50050735473633, 'epsilon_dpo/beta_margin_mean': 1.4968202114105225, 'epsilon_dpo/beta_margin_std': 1.5476200580596924, 'epsilon_dpo/beta_margin_grad_mean': -0.2606509029865265, 'epsilon_dpo/beta_margin_grad_std': 0.19733315706253052, 'kl/beta': 0.03325657919049263, 'kl/avg_steps': 0.6875, 'epoch': 0.3} 30%|██████████████████████▏ | 201/681 [15:06<13:18:44, 99.84s/it] 30%|██████████████████████▌ | 202/681 [15:09<9:25:03, 70.78s/it] {'loss': 0.9358, 'grad_norm': 43.633670806884766, 'learning_rate': 4.4477014363141755e-07, 'rewards/chosen': -0.8780945539474487, 'rewards/rejected': 
-1.9198366403579712, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.0417420864105225, 'logps/chosen': -76.72681427001953, 'logps/rejected': -145.55088806152344, 'logps/ref_chosen': -50.101318359375, 'logps/ref_rejected': -86.98503112792969, 'logits/chosen': -0.8974350094795227, 'logits/rejected': -0.8948566913604736, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.03284699469804764, 'epsilon_dpo/loss_margin_mean': 31.94036293029785, 'epsilon_dpo/beta_margin_mean': 1.0417420864105225, 'epsilon_dpo/beta_margin_std': 1.408299446105957, 'epsilon_dpo/beta_margin_grad_mean': -0.3225107192993164, 'epsilon_dpo/beta_margin_grad_std': 0.2235589325428009, 'kl/beta': 0.033029500395059586, 'kl/avg_steps': 0.5625, 'epoch': 0.3} 30%|██████████████████████▌ | 202/681 [15:09<9:25:03, 70.78s/it] 30%|██████████████████████▋ | 203/681 [15:12<6:41:49, 50.44s/it] {'loss': 0.7446, 'grad_norm': 31.58559799194336, 'learning_rate': 4.439630306414758e-07, 'rewards/chosen': -0.7440138459205627, 'rewards/rejected': -1.9402940273284912, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.1962802410125732, 'logps/chosen': -83.34878540039062, 'logps/rejected': -145.45053100585938, 'logps/ref_chosen': -60.60969543457031, 'logps/ref_rejected': -85.89596557617188, 'logits/chosen': -0.9989201426506042, 'logits/rejected': -0.887505292892456, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.03264273330569267, 'epsilon_dpo/loss_margin_mean': 36.81545639038086, 'epsilon_dpo/beta_margin_mean': 1.1962802410125732, 'epsilon_dpo/beta_margin_std': 1.203896403312683, 'epsilon_dpo/beta_margin_grad_mean': -0.28617146611213684, 'epsilon_dpo/beta_margin_grad_std': 0.17759016156196594, 'kl/beta': 0.032844748347997665, 'kl/avg_steps': 0.625, 'epoch': 0.3} 30%|██████████████████████▋ | 203/681 [15:12<6:41:49, 50.44s/it] 30%|██████████████████████▊ | 204/681 [15:15<4:48:08, 36.24s/it] {'loss': 0.7734, 'grad_norm': 34.97591018676758, 
'learning_rate': 4.431508065452897e-07, 'rewards/chosen': -0.8980864882469177, 'rewards/rejected': -2.1283555030822754, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.230268955230713, 'logps/chosen': -107.8044204711914, 'logps/rejected': -153.438232421875, 'logps/ref_chosen': -80.16496276855469, 'logps/ref_rejected': -87.69590759277344, 'logits/chosen': -0.9733670949935913, 'logits/rejected': -0.8489159941673279, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.03243998438119888, 'epsilon_dpo/loss_margin_mean': 38.102867126464844, 'epsilon_dpo/beta_margin_mean': 1.230268955230713, 'epsilon_dpo/beta_margin_std': 1.4135836362838745, 'epsilon_dpo/beta_margin_grad_mean': -0.2921689748764038, 'epsilon_dpo/beta_margin_grad_std': 0.1850796341896057, 'kl/beta': 0.032640744000673294, 'kl/avg_steps': 0.625, 'epoch': 0.3} 30%|██████████████████████▊ | 204/681 [15:15<4:48:08, 36.24s/it] 30%|██████████████████████▉ | 205/681 [15:17<3:27:43, 26.18s/it] {'loss': 0.7467, 'grad_norm': 37.37656021118164, 'learning_rate': 4.4233349274571974e-07, 'rewards/chosen': -0.9544239044189453, 'rewards/rejected': -2.3895630836486816, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.4351390600204468, 'logps/chosen': -88.91207885742188, 'logps/rejected': -159.39739990234375, 'logps/ref_chosen': -59.384735107421875, 'logps/ref_rejected': -85.12505340576172, 'logits/chosen': -0.9361932277679443, 'logits/rejected': -0.8849596381187439, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.03224863111972809, 'epsilon_dpo/loss_margin_mean': 44.7449951171875, 'epsilon_dpo/beta_margin_mean': 1.4351390600204468, 'epsilon_dpo/beta_margin_std': 1.4817283153533936, 'epsilon_dpo/beta_margin_grad_mean': -0.266499400138855, 'epsilon_dpo/beta_margin_grad_std': 0.21775342524051666, 'kl/beta': 0.0324380062520504, 'kl/avg_steps': 0.59375, 'epoch': 0.3} 30%|██████████████████████▉ | 205/681 [15:17<3:27:43, 26.18s/it] 
30%|██████████████████████▉ | 206/681 [15:20<2:30:53, 19.06s/it] {'loss': 0.6167, 'grad_norm': 31.1833553314209, 'learning_rate': 4.415111107797445e-07, 'rewards/chosen': -0.9376621246337891, 'rewards/rejected': -2.5803780555725098, 'rewards/accuracies': 0.875, 'rewards/margins': 1.6427156925201416, 'logps/chosen': -76.15838623046875, 'logps/rejected': -179.63294982910156, 'logps/ref_chosen': -46.964500427246094, 'logps/ref_rejected': -98.9534912109375, 'logits/chosen': -0.9038573503494263, 'logits/rejected': -0.919549822807312, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0320381261408329, 'epsilon_dpo/loss_margin_mean': 51.48557662963867, 'epsilon_dpo/beta_margin_mean': 1.6427158117294312, 'epsilon_dpo/beta_margin_std': 1.4306899309158325, 'epsilon_dpo/beta_margin_grad_mean': -0.2319086343050003, 'epsilon_dpo/beta_margin_grad_std': 0.19648852944374084, 'kl/beta': 0.03224654123187065, 'kl/avg_steps': 0.65625, 'epoch': 0.3} 30%|██████████████████████▉ | 206/681 [15:20<2:30:53, 19.06s/it] 30%|███████████████████████ | 207/681 [15:22<1:51:42, 14.14s/it] {'loss': 0.6375, 'grad_norm': 33.46366500854492, 'learning_rate': 4.4068368231789365e-07, 'rewards/chosen': -0.7494818568229675, 'rewards/rejected': -2.459930896759033, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.710448980331421, 'logps/chosen': -79.54106140136719, 'logps/rejected': -161.87109375, 'logps/ref_chosen': -56.05625915527344, 'logps/ref_rejected': -84.44779968261719, 'logits/chosen': -0.9281418323516846, 'logits/rejected': -0.862322211265564, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.031819239258766174, 'epsilon_dpo/loss_margin_mean': 53.9384880065918, 'epsilon_dpo/beta_margin_mean': 1.7104490995407104, 'epsilon_dpo/beta_margin_std': 1.7299070358276367, 'epsilon_dpo/beta_margin_grad_mean': -0.24663381278514862, 'epsilon_dpo/beta_margin_grad_std': 0.18707990646362305, 'kl/beta': 0.03203630447387695, 'kl/avg_steps': 
0.6875, 'epoch': 0.3} 30%|███████████████████████ | 207/681 [15:22<1:51:42, 14.14s/it] 31%|███████████████████████▏ | 208/681 [15:25<1:24:41, 10.74s/it] {'loss': 0.7398, 'grad_norm': 47.45777893066406, 'learning_rate': 4.398512291636768e-07, 'rewards/chosen': -1.0961982011795044, 'rewards/rejected': -2.4754316806793213, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.3792335987091064, 'logps/chosen': -101.65859985351562, 'logps/rejected': -172.70254516601562, 'logps/ref_chosen': -67.06761169433594, 'logps/ref_rejected': -94.28689575195312, 'logits/chosen': -0.922553539276123, 'logits/rejected': -0.8577385544776917, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.031592030078172684, 'epsilon_dpo/loss_margin_mean': 43.82465744018555, 'epsilon_dpo/beta_margin_mean': 1.3792335987091064, 'epsilon_dpo/beta_margin_std': 1.4800665378570557, 'epsilon_dpo/beta_margin_grad_mean': -0.2750071585178375, 'epsilon_dpo/beta_margin_grad_std': 0.19815897941589355, 'kl/beta': 0.03181755915284157, 'kl/avg_steps': 0.71875, 'epoch': 0.31} 31%|███████████████████████▏ | 208/681 [15:25<1:24:41, 10.74s/it] 31%|███████████████████████▎ | 209/681 [15:28<1:04:56, 8.26s/it] {'loss': 0.7621, 'grad_norm': 33.01478958129883, 'learning_rate': 4.3901377325300857e-07, 'rewards/chosen': -0.8777600526809692, 'rewards/rejected': -2.209413528442383, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.331653356552124, 'logps/chosen': -84.05644226074219, 'logps/rejected': -151.44398498535156, 'logps/ref_chosen': -56.18169403076172, 'logps/ref_rejected': -80.94152069091797, 'logits/chosen': -0.8361650109291077, 'logits/rejected': -0.807563304901123, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.03140607476234436, 'epsilon_dpo/loss_margin_mean': 42.62771987915039, 'epsilon_dpo/beta_margin_mean': 1.3316532373428345, 'epsilon_dpo/beta_margin_std': 1.4257279634475708, 'epsilon_dpo/beta_margin_grad_mean': -0.27811017632484436, 
'epsilon_dpo/beta_margin_grad_std': 0.20568938553333282, 'kl/beta': 0.031590502709150314, 'kl/avg_steps': 0.59375, 'epoch': 0.31} 31%|███████████████████████▎ | 209/681 [15:28<1:04:56, 8.26s/it] 31%|████████████████████████ | 210/681 [15:30<51:25, 6.55s/it] {'loss': 0.6964, 'grad_norm': 34.61418914794922, 'learning_rate': 4.381713366536311e-07, 'rewards/chosen': -0.8289717435836792, 'rewards/rejected': -2.1574063301086426, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.3284344673156738, 'logps/chosen': -72.92080688476562, 'logps/rejected': -145.9642791748047, 'logps/ref_chosen': -46.371822357177734, 'logps/ref_rejected': -76.68162536621094, 'logits/chosen': -0.8779923915863037, 'logits/rejected': -0.8523108959197998, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.031171627342700958, 'epsilon_dpo/loss_margin_mean': 42.733673095703125, 'epsilon_dpo/beta_margin_mean': 1.3284344673156738, 'epsilon_dpo/beta_margin_std': 1.2857555150985718, 'epsilon_dpo/beta_margin_grad_mean': -0.2719811201095581, 'epsilon_dpo/beta_margin_grad_std': 0.17441512644290924, 'kl/beta': 0.0314040407538414, 'kl/avg_steps': 0.75, 'epoch': 0.31} 31%|████████████████████████ | 210/681 [15:30<51:25, 6.55s/it] 31%|████████████████████████▏ | 211/681 [15:33<41:47, 5.33s/it] {'loss': 0.8303, 'grad_norm': 46.81315994262695, 'learning_rate': 4.373239415645323e-07, 'rewards/chosen': -1.2315609455108643, 'rewards/rejected': -2.42732572555542, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.1957648992538452, 'logps/chosen': -118.53823852539062, 'logps/rejected': -165.2410430908203, 'logps/ref_chosen': -78.93235778808594, 'logps/ref_rejected': -86.82098388671875, 'logits/chosen': -0.9604991674423218, 'logits/rejected': -0.8495550155639648, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.031017513945698738, 'epsilon_dpo/loss_margin_mean': 38.81417465209961, 'epsilon_dpo/beta_margin_mean': 1.1957648992538452, 
'epsilon_dpo/beta_margin_std': 1.424613118171692, 'epsilon_dpo/beta_margin_grad_mean': -0.3004344403743744, 'epsilon_dpo/beta_margin_grad_std': 0.21112920343875885, 'kl/beta': 0.03117026388645172, 'kl/avg_steps': 0.5, 'epoch': 0.31} 31%|████████████████████████▏ | 211/681 [15:33<41:47, 5.33s/it] 31%|████████████████████████▎ | 212/681 [15:36<35:34, 4.55s/it] {'loss': 0.6557, 'grad_norm': 35.138973236083984, 'learning_rate': 4.3647161031536086e-07, 'rewards/chosen': -1.0024387836456299, 'rewards/rejected': -2.7132415771484375, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.7108030319213867, 'logps/chosen': -90.64710998535156, 'logps/rejected': -191.25003051757812, 'logps/ref_chosen': -58.19701385498047, 'logps/ref_rejected': -103.05784606933594, 'logits/chosen': -0.8966690897941589, 'logits/rejected': -0.919607400894165, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.030834117904305458, 'epsilon_dpo/loss_margin_mean': 55.742095947265625, 'epsilon_dpo/beta_margin_mean': 1.7108030319213867, 'epsilon_dpo/beta_margin_std': 1.6675610542297363, 'epsilon_dpo/beta_margin_grad_mean': -0.24542976915836334, 'epsilon_dpo/beta_margin_grad_std': 0.20712795853614807, 'kl/beta': 0.03101518750190735, 'kl/avg_steps': 0.59375, 'epoch': 0.31} 31%|████████████████████████▎ | 212/681 [15:36<35:34, 4.55s/it] 31%|████████████████████████▍ | 213/681 [15:38<31:17, 4.01s/it] {'loss': 0.6496, 'grad_norm': 38.70331954956055, 'learning_rate': 4.3561436536583774e-07, 'rewards/chosen': -0.9383978247642517, 'rewards/rejected': -2.6488285064697266, 'rewards/accuracies': 0.875, 'rewards/margins': 1.7104308605194092, 'logps/chosen': -98.09537506103516, 'logps/rejected': -180.55552673339844, 'logps/ref_chosen': -67.51271057128906, 'logps/ref_rejected': -93.91471862792969, 'logits/chosen': -0.9104989767074585, 'logits/rejected': -0.8471628427505493, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.030632847920060158, 
'epsilon_dpo/loss_margin_mean': 56.05814743041992, 'epsilon_dpo/beta_margin_mean': 1.7104308605194092, 'epsilon_dpo/beta_margin_std': 1.6659079790115356, 'epsilon_dpo/beta_margin_grad_mean': -0.24080544710159302, 'epsilon_dpo/beta_margin_grad_std': 0.1985619217157364, 'kl/beta': 0.030832121148705482, 'kl/avg_steps': 0.65625, 'epoch': 0.31} 31%|████████████████████████▍ | 213/681 [15:38<31:17, 4.01s/it] 31%|████████████████████████▌ | 214/681 [15:41<27:40, 3.56s/it] {'loss': 0.7848, 'grad_norm': 46.01312255859375, 'learning_rate': 4.3475222930516473e-07, 'rewards/chosen': -0.9344457387924194, 'rewards/rejected': -2.3005313873291016, 'rewards/accuracies': 0.875, 'rewards/margins': 1.3660856485366821, 'logps/chosen': -72.26516723632812, 'logps/rejected': -153.2633056640625, 'logps/ref_chosen': -41.604888916015625, 'logps/ref_rejected': -77.51741027832031, 'logits/chosen': -0.7757738828659058, 'logits/rejected': -0.7755211591720581, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.030418725684285164, 'epsilon_dpo/loss_margin_mean': 45.08562469482422, 'epsilon_dpo/beta_margin_mean': 1.3660856485366821, 'epsilon_dpo/beta_margin_std': 1.5981017351150513, 'epsilon_dpo/beta_margin_grad_mean': -0.28497371077537537, 'epsilon_dpo/beta_margin_grad_std': 0.19882433116436005, 'kl/beta': 0.030631106346845627, 'kl/avg_steps': 0.703125, 'epoch': 0.31} 31%|████████████████████████▌ | 214/681 [15:41<27:40, 3.56s/it] 32%|████████████████████████▋ | 215/681 [15:44<26:43, 3.44s/it] {'loss': 0.7021, 'grad_norm': 36.04100036621094, 'learning_rate': 4.3388522485142885e-07, 'rewards/chosen': -1.0044194459915161, 'rewards/rejected': -2.462951183319092, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.4585318565368652, 'logps/chosen': -86.42012023925781, 'logps/rejected': -171.57260131835938, 'logps/ref_chosen': -53.279266357421875, 'logps/ref_rejected': -89.96464538574219, 'logits/chosen': -0.8914802670478821, 'logits/rejected': -0.8997035622596741, 
'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.030230149626731873, 'epsilon_dpo/loss_margin_mean': 48.46710205078125, 'epsilon_dpo/beta_margin_mean': 1.4585318565368652, 'epsilon_dpo/beta_margin_std': 1.5505485534667969, 'epsilon_dpo/beta_margin_grad_mean': -0.2637866735458374, 'epsilon_dpo/beta_margin_grad_std': 0.1917775273323059, 'kl/beta': 0.03041723370552063, 'kl/avg_steps': 0.625, 'epoch': 0.32} 32%|████████████████████████▋ | 215/681 [15:44<26:43, 3.44s/it] 32%|████████████████████████▋ | 216/681 [15:47<25:20, 3.27s/it] {'loss': 0.7448, 'grad_norm': 47.91659927368164, 'learning_rate': 4.330133748510036e-07, 'rewards/chosen': -1.115185260772705, 'rewards/rejected': -2.6599624156951904, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.5447771549224854, 'logps/chosen': -85.96902465820312, 'logps/rejected': -165.9541473388672, 'logps/ref_chosen': -48.887794494628906, 'logps/ref_rejected': -77.19892883300781, 'logits/chosen': -0.8279985189437866, 'logits/rejected': -0.7683322429656982, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.030023491010069847, 'epsilon_dpo/loss_margin_mean': 51.67399215698242, 'epsilon_dpo/beta_margin_mean': 1.5447771549224854, 'epsilon_dpo/beta_margin_std': 1.6737757921218872, 'epsilon_dpo/beta_margin_grad_mean': -0.26890814304351807, 'epsilon_dpo/beta_margin_grad_std': 0.2200879454612732, 'kl/beta': 0.030228307470679283, 'kl/avg_steps': 0.6875, 'epoch': 0.32} 32%|████████████████████████▋ | 216/681 [15:47<25:20, 3.27s/it] 32%|████████████████████████▊ | 217/681 [15:50<23:59, 3.10s/it] {'loss': 0.6145, 'grad_norm': 41.643062591552734, 'learning_rate': 4.3213670227794757e-07, 'rewards/chosen': -0.9862960577011108, 'rewards/rejected': -2.625056743621826, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.6387605667114258, 'logps/chosen': -82.93193054199219, 'logps/rejected': -188.33164978027344, 'logps/ref_chosen': -49.845306396484375, 'logps/ref_rejected': 
-100.07832336425781, 'logits/chosen': -0.8984875082969666, 'logits/rejected': -0.8015980124473572, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.029762189835309982, 'epsilon_dpo/loss_margin_mean': 55.16669845581055, 'epsilon_dpo/beta_margin_mean': 1.6387605667114258, 'epsilon_dpo/beta_margin_std': 1.45741868019104, 'epsilon_dpo/beta_margin_grad_mean': -0.23406146466732025, 'epsilon_dpo/beta_margin_grad_std': 0.18837563693523407, 'kl/beta': 0.03002190589904785, 'kl/avg_steps': 0.875, 'epoch': 0.32} 32%|████████████████████████▊ | 217/681 [15:50<23:59, 3.10s/it] 32%|████████████████████████▉ | 218/681 [15:52<23:23, 3.03s/it] {'loss': 0.7488, 'grad_norm': 42.43049621582031, 'learning_rate': 4.3125523023339815e-07, 'rewards/chosen': -1.0180344581604004, 'rewards/rejected': -2.4033541679382324, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.385319709777832, 'logps/chosen': -92.91569519042969, 'logps/rejected': -169.2481689453125, 'logps/ref_chosen': -58.576683044433594, 'logps/ref_rejected': -87.84639739990234, 'logits/chosen': -0.925573468208313, 'logits/rejected': -0.8360555768013, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.029578441753983498, 'epsilon_dpo/loss_margin_mean': 47.062767028808594, 'epsilon_dpo/beta_margin_mean': 1.385319709777832, 'epsilon_dpo/beta_margin_std': 1.5074750185012817, 'epsilon_dpo/beta_margin_grad_mean': -0.2751648426055908, 'epsilon_dpo/beta_margin_grad_std': 0.20698566734790802, 'kl/beta': 0.02976149320602417, 'kl/avg_steps': 0.625, 'epoch': 0.32} 32%|████████████████████████▉ | 218/681 [15:52<23:23, 3.03s/it] 32%|█████████████████████████ | 219/681 [15:55<22:43, 2.95s/it] {'loss': 0.7962, 'grad_norm': 42.281761169433594, 'learning_rate': 4.303689819449636e-07, 'rewards/chosen': -1.0273648500442505, 'rewards/rejected': -2.195300817489624, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.167935848236084, 'logps/chosen': -95.96543884277344, 'logps/rejected': 
-160.62933349609375, 'logps/ref_chosen': -61.083858489990234, 'logps/ref_rejected': -85.83042907714844, 'logits/chosen': -0.8818705677986145, 'logits/rejected': -0.8401570320129395, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.029385482892394066, 'epsilon_dpo/loss_margin_mean': 39.917320251464844, 'epsilon_dpo/beta_margin_mean': 1.1679359674453735, 'epsilon_dpo/beta_margin_std': 1.2931469678878784, 'epsilon_dpo/beta_margin_grad_mean': -0.29207971692085266, 'epsilon_dpo/beta_margin_grad_std': 0.18771642446517944, 'kl/beta': 0.02957664057612419, 'kl/avg_steps': 0.65625, 'epoch': 0.32} 32%|█████████████████████████ | 219/681 [15:55<22:43, 2.95s/it] 32%|█████████████████████████▏ | 220/681 [15:58<22:10, 2.89s/it] {'loss': 0.7876, 'grad_norm': 40.59983444213867, 'learning_rate': 4.2947798076611047e-07, 'rewards/chosen': -1.0686886310577393, 'rewards/rejected': -2.1185829639434814, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.0498943328857422, 'logps/chosen': -106.59527587890625, 'logps/rejected': -160.37210083007812, 'logps/ref_chosen': -70.03128051757812, 'logps/ref_rejected': -87.68551635742188, 'logits/chosen': -0.9992510080337524, 'logits/rejected': -0.8547923564910889, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.029193896800279617, 'epsilon_dpo/loss_margin_mean': 36.12261199951172, 'epsilon_dpo/beta_margin_mean': 1.0498942136764526, 'epsilon_dpo/beta_margin_std': 1.0387635231018066, 'epsilon_dpo/beta_margin_grad_mean': -0.3001166582107544, 'epsilon_dpo/beta_margin_grad_std': 0.17635947465896606, 'kl/beta': 0.029383808374404907, 'kl/avg_steps': 0.65625, 'epoch': 0.32} 32%|█████████████████████████▏ | 220/681 [15:58<22:10, 2.89s/it] 32%|█████████████████████████▎ | 221/681 [16:01<21:38, 2.82s/it] {'loss': 0.5239, 'grad_norm': 35.39186096191406, 'learning_rate': 4.285822501755485e-07, 'rewards/chosen': -1.0850903987884521, 'rewards/rejected': -2.928849697113037, 
'rewards/accuracies': 0.9375, 'rewards/margins': 1.843759298324585, 'logps/chosen': -89.57997131347656, 'logps/rejected': -207.70077514648438, 'logps/ref_chosen': -52.15470886230469, 'logps/ref_rejected': -106.46768188476562, 'logits/chosen': -0.8663440346717834, 'logits/rejected': -0.9057981967926025, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.028957944363355637, 'epsilon_dpo/loss_margin_mean': 63.807823181152344, 'epsilon_dpo/beta_margin_mean': 1.8437591791152954, 'epsilon_dpo/beta_margin_std': 1.4009723663330078, 'epsilon_dpo/beta_margin_grad_mean': -0.20571331679821014, 'epsilon_dpo/beta_margin_grad_std': 0.17811840772628784, 'kl/beta': 0.029192235320806503, 'kl/avg_steps': 0.8125, 'epoch': 0.32} 32%|█████████████████████████▎ | 221/681 [16:01<21:38, 2.82s/it] 33%|█████████████████████████▍ | 222/681 [16:03<21:24, 2.80s/it] {'loss': 0.7478, 'grad_norm': 45.9877815246582, 'learning_rate': 4.276818137766118e-07, 'rewards/chosen': -1.1236631870269775, 'rewards/rejected': -2.51025390625, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.3865907192230225, 'logps/chosen': -99.87564086914062, 'logps/rejected': -187.33274841308594, 'logps/ref_chosen': -60.971099853515625, 'logps/ref_rejected': -100.00115203857422, 'logits/chosen': -0.951972246170044, 'logits/rejected': -0.9163509607315063, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0287607554346323, 'epsilon_dpo/loss_margin_mean': 48.42705154418945, 'epsilon_dpo/beta_margin_mean': 1.3865907192230225, 'epsilon_dpo/beta_margin_std': 1.4618659019470215, 'epsilon_dpo/beta_margin_grad_mean': -0.27154117822647095, 'epsilon_dpo/beta_margin_grad_std': 0.20897042751312256, 'kl/beta': 0.028956959024071693, 'kl/avg_steps': 0.6875, 'epoch': 0.33} 33%|█████████████████████████▍ | 222/681 [16:03<21:24, 2.80s/it] 33%|█████████████████████████▌ | 223/681 [16:06<20:23, 2.67s/it] {'loss': 0.8655, 'grad_norm': 52.527828216552734, 'learning_rate': 
4.2677669529663686e-07, 'rewards/chosen': -1.2237632274627686, 'rewards/rejected': -2.423802375793457, 'rewards/accuracies': 0.765625, 'rewards/margins': 1.2000389099121094, 'logps/chosen': -95.27392578125, 'logps/rejected': -167.70492553710938, 'logps/ref_chosen': -52.64057922363281, 'logps/ref_rejected': -82.82502746582031, 'logits/chosen': -0.9274756908416748, 'logits/rejected': -0.7958655953407288, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.028600329533219337, 'epsilon_dpo/loss_margin_mean': 42.24653625488281, 'epsilon_dpo/beta_margin_mean': 1.200039029121399, 'epsilon_dpo/beta_margin_std': 1.4345189332962036, 'epsilon_dpo/beta_margin_grad_mean': -0.2979002594947815, 'epsilon_dpo/beta_margin_grad_std': 0.23002079129219055, 'kl/beta': 0.028759239241480827, 'kl/avg_steps': 0.5625, 'epoch': 0.33} 33%|█████████████████████████▌ | 223/681 [16:06<20:23, 2.67s/it] 33%|█████████████████████████▋ | 224/681 [16:08<19:19, 2.54s/it] {'loss': 0.7383, 'grad_norm': 43.929931640625, 'learning_rate': 4.2586691858633747e-07, 'rewards/chosen': -1.002156376838684, 'rewards/rejected': -2.46635103225708, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.4641945362091064, 'logps/chosen': -83.71321105957031, 'logps/rejected': -163.98226928710938, 'logps/ref_chosen': -48.59540939331055, 'logps/ref_rejected': -77.11648559570312, 'logits/chosen': -0.9127311706542969, 'logits/rejected': -0.8163399696350098, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.02845822647213936, 'epsilon_dpo/loss_margin_mean': 51.74797821044922, 'epsilon_dpo/beta_margin_mean': 1.4641945362091064, 'epsilon_dpo/beta_margin_std': 1.5582221746444702, 'epsilon_dpo/beta_margin_grad_mean': -0.26928919553756714, 'epsilon_dpo/beta_margin_grad_std': 0.2087675780057907, 'kl/beta': 0.028598373755812645, 'kl/avg_steps': 0.5, 'epoch': 0.33} 33%|█████████████████████████▋ | 224/681 [16:08<19:19, 2.54s/it] 33%|█████████████████████████▊ | 225/681 
[16:10<18:59, 2.50s/it] {'loss': 0.6025, 'grad_norm': 35.99250411987305, 'learning_rate': 4.249525076191759e-07, 'rewards/chosen': -1.1272399425506592, 'rewards/rejected': -2.882115602493286, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.754875659942627, 'logps/chosen': -97.8477783203125, 'logps/rejected': -202.06344604492188, 'logps/ref_chosen': -58.000465393066406, 'logps/ref_rejected': -99.90290832519531, 'logits/chosen': -0.9548609256744385, 'logits/rejected': -0.9496945142745972, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.028254389762878418, 'epsilon_dpo/loss_margin_mean': 62.31322479248047, 'epsilon_dpo/beta_margin_mean': 1.7548757791519165, 'epsilon_dpo/beta_margin_std': 1.5744292736053467, 'epsilon_dpo/beta_margin_grad_mean': -0.22889071702957153, 'epsilon_dpo/beta_margin_grad_std': 0.1976906806230545, 'kl/beta': 0.02845609374344349, 'kl/avg_steps': 0.71875, 'epoch': 0.33} 33%|█████████████████████████▊ | 225/681 [16:10<18:59, 2.50s/it] 33%|█████████████████████████▉ | 226/681 [16:13<20:01, 2.64s/it] {'loss': 0.7437, 'grad_norm': 39.40699768066406, 'learning_rate': 4.2403348649073167e-07, 'rewards/chosen': -0.9137367010116577, 'rewards/rejected': -2.2124862670898438, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.2987498044967651, 'logps/chosen': -91.37918090820312, 'logps/rejected': -157.63363647460938, 'logps/ref_chosen': -58.898799896240234, 'logps/ref_rejected': -78.68775939941406, 'logits/chosen': -0.9883425831794739, 'logits/rejected': -0.858059823513031, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.02807479165494442, 'epsilon_dpo/loss_margin_mean': 46.46550369262695, 'epsilon_dpo/beta_margin_mean': 1.2987498044967651, 'epsilon_dpo/beta_margin_std': 1.3646435737609863, 'epsilon_dpo/beta_margin_grad_mean': -0.2813953459262848, 'epsilon_dpo/beta_margin_grad_std': 0.18881773948669434, 'kl/beta': 0.02825302444398403, 'kl/avg_steps': 0.640625, 'epoch': 0.33} 
33%|█████████████████████████▉ | 226/681 [16:13<20:01, 2.64s/it] 33%|██████████████████████████ | 227/681 [16:16<19:37, 2.59s/it] {'loss': 0.6364, 'grad_norm': 36.23415756225586, 'learning_rate': 4.2310987941806615e-07, 'rewards/chosen': -1.0061688423156738, 'rewards/rejected': -2.6720492839813232, 'rewards/accuracies': 0.875, 'rewards/margins': 1.6658804416656494, 'logps/chosen': -95.04913330078125, 'logps/rejected': -195.35543823242188, 'logps/ref_chosen': -59.072181701660156, 'logps/ref_rejected': -99.41236877441406, 'logits/chosen': -0.9570825099945068, 'logits/rejected': -0.9183826446533203, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.027891740202903748, 'epsilon_dpo/loss_margin_mean': 59.96611404418945, 'epsilon_dpo/beta_margin_mean': 1.6658804416656494, 'epsilon_dpo/beta_margin_std': 1.555821180343628, 'epsilon_dpo/beta_margin_grad_mean': -0.2355383336544037, 'epsilon_dpo/beta_margin_grad_std': 0.19523394107818604, 'kl/beta': 0.028073180466890335, 'kl/avg_steps': 0.65625, 'epoch': 0.33} 33%|██████████████████████████ | 227/681 [16:16<19:37, 2.59s/it] 33%|██████████████████████████ | 228/681 [16:19<20:29, 2.71s/it] {'loss': 0.7764, 'grad_norm': 37.50095748901367, 'learning_rate': 4.2218171073908463e-07, 'rewards/chosen': -0.9902582168579102, 'rewards/rejected': -2.1967992782592773, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.2065410614013672, 'logps/chosen': -101.53741455078125, 'logps/rejected': -170.43927001953125, 'logps/ref_chosen': -65.89129638671875, 'logps/ref_rejected': -91.04875183105469, 'logits/chosen': -0.9429744482040405, 'logits/rejected': -0.9067270755767822, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.027744758874177933, 'epsilon_dpo/loss_margin_mean': 43.744407653808594, 'epsilon_dpo/beta_margin_mean': 1.2065411806106567, 'epsilon_dpo/beta_margin_std': 1.2601219415664673, 'epsilon_dpo/beta_margin_grad_mean': -0.2855328619480133, 
'epsilon_dpo/beta_margin_grad_std': 0.19863800704479218, 'kl/beta': 0.027890151366591454, 'kl/avg_steps': 0.53125, 'epoch': 0.33} 33%|██████████████████████████ | 228/681 [16:19<20:29, 2.71s/it] 34%|██████████████████████████▏ | 229/681 [16:21<20:07, 2.67s/it] {'loss': 0.7629, 'grad_norm': 38.05533981323242, 'learning_rate': 4.212490049118951e-07, 'rewards/chosen': -1.0388529300689697, 'rewards/rejected': -2.423427104949951, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.3845741748809814, 'logps/chosen': -108.31786346435547, 'logps/rejected': -172.60848999023438, 'logps/ref_chosen': -70.70636749267578, 'logps/ref_rejected': -84.52740478515625, 'logits/chosen': -0.963711142539978, 'logits/rejected': -0.8549022078514099, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.027563462033867836, 'epsilon_dpo/loss_margin_mean': 50.469581604003906, 'epsilon_dpo/beta_margin_mean': 1.3845741748809814, 'epsilon_dpo/beta_margin_std': 1.4678245782852173, 'epsilon_dpo/beta_margin_grad_mean': -0.2710328698158264, 'epsilon_dpo/beta_margin_grad_std': 0.2131662368774414, 'kl/beta': 0.027742767706513405, 'kl/avg_steps': 0.65625, 'epoch': 0.34} 34%|██████████████████████████▏ | 229/681 [16:21<20:07, 2.67s/it] 34%|██████████████████████████▎ | 230/681 [16:24<19:44, 2.63s/it] {'loss': 0.6508, 'grad_norm': 33.867340087890625, 'learning_rate': 4.203117865141635e-07, 'rewards/chosen': -0.8789701461791992, 'rewards/rejected': -2.5428528785705566, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.6638827323913574, 'logps/chosen': -71.353759765625, 'logps/rejected': -178.71905517578125, 'logps/ref_chosen': -39.282005310058594, 'logps/ref_rejected': -85.62191009521484, 'logits/chosen': -0.8555896282196045, 'logits/rejected': -0.8511315584182739, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0273579154163599, 'epsilon_dpo/loss_margin_mean': 61.025386810302734, 'epsilon_dpo/beta_margin_mean': 1.6638827323913574, 
'epsilon_dpo/beta_margin_std': 1.6286611557006836, 'epsilon_dpo/beta_margin_grad_mean': -0.23891520500183105, 'epsilon_dpo/beta_margin_grad_std': 0.1923268884420395, 'kl/beta': 0.027561893686652184, 'kl/avg_steps': 0.75, 'epoch': 0.34} 34%|██████████████████████████▎ | 230/681 [16:24<19:44, 2.63s/it] 34%|██████████████████████████▍ | 231/681 [16:27<19:58, 2.66s/it] {'loss': 0.7175, 'grad_norm': 30.824636459350586, 'learning_rate': 4.1937008024246625e-07, 'rewards/chosen': -0.8553760051727295, 'rewards/rejected': -2.1662395000457764, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.3108634948730469, 'logps/chosen': -94.687255859375, 'logps/rejected': -153.92933654785156, 'logps/ref_chosen': -63.27644348144531, 'logps/ref_rejected': -74.1239013671875, 'logits/chosen': -0.9609138369560242, 'logits/rejected': -0.8604573607444763, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0271841399371624, 'epsilon_dpo/loss_margin_mean': 48.394615173339844, 'epsilon_dpo/beta_margin_mean': 1.3108634948730469, 'epsilon_dpo/beta_margin_std': 1.3848408460617065, 'epsilon_dpo/beta_margin_grad_mean': -0.2801465094089508, 'epsilon_dpo/beta_margin_grad_std': 0.17204877734184265, 'kl/beta': 0.027356717735528946, 'kl/avg_steps': 0.640625, 'epoch': 0.34} 34%|██████████████████████████▍ | 231/681 [16:27<19:58, 2.66s/it] 34%|██████████████████████████▌ | 232/681 [16:30<20:37, 2.76s/it] {'loss': 0.8942, 'grad_norm': 41.94512939453125, 'learning_rate': 4.1842391091163933e-07, 'rewards/chosen': -1.068268060684204, 'rewards/rejected': -2.098348617553711, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.0300805568695068, 'logps/chosen': -110.14714813232422, 'logps/rejected': -161.71157836914062, 'logps/ref_chosen': -70.74876403808594, 'logps/ref_rejected': -83.97706604003906, 'logits/chosen': -0.9774780869483948, 'logits/rejected': -0.7848711013793945, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.027032380923628807, 
'epsilon_dpo/loss_margin_mean': 38.336116790771484, 'epsilon_dpo/beta_margin_mean': 1.0300805568695068, 'epsilon_dpo/beta_margin_std': 1.3635324239730835, 'epsilon_dpo/beta_margin_grad_mean': -0.32459476590156555, 'epsilon_dpo/beta_margin_grad_std': 0.2017158418893814, 'kl/beta': 0.027182579040527344, 'kl/avg_steps': 0.5625, 'epoch': 0.34} 34%|██████████████████████████▌ | 232/681 [16:30<20:37, 2.76s/it] 34%|██████████████████████████▋ | 233/681 [16:32<20:57, 2.81s/it] {'loss': 0.7647, 'grad_norm': 36.417903900146484, 'learning_rate': 4.174733034541245e-07, 'rewards/chosen': -0.9597580432891846, 'rewards/rejected': -2.471172332763672, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.5114142894744873, 'logps/chosen': -90.49263763427734, 'logps/rejected': -199.63160705566406, 'logps/ref_chosen': -54.8829345703125, 'logps/ref_rejected': -107.48007202148438, 'logits/chosen': -0.9822086095809937, 'logits/rejected': -0.9409425258636475, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0268896222114563, 'epsilon_dpo/loss_margin_mean': 56.54182815551758, 'epsilon_dpo/beta_margin_mean': 1.5114142894744873, 'epsilon_dpo/beta_margin_std': 1.7114201784133911, 'epsilon_dpo/beta_margin_grad_mean': -0.27420952916145325, 'epsilon_dpo/beta_margin_grad_std': 0.22176600992679596, 'kl/beta': 0.02703053317964077, 'kl/avg_steps': 0.53125, 'epoch': 0.34} 34%|██████████████████████████▋ | 233/681 [16:33<20:57, 2.81s/it] 34%|██████████████████████████▊ | 234/681 [16:35<21:01, 2.82s/it] {'loss': 0.6489, 'grad_norm': 34.17942428588867, 'learning_rate': 4.165182829193126e-07, 'rewards/chosen': -0.8547545671463013, 'rewards/rejected': -2.3802549839019775, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.5255002975463867, 'logps/chosen': -76.05662536621094, 'logps/rejected': -189.32183837890625, 'logps/ref_chosen': -44.09451675415039, 'logps/ref_rejected': -100.00663757324219, 'logits/chosen': -0.9014260768890381, 'logits/rejected': 
-0.9556354284286499, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.026680300012230873, 'epsilon_dpo/loss_margin_mean': 57.353092193603516, 'epsilon_dpo/beta_margin_mean': 1.5255002975463867, 'epsilon_dpo/beta_margin_std': 1.3624063730239868, 'epsilon_dpo/beta_margin_grad_mean': -0.23809513449668884, 'epsilon_dpo/beta_margin_grad_std': 0.18818014860153198, 'kl/beta': 0.026887692511081696, 'kl/avg_steps': 0.78125, 'epoch': 0.34} 34%|██████████████████████████▊ | 234/681 [16:35<21:01, 2.82s/it] 35%|██████████████████████████▉ | 235/681 [16:38<20:27, 2.75s/it] {'loss': 0.8567, 'grad_norm': 47.0859489440918, 'learning_rate': 4.1555887447288255e-07, 'rewards/chosen': -1.1885778903961182, 'rewards/rejected': -2.308373212814331, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.119795322418213, 'logps/chosen': -106.87683868408203, 'logps/rejected': -177.4877166748047, 'logps/ref_chosen': -62.237911224365234, 'logps/ref_rejected': -90.39505767822266, 'logits/chosen': -0.9289396405220032, 'logits/rejected': -0.8603818416595459, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.026548519730567932, 'epsilon_dpo/loss_margin_mean': 42.453731536865234, 'epsilon_dpo/beta_margin_mean': 1.119795322418213, 'epsilon_dpo/beta_margin_std': 1.4032968282699585, 'epsilon_dpo/beta_margin_grad_mean': -0.3122892677783966, 'epsilon_dpo/beta_margin_grad_std': 0.2024298459291458, 'kl/beta': 0.026679260656237602, 'kl/avg_steps': 0.5, 'epoch': 0.35} 35%|██████████████████████████▉ | 235/681 [16:38<20:27, 2.75s/it] 35%|███████████████████████████ | 236/681 [16:41<20:39, 2.79s/it] {'loss': 0.6821, 'grad_norm': 44.04022216796875, 'learning_rate': 4.1459510339613946e-07, 'rewards/chosen': -0.8701371550559998, 'rewards/rejected': -2.265639543533325, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.3955023288726807, 'logps/chosen': -82.28189086914062, 'logps/rejected': -189.5817108154297, 'logps/ref_chosen': -49.34136199951172, 
'logps/ref_rejected': -103.51162719726562, 'logits/chosen': -0.8796597719192505, 'logits/rejected': -0.9121089577674866, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.026354169473052025, 'epsilon_dpo/loss_margin_mean': 53.129547119140625, 'epsilon_dpo/beta_margin_mean': 1.3955023288726807, 'epsilon_dpo/beta_margin_std': 1.2780405282974243, 'epsilon_dpo/beta_margin_grad_mean': -0.26097211241722107, 'epsilon_dpo/beta_margin_grad_std': 0.18908995389938354, 'kl/beta': 0.026546526700258255, 'kl/avg_steps': 0.734375, 'epoch': 0.35} 35%|███████████████████████████ | 236/681 [16:41<20:39, 2.79s/it] 35%|███████████████████████████▏ | 237/681 [16:44<20:50, 2.82s/it] {'loss': 0.7318, 'grad_norm': 43.82939910888672, 'learning_rate': 4.136269950853473e-07, 'rewards/chosen': -1.0194590091705322, 'rewards/rejected': -2.3822340965270996, 'rewards/accuracies': 0.875, 'rewards/margins': 1.3627753257751465, 'logps/chosen': -93.03535461425781, 'logps/rejected': -185.932373046875, 'logps/ref_chosen': -54.168121337890625, 'logps/ref_rejected': -94.78036499023438, 'logits/chosen': -0.8822282552719116, 'logits/rejected': -0.8408148288726807, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.02617851458489895, 'epsilon_dpo/loss_margin_mean': 52.28477478027344, 'epsilon_dpo/beta_margin_mean': 1.3627753257751465, 'epsilon_dpo/beta_margin_std': 1.378995656967163, 'epsilon_dpo/beta_margin_grad_mean': -0.2683834135532379, 'epsilon_dpo/beta_margin_grad_std': 0.1993410587310791, 'kl/beta': 0.02635299786925316, 'kl/avg_steps': 0.671875, 'epoch': 0.35} 35%|███████████████████████████▏ | 237/681 [16:44<20:50, 2.82s/it] 35%|███████████████████████████▎ | 238/681 [16:47<20:47, 2.82s/it] {'loss': 0.7486, 'grad_norm': 48.45499801635742, 'learning_rate': 4.126545750510605e-07, 'rewards/chosen': -0.999852180480957, 'rewards/rejected': -2.1958298683166504, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.1959774494171143, 
'logps/chosen': -92.368408203125, 'logps/rejected': -173.97970581054688, 'logps/ref_chosen': -53.973121643066406, 'logps/ref_rejected': -89.41795349121094, 'logits/chosen': -0.8099647760391235, 'logits/rejected': -0.8988782167434692, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.02600793167948723, 'epsilon_dpo/loss_margin_mean': 46.166481018066406, 'epsilon_dpo/beta_margin_mean': 1.1959774494171143, 'epsilon_dpo/beta_margin_std': 1.265442967414856, 'epsilon_dpo/beta_margin_grad_mean': -0.2891097664833069, 'epsilon_dpo/beta_margin_grad_std': 0.17171606421470642, 'kl/beta': 0.026177119463682175, 'kl/avg_steps': 0.65625, 'epoch': 0.35} 35%|███████████████████████████▎ | 238/681 [16:47<20:47, 2.82s/it] 35%|███████████████████████████▎ | 239/681 [16:49<19:59, 2.71s/it] {'loss': 0.7085, 'grad_norm': 49.931846618652344, 'learning_rate': 4.116778689174514e-07, 'rewards/chosen': -1.0936906337738037, 'rewards/rejected': -2.446521759033203, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.3528313636779785, 'logps/chosen': -100.35231018066406, 'logps/rejected': -188.42718505859375, 'logps/ref_chosen': -58.09782409667969, 'logps/ref_rejected': -93.59294128417969, 'logits/chosen': -0.8864554166793823, 'logits/rejected': -0.8159253597259521, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.025830240920186043, 'epsilon_dpo/loss_margin_mean': 52.579769134521484, 'epsilon_dpo/beta_margin_mean': 1.352831244468689, 'epsilon_dpo/beta_margin_std': 1.3413230180740356, 'epsilon_dpo/beta_margin_grad_mean': -0.27064305543899536, 'epsilon_dpo/beta_margin_grad_std': 0.18519093096256256, 'kl/beta': 0.02600645273923874, 'kl/avg_steps': 0.6875, 'epoch': 0.35} 35%|███████████████████████████▎ | 239/681 [16:49<19:59, 2.71s/it] 35%|███████████████████████████▍ | 240/681 [16:52<19:59, 2.72s/it] {'loss': 0.7647, 'grad_norm': 36.69154739379883, 'learning_rate': 4.106969024216348e-07, 'rewards/chosen': -1.100210189819336, 
'rewards/rejected': -2.3409833908081055, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.2407732009887695, 'logps/chosen': -103.43043518066406, 'logps/rejected': -165.52423095703125, 'logps/ref_chosen': -60.6144905090332, 'logps/ref_rejected': -74.1185302734375, 'logits/chosen': -0.861579954624176, 'logits/rejected': -0.8070861101150513, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.025670016184449196, 'epsilon_dpo/loss_margin_mean': 48.58975601196289, 'epsilon_dpo/beta_margin_mean': 1.24077308177948, 'epsilon_dpo/beta_margin_std': 1.2659971714019775, 'epsilon_dpo/beta_margin_grad_mean': -0.28202009201049805, 'epsilon_dpo/beta_margin_grad_std': 0.2018457055091858, 'kl/beta': 0.025828879326581955, 'kl/avg_steps': 0.625, 'epoch': 0.35} 35%|███████████████████████████▍ | 240/681 [16:52<19:59, 2.72s/it] 35%|███████████████████████████▌ | 241/681 [16:54<19:33, 2.67s/it] {'loss': 0.6398, 'grad_norm': 38.793495178222656, 'learning_rate': 4.097117014129903e-07, 'rewards/chosen': -0.9906549453735352, 'rewards/rejected': -2.7133307456970215, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.7226760387420654, 'logps/chosen': -104.88059997558594, 'logps/rejected': -194.681396484375, 'logps/ref_chosen': -66.091064453125, 'logps/ref_rejected': -88.06088256835938, 'logits/chosen': -0.8475504517555237, 'logits/rejected': -0.7058205604553223, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.025486508384346962, 'epsilon_dpo/loss_margin_mean': 67.83097076416016, 'epsilon_dpo/beta_margin_mean': 1.7226759195327759, 'epsilon_dpo/beta_margin_std': 1.789014458656311, 'epsilon_dpo/beta_margin_grad_mean': -0.2389088273048401, 'epsilon_dpo/beta_margin_grad_std': 0.1938583105802536, 'kl/beta': 0.025668451562523842, 'kl/avg_steps': 0.71875, 'epoch': 0.35} 35%|███████████████████████████▌ | 241/681 [16:54<19:33, 2.67s/it] 36%|███████████████████████████▋ | 242/681 [16:57<19:17, 2.64s/it] {'loss': 0.7818, 
'grad_norm': 69.306884765625, 'learning_rate': 4.087222918524807e-07, 'rewards/chosen': -1.1993852853775024, 'rewards/rejected': -2.3545730113983154, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.1551876068115234, 'logps/chosen': -115.20914459228516, 'logps/rejected': -176.51649475097656, 'logps/ref_chosen': -67.86392211914062, 'logps/ref_rejected': -83.36033630371094, 'logits/chosen': -0.7396622896194458, 'logits/rejected': -0.7340766191482544, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.025296665728092194, 'epsilon_dpo/loss_margin_mean': 45.810943603515625, 'epsilon_dpo/beta_margin_mean': 1.155187726020813, 'epsilon_dpo/beta_margin_std': 1.2903164625167847, 'epsilon_dpo/beta_margin_grad_mean': -0.2969255745410919, 'epsilon_dpo/beta_margin_grad_std': 0.1766965687274933, 'kl/beta': 0.02548527531325817, 'kl/avg_steps': 0.75, 'epoch': 0.36} 36%|███████████████████████████▋ | 242/681 [16:57<19:17, 2.64s/it] 36%|███████████████████████████▊ | 243/681 [17:00<19:22, 2.65s/it] {'loss': 0.5833, 'grad_norm': 44.07530975341797, 'learning_rate': 4.07728699811968e-07, 'rewards/chosen': -1.2278523445129395, 'rewards/rejected': -2.8480563163757324, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.6202038526535034, 'logps/chosen': -111.91984558105469, 'logps/rejected': -189.87693786621094, 'logps/ref_chosen': -63.08424377441406, 'logps/ref_rejected': -76.33563232421875, 'logits/chosen': -0.8574939966201782, 'logits/rejected': -0.7249910831451416, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.025108352303504944, 'epsilon_dpo/loss_margin_mean': 64.70569610595703, 'epsilon_dpo/beta_margin_mean': 1.6202038526535034, 'epsilon_dpo/beta_margin_std': 1.3733315467834473, 'epsilon_dpo/beta_margin_grad_mean': -0.23020483553409576, 'epsilon_dpo/beta_margin_grad_std': 0.17036239802837372, 'kl/beta': 0.02529555931687355, 'kl/avg_steps': 0.75, 'epoch': 0.36} 36%|███████████████████████████▊ | 243/681 [17:00<19:22, 
2.65s/it] 36%|███████████████████████████▉ | 244/681 [17:02<19:23, 2.66s/it] {'loss': 0.6755, 'grad_norm': 49.93527603149414, 'learning_rate': 4.067309514735267e-07, 'rewards/chosen': -1.093597650527954, 'rewards/rejected': -2.5280508995056152, 'rewards/accuracies': 0.875, 'rewards/margins': 1.4344532489776611, 'logps/chosen': -104.9389877319336, 'logps/rejected': -196.42117309570312, 'logps/ref_chosen': -61.14069366455078, 'logps/ref_rejected': -94.89193725585938, 'logits/chosen': -0.7641308903694153, 'logits/rejected': -0.7154449224472046, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.024944983422756195, 'epsilon_dpo/loss_margin_mean': 57.7309455871582, 'epsilon_dpo/beta_margin_mean': 1.4344532489776611, 'epsilon_dpo/beta_margin_std': 1.32624351978302, 'epsilon_dpo/beta_margin_grad_mean': -0.2581092119216919, 'epsilon_dpo/beta_margin_grad_std': 0.18795832991600037, 'kl/beta': 0.025107255205512047, 'kl/avg_steps': 0.65625, 'epoch': 0.36} 36%|███████████████████████████▉ | 244/681 [17:02<19:23, 2.66s/it] 36%|████████████████████████████ | 245/681 [17:05<19:59, 2.75s/it] {'loss': 0.7694, 'grad_norm': 54.38076400756836, 'learning_rate': 4.057290731287531e-07, 'rewards/chosen': -1.249393343925476, 'rewards/rejected': -2.72542667388916, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.4760334491729736, 'logps/chosen': -117.57855224609375, 'logps/rejected': -197.78199768066406, 'logps/ref_chosen': -67.26228332519531, 'logps/ref_rejected': -87.64010620117188, 'logits/chosen': -0.8543267250061035, 'logits/rejected': -0.7700981497764587, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.024797938764095306, 'epsilon_dpo/loss_margin_mean': 59.82563400268555, 'epsilon_dpo/beta_margin_mean': 1.4760335683822632, 'epsilon_dpo/beta_margin_std': 1.7903791666030884, 'epsilon_dpo/beta_margin_grad_mean': -0.277778685092926, 'epsilon_dpo/beta_margin_grad_std': 0.20800314843654633, 'kl/beta': 
0.024943562224507332, 'kl/avg_steps': 0.59375, 'epoch': 0.36} 36%|████████████████████████████ | 245/681 [17:05<19:59, 2.75s/it] 36%|████████████████████████████▏ | 246/681 [17:08<19:49, 2.73s/it] {'loss': 0.757, 'grad_norm': 41.48917007446289, 'learning_rate': 4.047230911780736e-07, 'rewards/chosen': -1.1463537216186523, 'rewards/rejected': -2.477789878845215, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.3314363956451416, 'logps/chosen': -113.2144546508789, 'logps/rejected': -185.14031982421875, 'logps/ref_chosen': -66.69696807861328, 'logps/ref_rejected': -84.34634399414062, 'logits/chosen': -0.8397436141967773, 'logits/rejected': -0.7311065196990967, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.024620573967695236, 'epsilon_dpo/loss_margin_mean': 54.27648162841797, 'epsilon_dpo/beta_margin_mean': 1.3314363956451416, 'epsilon_dpo/beta_margin_std': 1.5045994520187378, 'epsilon_dpo/beta_margin_grad_mean': -0.28294581174850464, 'epsilon_dpo/beta_margin_grad_std': 0.19259461760520935, 'kl/beta': 0.02479633502662182, 'kl/avg_steps': 0.71875, 'epoch': 0.36} 36%|████████████████████████████▏ | 246/681 [17:08<19:49, 2.73s/it] 36%|████████████████████████████▎ | 247/681 [17:10<19:30, 2.70s/it] {'loss': 0.5939, 'grad_norm': 43.01525115966797, 'learning_rate': 4.0371303213004814e-07, 'rewards/chosen': -1.3169013261795044, 'rewards/rejected': -3.2172365188598633, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.9003353118896484, 'logps/chosen': -110.41958618164062, 'logps/rejected': -238.10891723632812, 'logps/ref_chosen': -56.6053466796875, 'logps/ref_rejected': -106.29327392578125, 'logits/chosen': -0.7675491571426392, 'logits/rejected': -0.7368471622467041, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.024429485201835632, 'epsilon_dpo/loss_margin_mean': 78.00141143798828, 'epsilon_dpo/beta_margin_mean': 1.9003351926803589, 'epsilon_dpo/beta_margin_std': 1.767119288444519, 
'epsilon_dpo/beta_margin_grad_mean': -0.22065015137195587, 'epsilon_dpo/beta_margin_grad_std': 0.19751501083374023, 'kl/beta': 0.02461938187479973, 'kl/avg_steps': 0.78125, 'epoch': 0.36} 36%|████████████████████████████▎ | 247/681 [17:11<19:30, 2.70s/it] 36%|████████████████████████████▍ | 248/681 [17:13<19:18, 2.68s/it] {'loss': 0.6365, 'grad_norm': 49.93627166748047, 'learning_rate': 4.0269892260067197e-07, 'rewards/chosen': -1.0947335958480835, 'rewards/rejected': -2.455599784851074, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.3608663082122803, 'logps/chosen': -89.22013854980469, 'logps/rejected': -193.29786682128906, 'logps/ref_chosen': -44.043216705322266, 'logps/ref_rejected': -91.85687255859375, 'logits/chosen': -0.7783492803573608, 'logits/rejected': -0.7796909809112549, 'kl/p_epsilon_steps': 0.9375, 'kl/n_epsilon_steps': 0.0625, 'epsilon_dpo/beta': 0.024217206984758377, 'epsilon_dpo/loss_margin_mean': 56.26406478881836, 'epsilon_dpo/beta_margin_mean': 1.3608661890029907, 'epsilon_dpo/beta_margin_std': 1.123822569847107, 'epsilon_dpo/beta_margin_grad_mean': -0.2548842430114746, 'epsilon_dpo/beta_margin_grad_std': 0.15879112482070923, 'kl/beta': 0.024428535252809525, 'kl/avg_steps': 0.875, 'epoch': 0.36} 36%|████████████████████████████▍ | 248/681 [17:13<19:18, 2.68s/it] 37%|████████████████████████████▌ | 249/681 [17:16<19:21, 2.69s/it] {'loss': 0.9049, 'grad_norm': 63.952247619628906, 'learning_rate': 4.0168078931267426e-07, 'rewards/chosen': -1.2819901704788208, 'rewards/rejected': -2.3654775619506836, 'rewards/accuracies': 0.734375, 'rewards/margins': 1.0834875106811523, 'logps/chosen': -115.44468688964844, 'logps/rejected': -178.76162719726562, 'logps/ref_chosen': -62.442352294921875, 'logps/ref_rejected': -80.46806335449219, 'logits/chosen': -0.832280158996582, 'logits/rejected': -0.7321330308914185, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.02412823960185051, 'epsilon_dpo/loss_margin_mean': 
45.29121398925781, 'epsilon_dpo/beta_margin_mean': 1.0834875106811523, 'epsilon_dpo/beta_margin_std': 1.4188543558120728, 'epsilon_dpo/beta_margin_grad_mean': -0.3194911479949951, 'epsilon_dpo/beta_margin_grad_std': 0.2219369113445282, 'kl/beta': 0.02421663887798786, 'kl/avg_steps': 0.375, 'epoch': 0.37} 37%|████████████████████████████▌ | 249/681 [17:16<19:21, 2.69s/it] 37%|████████████████████████████▋ | 250/681 [17:18<18:56, 2.64s/it] {'loss': 0.6012, 'grad_norm': 42.29915237426758, 'learning_rate': 4.006586590948141e-07, 'rewards/chosen': -1.0348234176635742, 'rewards/rejected': -2.4846692085266113, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.449845790863037, 'logps/chosen': -108.88627624511719, 'logps/rejected': -177.83912658691406, 'logps/ref_chosen': -65.6366958618164, 'logps/ref_rejected': -73.87183380126953, 'logits/chosen': -0.8870829343795776, 'logits/rejected': -0.6928812265396118, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0239400751888752, 'epsilon_dpo/loss_margin_mean': 60.717708587646484, 'epsilon_dpo/beta_margin_mean': 1.4498459100723267, 'epsilon_dpo/beta_margin_std': 1.092609167098999, 'epsilon_dpo/beta_margin_grad_mean': -0.23697948455810547, 'epsilon_dpo/beta_margin_grad_std': 0.16869625449180603, 'kl/beta': 0.024126166477799416, 'kl/avg_steps': 0.78125, 'epoch': 0.37} 37%|████████████████████████████▋ | 250/681 [17:18<18:56, 2.64s/it] 37%|████████████████████████████▋ | 251/681 [17:21<18:38, 2.60s/it] {'loss': 0.7294, 'grad_norm': 38.71930694580078, 'learning_rate': 3.9963255888117325e-07, 'rewards/chosen': -1.3019583225250244, 'rewards/rejected': -2.610532283782959, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.3085737228393555, 'logps/chosen': -111.89410400390625, 'logps/rejected': -187.62496948242188, 'logps/ref_chosen': -57.182716369628906, 'logps/ref_rejected': -77.66343688964844, 'logits/chosen': -0.8276119232177734, 'logits/rejected': -0.7018730640411377, 'kl/p_epsilon_steps': 
0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.023791901767253876, 'epsilon_dpo/loss_margin_mean': 55.25014877319336, 'epsilon_dpo/beta_margin_mean': 1.3085737228393555, 'epsilon_dpo/beta_margin_std': 1.2904928922653198, 'epsilon_dpo/beta_margin_grad_mean': -0.2746790051460266, 'epsilon_dpo/beta_margin_grad_std': 0.19281445443630219, 'kl/beta': 0.023939142003655434, 'kl/avg_steps': 0.625, 'epoch': 0.37} 37%|████████████████████████████▋ | 251/681 [17:21<18:38, 2.60s/it] 37%|████████████████████████████▊ | 252/681 [17:24<19:04, 2.67s/it] {'loss': 0.6352, 'grad_norm': 54.85141372680664, 'learning_rate': 3.9860251571044666e-07, 'rewards/chosen': -1.2233631610870361, 'rewards/rejected': -2.670339584350586, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.4469765424728394, 'logps/chosen': -123.50946044921875, 'logps/rejected': -198.05880737304688, 'logps/ref_chosen': -71.68563842773438, 'logps/ref_rejected': -84.75798797607422, 'logits/chosen': -0.8772724866867065, 'logits/rejected': -0.7638910412788391, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.023599514737725258, 'epsilon_dpo/loss_margin_mean': 61.47700119018555, 'epsilon_dpo/beta_margin_mean': 1.4469765424728394, 'epsilon_dpo/beta_margin_std': 1.207213282585144, 'epsilon_dpo/beta_margin_grad_mean': -0.23693950474262238, 'epsilon_dpo/beta_margin_grad_std': 0.17138709127902985, 'kl/beta': 0.023790450766682625, 'kl/avg_steps': 0.8125, 'epoch': 0.37} 37%|████████████████████████████▊ | 252/681 [17:24<19:04, 2.67s/it] 37%|████████████████████████████▉ | 253/681 [17:26<19:15, 2.70s/it] {'loss': 0.8005, 'grad_norm': 51.151695251464844, 'learning_rate': 3.9756855672522986e-07, 'rewards/chosen': -1.0857230424880981, 'rewards/rejected': -2.3547239303588867, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.269000768661499, 'logps/chosen': -115.36244201660156, 'logps/rejected': -199.31797790527344, 'logps/ref_chosen': -69.13392639160156, 'logps/ref_rejected': 
-98.70252990722656, 'logits/chosen': -0.8559330701828003, 'logits/rejected': -0.8246597051620483, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.02346094138920307, 'epsilon_dpo/loss_margin_mean': 54.386932373046875, 'epsilon_dpo/beta_margin_mean': 1.269000768661499, 'epsilon_dpo/beta_margin_std': 1.4000648260116577, 'epsilon_dpo/beta_margin_grad_mean': -0.2852194905281067, 'epsilon_dpo/beta_margin_grad_std': 0.21233250200748444, 'kl/beta': 0.02359871193766594, 'kl/avg_steps': 0.59375, 'epoch': 0.37} 37%|████████████████████████████▉ | 253/681 [17:26<19:15, 2.70s/it] 37%|█████████████████████████████ | 254/681 [17:29<19:33, 2.75s/it] {'loss': 0.8367, 'grad_norm': 73.59979248046875, 'learning_rate': 3.965307091713037e-07, 'rewards/chosen': -1.280766487121582, 'rewards/rejected': -2.548877716064453, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.268110990524292, 'logps/chosen': -108.90068054199219, 'logps/rejected': -199.73941040039062, 'logps/ref_chosen': -54.154998779296875, 'logps/ref_rejected': -90.30764770507812, 'logits/chosen': -0.8435882329940796, 'logits/rejected': -0.7198815941810608, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.023337125778198242, 'epsilon_dpo/loss_margin_mean': 54.68608856201172, 'epsilon_dpo/beta_margin_mean': 1.268110990524292, 'epsilon_dpo/beta_margin_std': 1.560611367225647, 'epsilon_dpo/beta_margin_grad_mean': -0.29857712984085083, 'epsilon_dpo/beta_margin_grad_std': 0.21847309172153473, 'kl/beta': 0.0234594214707613, 'kl/avg_steps': 0.53125, 'epoch': 0.37} 37%|█████████████████████████████ | 254/681 [17:29<19:33, 2.75s/it] 37%|█████████████████████████████▏ | 255/681 [17:32<19:08, 2.70s/it] {'loss': 0.8796, 'grad_norm': 83.69367218017578, 'learning_rate': 3.954890003969163e-07, 'rewards/chosen': -1.417651891708374, 'rewards/rejected': -2.747748851776123, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.330096960067749, 'logps/chosen': 
-118.17829895019531, 'logps/rejected': -208.9357147216797, 'logps/ref_chosen': -57.14167022705078, 'logps/ref_rejected': -90.2085952758789, 'logits/chosen': -0.7430202960968018, 'logits/rejected': -0.6945962905883789, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.023177336901426315, 'epsilon_dpo/loss_margin_mean': 57.690494537353516, 'epsilon_dpo/beta_margin_mean': 1.3300968408584595, 'epsilon_dpo/beta_margin_std': 1.7163633108139038, 'epsilon_dpo/beta_margin_grad_mean': -0.296540766954422, 'epsilon_dpo/beta_margin_grad_std': 0.2287638783454895, 'kl/beta': 0.023335451260209084, 'kl/avg_steps': 0.6875, 'epoch': 0.37} 37%|█████████████████████████████▏ | 255/681 [17:32<19:08, 2.70s/it] 38%|█████████████████████████████▎ | 256/681 [17:35<18:59, 2.68s/it] {'loss': 0.7279, 'grad_norm': 53.05121994018555, 'learning_rate': 3.944434578520628e-07, 'rewards/chosen': -1.298081874847412, 'rewards/rejected': -2.729990005493164, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.4319078922271729, 'logps/chosen': -111.40750122070312, 'logps/rejected': -211.26126098632812, 'logps/ref_chosen': -55.163490295410156, 'logps/ref_rejected': -92.56291961669922, 'logits/chosen': -0.7182353734970093, 'logits/rejected': -0.665188729763031, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.023040810599923134, 'epsilon_dpo/loss_margin_mean': 62.45432662963867, 'epsilon_dpo/beta_margin_mean': 1.4319080114364624, 'epsilon_dpo/beta_margin_std': 1.4983351230621338, 'epsilon_dpo/beta_margin_grad_mean': -0.26910004019737244, 'epsilon_dpo/beta_margin_grad_std': 0.2048642784357071, 'kl/beta': 0.02317611500620842, 'kl/avg_steps': 0.59375, 'epoch': 0.38} 38%|█████████████████████████████▎ | 256/681 [17:35<18:59, 2.68s/it] 38%|█████████████████████████████▍ | 257/681 [17:37<19:07, 2.71s/it] {'loss': 0.7147, 'grad_norm': 67.98104095458984, 'learning_rate': 3.933941090877615e-07, 'rewards/chosen': -1.176638126373291, 
'rewards/rejected': -2.6955676078796387, 'rewards/accuracies': 0.78125, 'rewards/margins': 1.5189297199249268, 'logps/chosen': -100.78417205810547, 'logps/rejected': -197.52523803710938, 'logps/ref_chosen': -49.4236946105957, 'logps/ref_rejected': -79.53791809082031, 'logits/chosen': -0.7001844644546509, 'logits/rejected': -0.662509024143219, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.022919215261936188, 'epsilon_dpo/loss_margin_mean': 66.6268539428711, 'epsilon_dpo/beta_margin_mean': 1.5189297199249268, 'epsilon_dpo/beta_margin_std': 1.622308611869812, 'epsilon_dpo/beta_margin_grad_mean': -0.2688346207141876, 'epsilon_dpo/beta_margin_grad_std': 0.20320047438144684, 'kl/beta': 0.02303932048380375, 'kl/avg_steps': 0.53125, 'epoch': 0.38} 38%|█████████████████████████████▍ | 257/681 [17:37<19:07, 2.71s/it] 38%|█████████████████████████████▌ | 258/681 [17:40<18:19, 2.60s/it] {'loss': 0.8349, 'grad_norm': 54.61482620239258, 'learning_rate': 3.923409817553284e-07, 'rewards/chosen': -1.3097673654556274, 'rewards/rejected': -2.6058008670806885, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.296033501625061, 'logps/chosen': -116.78972625732422, 'logps/rejected': -210.6261749267578, 'logps/ref_chosen': -59.384124755859375, 'logps/ref_rejected': -95.9901123046875, 'logits/chosen': -0.7847526669502258, 'logits/rejected': -0.742351770401001, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.022769449278712273, 'epsilon_dpo/loss_margin_mean': 57.230464935302734, 'epsilon_dpo/beta_margin_mean': 1.296033501625061, 'epsilon_dpo/beta_margin_std': 1.511138677597046, 'epsilon_dpo/beta_margin_grad_mean': -0.27944010496139526, 'epsilon_dpo/beta_margin_grad_std': 0.2202530950307846, 'kl/beta': 0.022917570546269417, 'kl/avg_steps': 0.65625, 'epoch': 0.38} 38%|█████████████████████████████▌ | 258/681 [17:40<18:19, 2.60s/it] 38%|█████████████████████████████▋ | 259/681 [17:42<18:22, 2.61s/it] {'loss': 
0.7926, 'grad_norm': 43.3511848449707, 'learning_rate': 3.9128410360564793e-07, 'rewards/chosen': -1.1445544958114624, 'rewards/rejected': -2.342374324798584, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.197819709777832, 'logps/chosen': -103.28313446044922, 'logps/rejected': -192.8330078125, 'logps/ref_chosen': -52.828346252441406, 'logps/ref_rejected': -89.19165802001953, 'logits/chosen': -0.7879418134689331, 'logits/rejected': -0.7493730783462524, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0226423479616642, 'epsilon_dpo/loss_margin_mean': 53.186553955078125, 'epsilon_dpo/beta_margin_mean': 1.197819709777832, 'epsilon_dpo/beta_margin_std': 1.2998988628387451, 'epsilon_dpo/beta_margin_grad_mean': -0.29370635747909546, 'epsilon_dpo/beta_margin_grad_std': 0.1979660838842392, 'kl/beta': 0.022768154740333557, 'kl/avg_steps': 0.5625, 'epoch': 0.38} 38%|█████████████████████████████▋ | 259/681 [17:42<18:22, 2.61s/it] 38%|█████████████████████████████▊ | 260/681 [17:45<18:43, 2.67s/it] {'loss': 0.7334, 'grad_norm': 37.80736541748047, 'learning_rate': 3.9022350248844246e-07, 'rewards/chosen': -1.1006656885147095, 'rewards/rejected': -2.388970375061035, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.2883045673370361, 'logps/chosen': -96.36808776855469, 'logps/rejected': -201.5484161376953, 'logps/ref_chosen': -47.41767501831055, 'logps/ref_rejected': -95.08979034423828, 'logits/chosen': -0.7794230580329895, 'logits/rejected': -0.8244825601577759, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.02247324213385582, 'epsilon_dpo/loss_margin_mean': 57.508216857910156, 'epsilon_dpo/beta_margin_mean': 1.2883044481277466, 'epsilon_dpo/beta_margin_std': 1.362417221069336, 'epsilon_dpo/beta_margin_grad_mean': -0.28044986724853516, 'epsilon_dpo/beta_margin_grad_std': 0.1815163493156433, 'kl/beta': 0.02264080010354519, 'kl/avg_steps': 0.75, 'epoch': 0.38} 38%|█████████████████████████████▊ | 260/681 
[17:45<18:43, 2.67s/it] 38%|█████████████████████████████▉ | 261/681 [17:48<18:09, 2.59s/it] {'loss': 0.6733, 'grad_norm': 46.86500549316406, 'learning_rate': 3.891592063515376e-07, 'rewards/chosen': -0.8925292491912842, 'rewards/rejected': -2.2794415950775146, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.3869123458862305, 'logps/chosen': -92.98090362548828, 'logps/rejected': -190.8365478515625, 'logps/ref_chosen': -53.03137969970703, 'logps/ref_rejected': -88.51494598388672, 'logits/chosen': -0.8502944707870483, 'logits/rejected': -0.7972038984298706, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.022312970831990242, 'epsilon_dpo/loss_margin_mean': 62.372074127197266, 'epsilon_dpo/beta_margin_mean': 1.3869123458862305, 'epsilon_dpo/beta_margin_std': 1.3287155628204346, 'epsilon_dpo/beta_margin_grad_mean': -0.2600063979625702, 'epsilon_dpo/beta_margin_grad_std': 0.178856760263443, 'kl/beta': 0.022472258657217026, 'kl/avg_steps': 0.71875, 'epoch': 0.38} 38%|█████████████████████████████▉ | 261/681 [17:48<18:09, 2.59s/it] 38%|██████████████████████████████ | 262/681 [17:50<17:52, 2.56s/it] {'loss': 0.7602, 'grad_norm': 43.433895111083984, 'learning_rate': 3.880912432401264e-07, 'rewards/chosen': -1.0171130895614624, 'rewards/rejected': -2.149836540222168, 'rewards/accuracies': 0.875, 'rewards/margins': 1.1327235698699951, 'logps/chosen': -105.44173431396484, 'logps/rejected': -183.58560180664062, 'logps/ref_chosen': -59.620140075683594, 'logps/ref_rejected': -86.41853332519531, 'logits/chosen': -0.8100330233573914, 'logits/rejected': -0.6867516040802002, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.022153738886117935, 'epsilon_dpo/loss_margin_mean': 51.34547424316406, 'epsilon_dpo/beta_margin_mean': 1.1327235698699951, 'epsilon_dpo/beta_margin_std': 1.0738329887390137, 'epsilon_dpo/beta_margin_grad_mean': -0.28383615612983704, 'epsilon_dpo/beta_margin_grad_std': 
0.18365508317947388, 'kl/beta': 0.022311890497803688, 'kl/avg_steps': 0.71875, 'epoch': 0.38} 38%|██████████████████████████████ | 262/681 [17:50<17:52, 2.56s/it] 39%|██████████████████████████████ | 263/681 [17:53<18:06, 2.60s/it] {'loss': 0.6508, 'grad_norm': 43.3455696105957, 'learning_rate': 3.870196412960302e-07, 'rewards/chosen': -0.884239673614502, 'rewards/rejected': -2.3459324836730957, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.4616928100585938, 'logps/chosen': -99.48373413085938, 'logps/rejected': -203.59500122070312, 'logps/ref_chosen': -59.42094421386719, 'logps/ref_rejected': -96.85720825195312, 'logits/chosen': -0.9041777849197388, 'logits/rejected': -0.7808328866958618, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.021988723427057266, 'epsilon_dpo/loss_margin_mean': 66.67501068115234, 'epsilon_dpo/beta_margin_mean': 1.4616928100585938, 'epsilon_dpo/beta_margin_std': 1.3058973550796509, 'epsilon_dpo/beta_margin_grad_mean': -0.2505612075328827, 'epsilon_dpo/beta_margin_grad_std': 0.183075949549675, 'kl/beta': 0.022152669727802277, 'kl/avg_steps': 0.75, 'epoch': 0.39} 39%|██████████████████████████████ | 263/681 [17:53<18:06, 2.60s/it] 39%|██████████████████████████████▏ | 264/681 [17:56<18:35, 2.68s/it] {'loss': 0.7875, 'grad_norm': 45.67006301879883, 'learning_rate': 3.8594442875695665e-07, 'rewards/chosen': -0.9191652536392212, 'rewards/rejected': -2.025968074798584, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.1068028211593628, 'logps/chosen': -104.73934936523438, 'logps/rejected': -186.763916015625, 'logps/ref_chosen': -62.722084045410156, 'logps/ref_rejected': -93.85621643066406, 'logits/chosen': -0.8379767537117004, 'logits/rejected': -0.7533121109008789, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.021831907331943512, 'epsilon_dpo/loss_margin_mean': 50.89043426513672, 'epsilon_dpo/beta_margin_mean': 1.1068028211593628, 'epsilon_dpo/beta_margin_std': 
1.1502090692520142, 'epsilon_dpo/beta_margin_grad_mean': -0.2944892644882202, 'epsilon_dpo/beta_margin_grad_std': 0.17753368616104126, 'kl/beta': 0.021987760439515114, 'kl/avg_steps': 0.71875, 'epoch': 0.39} 39%|██████████████████████████████▏ | 264/681 [17:56<18:35, 2.68s/it] 39%|██████████████████████████████▎ | 265/681 [17:58<18:10, 2.62s/it] {'loss': 0.7847, 'grad_norm': 56.90526580810547, 'learning_rate': 3.848656339557562e-07, 'rewards/chosen': -1.0657553672790527, 'rewards/rejected': -2.260312557220459, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.1945571899414062, 'logps/chosen': -110.93099975585938, 'logps/rejected': -192.29173278808594, 'logps/ref_chosen': -61.971466064453125, 'logps/ref_rejected': -88.02059936523438, 'logits/chosen': -0.8070585131645203, 'logits/rejected': -0.7428088784217834, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.021703332662582397, 'epsilon_dpo/loss_margin_mean': 55.31159973144531, 'epsilon_dpo/beta_margin_mean': 1.1945571899414062, 'epsilon_dpo/beta_margin_std': 1.3362030982971191, 'epsilon_dpo/beta_margin_grad_mean': -0.29720309376716614, 'epsilon_dpo/beta_margin_grad_std': 0.18713915348052979, 'kl/beta': 0.021830851212143898, 'kl/avg_steps': 0.59375, 'epoch': 0.39} 39%|██████████████████████████████▎ | 265/681 [17:58<18:10, 2.62s/it] 39%|██████████████████████████████▍ | 266/681 [18:01<18:08, 2.62s/it] {'loss': 0.7567, 'grad_norm': 32.84288787841797, 'learning_rate': 3.8378328531967507e-07, 'rewards/chosen': -0.9995484352111816, 'rewards/rejected': -2.223362445831299, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.2238138914108276, 'logps/chosen': -113.43331146240234, 'logps/rejected': -171.31344604492188, 'logps/ref_chosen': -67.09967041015625, 'logps/ref_rejected': -67.97122192382812, 'logits/chosen': -0.8406482338905334, 'logits/rejected': -0.6030203104019165, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.021548166871070862, 
'epsilon_dpo/loss_margin_mean': 57.00858688354492, 'epsilon_dpo/beta_margin_mean': 1.2238138914108276, 'epsilon_dpo/beta_margin_std': 1.2477452754974365, 'epsilon_dpo/beta_margin_grad_mean': -0.28442031145095825, 'epsilon_dpo/beta_margin_grad_std': 0.1898299902677536, 'kl/beta': 0.02170199528336525, 'kl/avg_steps': 0.71875, 'epoch': 0.39} 39%|██████████████████████████████▍ | 266/681 [18:01<18:08, 2.62s/it] 39%|██████████████████████████████▌ | 267/681 [18:03<18:12, 2.64s/it] {'loss': 0.6767, 'grad_norm': 36.54864501953125, 'learning_rate': 3.8269741136960646e-07, 'rewards/chosen': -1.0386977195739746, 'rewards/rejected': -2.3372254371643066, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.2985275983810425, 'logps/chosen': -117.4482650756836, 'logps/rejected': -199.51803588867188, 'logps/ref_chosen': -68.97074890136719, 'logps/ref_rejected': -90.16844940185547, 'logits/chosen': -0.823701798915863, 'logits/rejected': -0.6632372140884399, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.02138766087591648, 'epsilon_dpo/loss_margin_mean': 60.872066497802734, 'epsilon_dpo/beta_margin_mean': 1.2985275983810425, 'epsilon_dpo/beta_margin_std': 1.1735737323760986, 'epsilon_dpo/beta_margin_grad_mean': -0.2644921839237213, 'epsilon_dpo/beta_margin_grad_std': 0.1655421257019043, 'kl/beta': 0.02154712565243244, 'kl/avg_steps': 0.75, 'epoch': 0.39} 39%|██████████████████████████████▌ | 267/681 [18:03<18:12, 2.64s/it] 39%|██████████████████████████████▋ | 268/681 [18:06<18:14, 2.65s/it] {'loss': 0.7472, 'grad_norm': 45.175697326660156, 'learning_rate': 3.8160804071933894e-07, 'rewards/chosen': -1.1551294326782227, 'rewards/rejected': -2.382227897644043, 'rewards/accuracies': 0.875, 'rewards/margins': 1.2270984649658203, 'logps/chosen': -110.12238311767578, 'logps/rejected': -213.89219665527344, 'logps/ref_chosen': -55.900306701660156, 'logps/ref_rejected': -101.64763641357422, 'logits/chosen': -0.6772645711898804, 'logits/rejected': 
-0.6899411678314209, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.021241815760731697, 'epsilon_dpo/loss_margin_mean': 58.02248764038086, 'epsilon_dpo/beta_margin_mean': 1.2270984649658203, 'epsilon_dpo/beta_margin_std': 1.2154557704925537, 'epsilon_dpo/beta_margin_grad_mean': -0.2792663872241974, 'epsilon_dpo/beta_margin_grad_std': 0.1909855753183365, 'kl/beta': 0.021386725828051567, 'kl/avg_steps': 0.6875, 'epoch': 0.39} 39%|██████████████████████████████▋ | 268/681 [18:06<18:14, 2.65s/it] 40%|██████████████████████████████▊ | 269/681 [18:09<18:08, 2.64s/it] {'loss': 0.6333, 'grad_norm': 55.38663101196289, 'learning_rate': 3.8051520207480204e-07, 'rewards/chosen': -1.2531479597091675, 'rewards/rejected': -2.8698487281799316, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.6167008876800537, 'logps/chosen': -129.2608642578125, 'logps/rejected': -243.49948120117188, 'logps/ref_chosen': -70.03955078125, 'logps/ref_rejected': -107.34937286376953, 'logits/chosen': -0.777718722820282, 'logits/rejected': -0.6842841506004333, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.02111669071018696, 'epsilon_dpo/loss_margin_mean': 76.9288101196289, 'epsilon_dpo/beta_margin_mean': 1.6167008876800537, 'epsilon_dpo/beta_margin_std': 1.4006026983261108, 'epsilon_dpo/beta_margin_grad_mean': -0.2359836995601654, 'epsilon_dpo/beta_margin_grad_std': 0.20190633833408356, 'kl/beta': 0.02124069631099701, 'kl/avg_steps': 0.59375, 'epoch': 0.4} 40%|██████████████████████████████▊ | 269/681 [18:09<18:08, 2.64s/it] 40%|██████████████████████████████▉ | 270/681 [18:11<18:29, 2.70s/it] {'loss': 0.7883, 'grad_norm': 39.35811233520508, 'learning_rate': 3.794189242333106e-07, 'rewards/chosen': -1.0666934251785278, 'rewards/rejected': -2.2703449726104736, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.2036516666412354, 'logps/chosen': -120.31138610839844, 'logps/rejected': -218.33621215820312, 'logps/ref_chosen': 
-69.53347778320312, 'logps/ref_rejected': -109.92864990234375, 'logits/chosen': -0.7690585851669312, 'logits/rejected': -0.7463020086288452, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.020985450595617294, 'epsilon_dpo/loss_margin_mean': 57.62965393066406, 'epsilon_dpo/beta_margin_mean': 1.2036516666412354, 'epsilon_dpo/beta_margin_std': 1.3258358240127563, 'epsilon_dpo/beta_margin_grad_mean': -0.29173266887664795, 'epsilon_dpo/beta_margin_grad_std': 0.1945246160030365, 'kl/beta': 0.021115323528647423, 'kl/avg_steps': 0.625, 'epoch': 0.4} 40%|██████████████████████████████▉ | 270/681 [18:12<18:29, 2.70s/it] 40%|███████████████████████████████ | 271/681 [18:14<17:58, 2.63s/it] {'loss': 0.702, 'grad_norm': 50.04530334472656, 'learning_rate': 3.7831923608280514e-07, 'rewards/chosen': -1.2361798286437988, 'rewards/rejected': -2.6154062747955322, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.3792264461517334, 'logps/chosen': -116.04220581054688, 'logps/rejected': -218.20013427734375, 'logps/ref_chosen': -56.76457214355469, 'logps/ref_rejected': -92.51383209228516, 'logits/chosen': -0.7107840776443481, 'logits/rejected': -0.6057737469673157, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.020835433155298233, 'epsilon_dpo/loss_margin_mean': 66.40867614746094, 'epsilon_dpo/beta_margin_mean': 1.3792264461517334, 'epsilon_dpo/beta_margin_std': 1.3681683540344238, 'epsilon_dpo/beta_margin_grad_mean': -0.2704339027404785, 'epsilon_dpo/beta_margin_grad_std': 0.1832066774368286, 'kl/beta': 0.020984172821044922, 'kl/avg_steps': 0.71875, 'epoch': 0.4} 40%|███████████████████████████████ | 271/681 [18:14<17:58, 2.63s/it] 40%|███████████████████████████████▏ | 272/681 [18:17<18:19, 2.69s/it] {'loss': 0.6611, 'grad_norm': 41.83186721801758, 'learning_rate': 3.772161666010912e-07, 'rewards/chosen': -1.223039150238037, 'rewards/rejected': -2.860086441040039, 'rewards/accuracies': 0.84375, 
'rewards/margins': 1.6370470523834229, 'logps/chosen': -108.57723999023438, 'logps/rejected': -244.06532287597656, 'logps/ref_chosen': -49.49715805053711, 'logps/ref_rejected': -105.54279327392578, 'logits/chosen': -0.5893428325653076, 'logits/rejected': -0.6096173524856567, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.020693259313702583, 'epsilon_dpo/loss_margin_mean': 79.44245147705078, 'epsilon_dpo/beta_margin_mean': 1.6370471715927124, 'epsilon_dpo/beta_margin_std': 1.55239999294281, 'epsilon_dpo/beta_margin_grad_mean': -0.2399667203426361, 'epsilon_dpo/beta_margin_grad_std': 0.2069646120071411, 'kl/beta': 0.0208344254642725, 'kl/avg_steps': 0.6875, 'epoch': 0.4} 40%|███████████████████████████████▏ | 272/681 [18:17<18:19, 2.69s/it] 40%|███████████████████████████████▎ | 273/681 [18:19<17:56, 2.64s/it] {'loss': 0.6271, 'grad_norm': 78.92858123779297, 'learning_rate': 3.761097448550755e-07, 'rewards/chosen': -1.290513038635254, 'rewards/rejected': -2.906710624694824, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.6161974668502808, 'logps/chosen': -125.79875183105469, 'logps/rejected': -234.24114990234375, 'logps/ref_chosen': -62.97539520263672, 'logps/ref_rejected': -92.49858093261719, 'logits/chosen': -0.6929817199707031, 'logits/rejected': -0.5817391872406006, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.02051962912082672, 'epsilon_dpo/loss_margin_mean': 78.91922760009766, 'epsilon_dpo/beta_margin_mean': 1.6161974668502808, 'epsilon_dpo/beta_margin_std': 1.4859600067138672, 'epsilon_dpo/beta_margin_grad_mean': -0.23856356739997864, 'epsilon_dpo/beta_margin_grad_std': 0.18430212140083313, 'kl/beta': 0.020692165940999985, 'kl/avg_steps': 0.84375, 'epoch': 0.4} 40%|███████████████████████████████▎ | 273/681 [18:19<17:56, 2.64s/it] 40%|███████████████████████████████▍ | 274/681 [18:22<18:25, 2.72s/it] {'loss': 0.7387, 'grad_norm': 54.08884048461914, 'learning_rate': 3.75e-07, 
'rewards/chosen': -1.521093487739563, 'rewards/rejected': -2.830944538116455, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.309850811958313, 'logps/chosen': -130.21035766601562, 'logps/rejected': -216.40267944335938, 'logps/ref_chosen': -55.66770935058594, 'logps/ref_rejected': -77.33308410644531, 'logits/chosen': -0.6644202470779419, 'logits/rejected': -0.5181941390037537, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.020380007103085518, 'epsilon_dpo/loss_margin_mean': 64.52693939208984, 'epsilon_dpo/beta_margin_mean': 1.309850811958313, 'epsilon_dpo/beta_margin_std': 1.3032691478729248, 'epsilon_dpo/beta_margin_grad_mean': -0.2764292061328888, 'epsilon_dpo/beta_margin_grad_std': 0.19955939054489136, 'kl/beta': 0.020519036799669266, 'kl/avg_steps': 0.6875, 'epoch': 0.4} 40%|███████████████████████████████▍ | 274/681 [18:22<18:25, 2.72s/it] 40%|███████████████████████████████▍ | 275/681 [18:25<18:27, 2.73s/it] {'loss': 0.7525, 'grad_norm': 55.116641998291016, 'learning_rate': 3.738869612786737e-07, 'rewards/chosen': -1.1173958778381348, 'rewards/rejected': -2.469923496246338, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.3525274991989136, 'logps/chosen': -103.64492797851562, 'logps/rejected': -215.45387268066406, 'logps/ref_chosen': -48.594703674316406, 'logps/ref_rejected': -93.30369567871094, 'logits/chosen': -0.6706832647323608, 'logits/rejected': -0.6582698822021484, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.020253589376807213, 'epsilon_dpo/loss_margin_mean': 67.09994506835938, 'epsilon_dpo/beta_margin_mean': 1.3525274991989136, 'epsilon_dpo/beta_margin_std': 1.4203792810440063, 'epsilon_dpo/beta_margin_grad_mean': -0.2785060405731201, 'epsilon_dpo/beta_margin_grad_std': 0.20673537254333496, 'kl/beta': 0.02037893235683441, 'kl/avg_steps': 0.625, 'epoch': 0.4} 40%|███████████████████████████████▍ | 275/681 [18:25<18:27, 2.73s/it] 41%|███████████████████████████████▌ | 
276/681 [18:28<18:16, 2.71s/it] {'loss': 0.7503, 'grad_norm': 97.52819061279297, 'learning_rate': 3.7277065802070204e-07, 'rewards/chosen': -1.257150650024414, 'rewards/rejected': -2.597024917602539, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.339874267578125, 'logps/chosen': -118.94365692138672, 'logps/rejected': -199.61550903320312, 'logps/ref_chosen': -56.57740783691406, 'logps/ref_rejected': -70.36566925048828, 'logits/chosen': -0.7014357447624207, 'logits/rejected': -0.5417762994766235, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.020127790048718452, 'epsilon_dpo/loss_margin_mean': 66.88359832763672, 'epsilon_dpo/beta_margin_mean': 1.339874267578125, 'epsilon_dpo/beta_margin_std': 1.46302330493927, 'epsilon_dpo/beta_margin_grad_mean': -0.28192076086997986, 'epsilon_dpo/beta_margin_grad_std': 0.1965862363576889, 'kl/beta': 0.020252354443073273, 'kl/avg_steps': 0.625, 'epoch': 0.41} 41%|███████████████████████████████▌ | 276/681 [18:28<18:16, 2.71s/it] 41%|███████████████████████████████▋ | 277/681 [18:30<17:30, 2.60s/it] {'loss': 0.6888, 'grad_norm': 76.95030212402344, 'learning_rate': 3.71651119641714e-07, 'rewards/chosen': -1.182989239692688, 'rewards/rejected': -2.46641206741333, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.283422589302063, 'logps/chosen': -115.48424530029297, 'logps/rejected': -216.51283264160156, 'logps/ref_chosen': -56.27156066894531, 'logps/ref_rejected': -92.88127136230469, 'logits/chosen': -0.6967241764068604, 'logits/rejected': -0.6158395409584045, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.019971320405602455, 'epsilon_dpo/loss_margin_mean': 64.41889190673828, 'epsilon_dpo/beta_margin_mean': 1.283422589302063, 'epsilon_dpo/beta_margin_std': 1.175058364868164, 'epsilon_dpo/beta_margin_grad_mean': -0.2710755169391632, 'epsilon_dpo/beta_margin_grad_std': 0.16731785237789154, 'kl/beta': 0.02012656256556511, 'kl/avg_steps': 0.78125, 'epoch': 
0.41} 41%|███████████████████████████████▋ | 277/681 [18:30<17:30, 2.60s/it] 41%|███████████████████████████████▊ | 278/681 [18:33<17:34, 2.62s/it] {'loss': 0.7055, 'grad_norm': 43.44647979736328, 'learning_rate': 3.705283756425872e-07, 'rewards/chosen': -1.2011182308197021, 'rewards/rejected': -2.7062265872955322, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.50510835647583, 'logps/chosen': -113.42312622070312, 'logps/rejected': -227.90817260742188, 'logps/ref_chosen': -52.94194030761719, 'logps/ref_rejected': -91.25357818603516, 'logits/chosen': -0.5835840702056885, 'logits/rejected': -0.5952056646347046, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.019841471686959267, 'epsilon_dpo/loss_margin_mean': 76.17341613769531, 'epsilon_dpo/beta_margin_mean': 1.5051084756851196, 'epsilon_dpo/beta_margin_std': 1.559203863143921, 'epsilon_dpo/beta_margin_grad_mean': -0.264595627784729, 'epsilon_dpo/beta_margin_grad_std': 0.19803930819034576, 'kl/beta': 0.019970543682575226, 'kl/avg_steps': 0.65625, 'epoch': 0.41} 41%|███████████████████████████████▊ | 278/681 [18:33<17:34, 2.62s/it] 41%|███████████████████████████████▉ | 279/681 [18:35<17:32, 2.62s/it] {'loss': 0.6789, 'grad_norm': 56.63746643066406, 'learning_rate': 3.6940245560867e-07, 'rewards/chosen': -1.4587175846099854, 'rewards/rejected': -3.074979305267334, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.616261601448059, 'logps/chosen': -122.50505065917969, 'logps/rejected': -244.05181884765625, 'logps/ref_chosen': -48.641319274902344, 'logps/ref_rejected': -87.8514404296875, 'logits/chosen': -0.5176438689231873, 'logits/rejected': -0.4667193591594696, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.019724512472748756, 'epsilon_dpo/loss_margin_mean': 82.3366470336914, 'epsilon_dpo/beta_margin_mean': 1.616261601448059, 'epsilon_dpo/beta_margin_std': 1.5595982074737549, 'epsilon_dpo/beta_margin_grad_mean': -0.25120463967323303, 
'epsilon_dpo/beta_margin_grad_std': 0.21320270001888275, 'kl/beta': 0.019840341061353683, 'kl/avg_steps': 0.59375, 'epoch': 0.41} 41%|███████████████████████████████▉ | 279/681 [18:35<17:32, 2.62s/it] 41%|████████████████████████████████ | 280/681 [18:38<17:49, 2.67s/it] {'loss': 0.5458, 'grad_norm': 51.763580322265625, 'learning_rate': 3.6827338920900253e-07, 'rewards/chosen': -1.44656240940094, 'rewards/rejected': -3.041074752807617, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.5945125818252563, 'logps/chosen': -132.6591033935547, 'logps/rejected': -254.16738891601562, 'logps/ref_chosen': -58.797122955322266, 'logps/ref_rejected': -98.61885070800781, 'logits/chosen': -0.5518442988395691, 'logits/rejected': -0.5683473944664001, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.01956493966281414, 'epsilon_dpo/loss_margin_mean': 81.6865463256836, 'epsilon_dpo/beta_margin_mean': 1.5945125818252563, 'epsilon_dpo/beta_margin_std': 1.1257351636886597, 'epsilon_dpo/beta_margin_grad_mean': -0.2162458449602127, 'epsilon_dpo/beta_margin_grad_std': 0.16580165922641754, 'kl/beta': 0.019723234698176384, 'kl/avg_steps': 0.8125, 'epoch': 0.41} 41%|████████████████████████████████ | 280/681 [18:38<17:49, 2.67s/it] 41%|████████████████████████████████▏ | 281/681 [18:41<17:59, 2.70s/it] {'loss': 0.6922, 'grad_norm': 64.7247543334961, 'learning_rate': 3.6714120619553435e-07, 'rewards/chosen': -1.4034972190856934, 'rewards/rejected': -2.8288512229919434, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.42535400390625, 'logps/chosen': -127.68087768554688, 'logps/rejected': -226.75335693359375, 'logps/ref_chosen': -55.488521575927734, 'logps/ref_rejected': -80.88258361816406, 'logits/chosen': -0.5656956434249878, 'logits/rejected': -0.42098718881607056, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.01940114051103592, 'epsilon_dpo/loss_margin_mean': 73.67842102050781, 'epsilon_dpo/beta_margin_mean': 
1.42535400390625, 'epsilon_dpo/beta_margin_std': 1.3284828662872314, 'epsilon_dpo/beta_margin_grad_mean': -0.24567550420761108, 'epsilon_dpo/beta_margin_grad_std': 0.19012600183486938, 'kl/beta': 0.019564274698495865, 'kl/avg_steps': 0.84375, 'epoch': 0.41} 41%|████████████████████████████████▏ | 281/681 [18:41<17:59, 2.70s/it] 41%|████████████████████████████████▎ | 282/681 [18:43<17:47, 2.68s/it] {'loss': 0.701, 'grad_norm': 50.608116149902344, 'learning_rate': 3.660059364023408e-07, 'rewards/chosen': -1.5348063707351685, 'rewards/rejected': -2.8041722774505615, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.2693657875061035, 'logps/chosen': -152.64346313476562, 'logps/rejected': -241.0337677001953, 'logps/ref_chosen': -73.07014465332031, 'logps/ref_rejected': -95.35098266601562, 'logits/chosen': -0.6203180551528931, 'logits/rejected': -0.36443668603897095, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.01927519217133522, 'epsilon_dpo/loss_margin_mean': 66.10945892333984, 'epsilon_dpo/beta_margin_mean': 1.269365906715393, 'epsilon_dpo/beta_margin_std': 1.2346893548965454, 'epsilon_dpo/beta_margin_grad_mean': -0.27404311299324036, 'epsilon_dpo/beta_margin_grad_std': 0.16918019950389862, 'kl/beta': 0.01940058171749115, 'kl/avg_steps': 0.65625, 'epoch': 0.41} 41%|████████████████████████████████▎ | 282/681 [18:43<17:47, 2.68s/it] 42%|████████████████████████████████▍ | 283/681 [18:46<17:42, 2.67s/it] {'loss': 0.6792, 'grad_norm': 56.80989074707031, 'learning_rate': 3.6486760974483685e-07, 'rewards/chosen': -1.5973310470581055, 'rewards/rejected': -3.174569845199585, 'rewards/accuracies': 0.875, 'rewards/margins': 1.5772387981414795, 'logps/chosen': -145.3206787109375, 'logps/rejected': -263.1604919433594, 'logps/ref_chosen': -61.89844512939453, 'logps/ref_rejected': -96.98655700683594, 'logits/chosen': -0.52199387550354, 'logits/rejected': -0.4504436254501343, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 
'epsilon_dpo/beta': 0.019131455570459366, 'epsilon_dpo/loss_margin_mean': 82.75170135498047, 'epsilon_dpo/beta_margin_mean': 1.57723867893219, 'epsilon_dpo/beta_margin_std': 1.5101546049118042, 'epsilon_dpo/beta_margin_grad_mean': -0.24582470953464508, 'epsilon_dpo/beta_margin_grad_std': 0.20526982843875885, 'kl/beta': 0.019274096935987473, 'kl/avg_steps': 0.75, 'epoch': 0.42} 42%|████████████████████████████████▍ | 283/681 [18:46<17:42, 2.67s/it] 42%|████████████████████████████████▌ | 284/681 [18:49<18:04, 2.73s/it] {'loss': 0.6139, 'grad_norm': 54.7606315612793, 'learning_rate': 3.6372625621898863e-07, 'rewards/chosen': -1.6118556261062622, 'rewards/rejected': -3.2014176845550537, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.589562177658081, 'logps/chosen': -143.29278564453125, 'logps/rejected': -262.27685546875, 'logps/ref_chosen': -58.4355354309082, 'logps/ref_rejected': -93.46926879882812, 'logits/chosen': -0.46351319551467896, 'logits/rejected': -0.44188055396080017, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.018989035859704018, 'epsilon_dpo/loss_margin_mean': 83.95030975341797, 'epsilon_dpo/beta_margin_mean': 1.589562177658081, 'epsilon_dpo/beta_margin_std': 1.5010643005371094, 'epsilon_dpo/beta_margin_grad_mean': -0.24250973761081696, 'epsilon_dpo/beta_margin_grad_std': 0.1711161881685257, 'kl/beta': 0.019130617380142212, 'kl/avg_steps': 0.75, 'epoch': 0.42} 42%|████████████████████████████████▌ | 284/681 [18:49<18:04, 2.73s/it] 42%|████████████████████████████████▋ | 285/681 [18:52<17:56, 2.72s/it] {'loss': 0.6651, 'grad_norm': 71.8442153930664, 'learning_rate': 3.625819059005228e-07, 'rewards/chosen': -1.5869226455688477, 'rewards/rejected': -3.023146867752075, 'rewards/accuracies': 0.875, 'rewards/margins': 1.4362244606018066, 'logps/chosen': -150.2645263671875, 'logps/rejected': -259.6064453125, 'logps/ref_chosen': -66.2322006225586, 'logps/ref_rejected': -99.1268310546875, 'logits/chosen': 
-0.5949057340621948, 'logits/rejected': -0.5442003607749939, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.01885361224412918, 'epsilon_dpo/loss_margin_mean': 76.44730377197266, 'epsilon_dpo/beta_margin_mean': 1.436224341392517, 'epsilon_dpo/beta_margin_std': 1.289421796798706, 'epsilon_dpo/beta_margin_grad_mean': -0.25107693672180176, 'epsilon_dpo/beta_margin_grad_std': 0.18937461078166962, 'kl/beta': 0.01898820511996746, 'kl/avg_steps': 0.71875, 'epoch': 0.42} 42%|████████████████████████████████▋ | 285/681 [18:52<17:56, 2.72s/it] 42%|████████████████████████████████▊ | 286/681 [18:55<18:17, 2.78s/it] {'loss': 0.6691, 'grad_norm': 45.75380325317383, 'learning_rate': 3.614345889441346e-07, 'rewards/chosen': -1.4485076665878296, 'rewards/rejected': -3.016570568084717, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.5680627822875977, 'logps/chosen': -150.214599609375, 'logps/rejected': -249.91607666015625, 'logps/ref_chosen': -72.95100402832031, 'logps/ref_rejected': -88.58845520019531, 'logits/chosen': -0.6205496788024902, 'logits/rejected': -0.4964269995689392, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.018724961206316948, 'epsilon_dpo/loss_margin_mean': 84.06401824951172, 'epsilon_dpo/beta_margin_mean': 1.5680627822875977, 'epsilon_dpo/beta_margin_std': 1.476419448852539, 'epsilon_dpo/beta_margin_grad_mean': -0.24600212275981903, 'epsilon_dpo/beta_margin_grad_std': 0.19994834065437317, 'kl/beta': 0.018852701410651207, 'kl/avg_steps': 0.6875, 'epoch': 0.42} 42%|████████████████████████████████▊ | 286/681 [18:55<18:17, 2.78s/it] 42%|████████████████████████████████▊ | 287/681 [18:57<17:39, 2.69s/it] {'loss': 0.7981, 'grad_norm': 51.563961029052734, 'learning_rate': 3.6028433558269275e-07, 'rewards/chosen': -1.4748058319091797, 'rewards/rejected': -2.791060447692871, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.3162546157836914, 'logps/chosen': -140.733154296875, 
'logps/rejected': -228.00364685058594, 'logps/ref_chosen': -61.54115295410156, 'logps/ref_rejected': -77.6960678100586, 'logits/chosen': -0.5361831188201904, 'logits/rejected': -0.3548717200756073, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.018591254949569702, 'epsilon_dpo/loss_margin_mean': 71.11557006835938, 'epsilon_dpo/beta_margin_mean': 1.3162546157836914, 'epsilon_dpo/beta_margin_std': 1.5128830671310425, 'epsilon_dpo/beta_margin_grad_mean': -0.2821826934814453, 'epsilon_dpo/beta_margin_grad_std': 0.212895929813385, 'kl/beta': 0.018723974004387856, 'kl/avg_steps': 0.71875, 'epoch': 0.42} 42%|████████████████████████████████▊ | 287/681 [18:57<17:39, 2.69s/it] 42%|████████████████████████████████▉ | 288/681 [19:00<17:37, 2.69s/it] {'loss': 0.6233, 'grad_norm': 72.87432098388672, 'learning_rate': 3.5913117612644327e-07, 'rewards/chosen': -1.462754726409912, 'rewards/rejected': -2.9758520126342773, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.5130971670150757, 'logps/chosen': -135.8625030517578, 'logps/rejected': -248.79489135742188, 'logps/ref_chosen': -56.661224365234375, 'logps/ref_rejected': -87.335693359375, 'logits/chosen': -0.5659801959991455, 'logits/rejected': -0.4238738715648651, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.018441151827573776, 'epsilon_dpo/loss_margin_mean': 82.2579116821289, 'epsilon_dpo/beta_margin_mean': 1.5130971670150757, 'epsilon_dpo/beta_margin_std': 1.265858769416809, 'epsilon_dpo/beta_margin_grad_mean': -0.2382936030626297, 'epsilon_dpo/beta_margin_grad_std': 0.18369296193122864, 'kl/beta': 0.018590355291962624, 'kl/avg_steps': 0.8125, 'epoch': 0.42} 42%|████████████████████████████████▉ | 288/681 [19:00<17:37, 2.69s/it] 42%|█████████████████████████████████ | 289/681 [19:02<17:20, 2.65s/it] {'loss': 0.66, 'grad_norm': 43.197750091552734, 'learning_rate': 3.5797514096221024e-07, 'rewards/chosen': -1.500335693359375, 'rewards/rejected': 
-3.1445508003234863, 'rewards/accuracies': 0.875, 'rewards/margins': 1.6442151069641113, 'logps/chosen': -127.05389404296875, 'logps/rejected': -259.5618896484375, 'logps/ref_chosen': -45.23039245605469, 'logps/ref_rejected': -87.64266967773438, 'logits/chosen': -0.4051462411880493, 'logits/rejected': -0.3806743025779724, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.018321340903639793, 'epsilon_dpo/loss_margin_mean': 90.0957260131836, 'epsilon_dpo/beta_margin_mean': 1.6442151069641113, 'epsilon_dpo/beta_margin_std': 1.6481398344039917, 'epsilon_dpo/beta_margin_grad_mean': -0.25129714608192444, 'epsilon_dpo/beta_margin_grad_std': 0.19622960686683655, 'kl/beta': 0.018440525978803635, 'kl/avg_steps': 0.65625, 'epoch': 0.42} 42%|█████████████████████████████████ | 289/681 [19:02<17:20, 2.65s/it] 43%|█████████████████████████████████▏ | 290/681 [19:05<16:59, 2.61s/it] {'loss': 0.6338, 'grad_norm': 60.19779968261719, 'learning_rate': 3.568162605525952e-07, 'rewards/chosen': -1.6394047737121582, 'rewards/rejected': -3.3937277793884277, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.7543230056762695, 'logps/chosen': -145.60018920898438, 'logps/rejected': -303.62261962890625, 'logps/ref_chosen': -55.47149658203125, 'logps/ref_rejected': -116.70857238769531, 'logits/chosen': -0.41647857427597046, 'logits/rejected': -0.5270963907241821, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.018167538568377495, 'epsilon_dpo/loss_margin_mean': 96.78536987304688, 'epsilon_dpo/beta_margin_mean': 1.7543230056762695, 'epsilon_dpo/beta_margin_std': 1.7301348447799683, 'epsilon_dpo/beta_margin_grad_mean': -0.23720751702785492, 'epsilon_dpo/beta_margin_grad_std': 0.19694750010967255, 'kl/beta': 0.018320299685001373, 'kl/avg_steps': 0.84375, 'epoch': 0.43} 43%|█████████████████████████████████▏ | 290/681 [19:05<16:59, 2.61s/it] 43%|█████████████████████████████████▎ | 291/681 [19:07<17:09, 2.64s/it] {'loss': 
0.6417, 'grad_norm': 49.096588134765625, 'learning_rate': 3.5565456543517485e-07, 'rewards/chosen': -1.3472115993499756, 'rewards/rejected': -2.8971734046936035, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.549961805343628, 'logps/chosen': -137.8589630126953, 'logps/rejected': -250.12759399414062, 'logps/ref_chosen': -63.26036834716797, 'logps/ref_rejected': -89.29708862304688, 'logits/chosen': -0.45129257440567017, 'logits/rejected': -0.40018701553344727, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.018032565712928772, 'epsilon_dpo/loss_margin_mean': 86.23190307617188, 'epsilon_dpo/beta_margin_mean': 1.5499616861343384, 'epsilon_dpo/beta_margin_std': 1.3567811250686646, 'epsilon_dpo/beta_margin_grad_mean': -0.24202531576156616, 'epsilon_dpo/beta_margin_grad_std': 0.19435711205005646, 'kl/beta': 0.01816701516509056, 'kl/avg_steps': 0.75, 'epoch': 0.43} 43%|█████████████████████████████████▎ | 291/681 [19:08<17:09, 2.64s/it] 43%|█████████████████████████████████▍ | 292/681 [19:10<16:50, 2.60s/it] {'loss': 0.7375, 'grad_norm': 136.08177185058594, 'learning_rate': 3.5449008622169583e-07, 'rewards/chosen': -1.7816226482391357, 'rewards/rejected': -3.2855310440063477, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.5039082765579224, 'logps/chosen': -153.21641540527344, 'logps/rejected': -273.61773681640625, 'logps/ref_chosen': -53.91852951049805, 'logps/ref_rejected': -89.96138000488281, 'logits/chosen': -0.36719846725463867, 'logits/rejected': -0.2681925892829895, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.017903964966535568, 'epsilon_dpo/loss_margin_mean': 84.35847473144531, 'epsilon_dpo/beta_margin_mean': 1.5039082765579224, 'epsilon_dpo/beta_margin_std': 1.5898772478103638, 'epsilon_dpo/beta_margin_grad_mean': -0.260786235332489, 'epsilon_dpo/beta_margin_grad_std': 0.21484361588954926, 'kl/beta': 0.018031777814030647, 'kl/avg_steps': 0.71875, 'epoch': 0.43} 
43%|█████████████████████████████████▍ | 292/681 [19:10<16:50, 2.60s/it] 43%|█████████████████████████████████▌ | 293/681 [19:13<17:07, 2.65s/it] {'loss': 0.834, 'grad_norm': 65.80419158935547, 'learning_rate': 3.5332285359726846e-07, 'rewards/chosen': -1.6426957845687866, 'rewards/rejected': -2.940351963043213, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.2976562976837158, 'logps/chosen': -152.51913452148438, 'logps/rejected': -243.33319091796875, 'logps/ref_chosen': -60.376033782958984, 'logps/ref_rejected': -77.8524398803711, 'logits/chosen': -0.286105751991272, 'logits/rejected': -0.1886579692363739, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.01780417375266552, 'epsilon_dpo/loss_margin_mean': 73.33763885498047, 'epsilon_dpo/beta_margin_mean': 1.2976562976837158, 'epsilon_dpo/beta_margin_std': 1.5245400667190552, 'epsilon_dpo/beta_margin_grad_mean': -0.2954978048801422, 'epsilon_dpo/beta_margin_grad_std': 0.22622421383857727, 'kl/beta': 0.017903098836541176, 'kl/avg_steps': 0.5625, 'epoch': 0.43} 43%|█████████████████████████████████▌ | 293/681 [19:13<17:07, 2.65s/it] 43%|█████████████████████████████████▋ | 294/681 [19:15<16:52, 2.62s/it] {'loss': 0.7448, 'grad_norm': 62.94172286987305, 'learning_rate': 3.5215289831955786e-07, 'rewards/chosen': -1.5494587421417236, 'rewards/rejected': -2.9561705589294434, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.4067118167877197, 'logps/chosen': -135.5957794189453, 'logps/rejected': -249.2960968017578, 'logps/ref_chosen': -48.0875358581543, 'logps/ref_rejected': -81.89698791503906, 'logits/chosen': -0.23500093817710876, 'logits/rejected': -0.21314044296741486, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.01769345812499523, 'epsilon_dpo/loss_margin_mean': 79.890869140625, 'epsilon_dpo/beta_margin_mean': 1.4067116975784302, 'epsilon_dpo/beta_margin_std': 1.5747895240783691, 'epsilon_dpo/beta_margin_grad_mean': -0.27741488814353943, 
'epsilon_dpo/beta_margin_grad_std': 0.20217780768871307, 'kl/beta': 0.017802957445383072, 'kl/avg_steps': 0.625, 'epoch': 0.43} 43%|█████████████████████████████████▋ | 294/681 [19:15<16:52, 2.62s/it] 43%|█████████████████████████████████▊ | 295/681 [19:18<16:39, 2.59s/it] {'loss': 0.741, 'grad_norm': 83.32816314697266, 'learning_rate': 3.509802512179737e-07, 'rewards/chosen': -1.7668020725250244, 'rewards/rejected': -3.1390466690063477, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.3722447156906128, 'logps/chosen': -150.30224609375, 'logps/rejected': -266.259765625, 'logps/ref_chosen': -49.92467498779297, 'logps/ref_rejected': -87.45632934570312, 'logits/chosen': -0.19670921564102173, 'logits/rejected': -0.21700705587863922, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.01757803000509739, 'epsilon_dpo/loss_margin_mean': 78.42586517333984, 'epsilon_dpo/beta_margin_mean': 1.3722447156906128, 'epsilon_dpo/beta_margin_std': 1.434361219406128, 'epsilon_dpo/beta_margin_grad_mean': -0.27356937527656555, 'epsilon_dpo/beta_margin_grad_std': 0.20483291149139404, 'kl/beta': 0.017692379653453827, 'kl/avg_steps': 0.65625, 'epoch': 0.43} 43%|█████████████████████████████████▊ | 295/681 [19:18<16:39, 2.59s/it] 43%|█████████████████████████████████▉ | 296/681 [19:20<16:32, 2.58s/it] {'loss': 0.9407, 'grad_norm': 100.70362854003906, 'learning_rate': 3.498049431928577e-07, 'rewards/chosen': -1.7783277034759521, 'rewards/rejected': -2.909916639328003, 'rewards/accuracies': 0.8125, 'rewards/margins': 1.1315889358520508, 'logps/chosen': -166.9285888671875, 'logps/rejected': -259.7269287109375, 'logps/ref_chosen': -65.49124145507812, 'logps/ref_rejected': -93.08908081054688, 'logits/chosen': -0.46437016129493713, 'logits/rejected': -0.2735271751880646, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.017485400661826134, 'epsilon_dpo/loss_margin_mean': 65.20050048828125, 'epsilon_dpo/beta_margin_mean': 
1.1315889358520508, 'epsilon_dpo/beta_margin_std': 1.5379483699798584, 'epsilon_dpo/beta_margin_grad_mean': -0.3116765022277832, 'epsilon_dpo/beta_margin_grad_std': 0.2334553748369217, 'kl/beta': 0.017577029764652252, 'kl/avg_steps': 0.53125, 'epoch': 0.43} 43%|█████████████████████████████████▉ | 296/681 [19:20<16:32, 2.58s/it] 44%|██████████████████████████████████ | 297/681 [19:23<16:45, 2.62s/it] {'loss': 0.7113, 'grad_norm': 55.77240753173828, 'learning_rate': 3.486270052146694e-07, 'rewards/chosen': -1.6452429294586182, 'rewards/rejected': -2.93807053565979, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.2928276062011719, 'logps/chosen': -151.24240112304688, 'logps/rejected': -264.64349365234375, 'logps/ref_chosen': -56.47694778442383, 'logps/ref_rejected': -95.1385498046875, 'logits/chosen': -0.3529755473136902, 'logits/rejected': -0.2527937591075897, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.017354749143123627, 'epsilon_dpo/loss_margin_mean': 74.73949432373047, 'epsilon_dpo/beta_margin_mean': 1.2928276062011719, 'epsilon_dpo/beta_margin_std': 1.2205746173858643, 'epsilon_dpo/beta_margin_grad_mean': -0.2685171365737915, 'epsilon_dpo/beta_margin_grad_std': 0.18320339918136597, 'kl/beta': 0.017484145238995552, 'kl/avg_steps': 0.75, 'epoch': 0.44} 44%|██████████████████████████████████ | 297/681 [19:23<16:45, 2.62s/it] 44%|██████████████████████████████████▏ | 298/681 [19:26<17:17, 2.71s/it] {'loss': 0.6597, 'grad_norm': 48.75471878051758, 'learning_rate': 3.474464683231698e-07, 'rewards/chosen': -1.5013866424560547, 'rewards/rejected': -3.0520200729370117, 'rewards/accuracies': 0.875, 'rewards/margins': 1.5506335496902466, 'logps/chosen': -154.344970703125, 'logps/rejected': -293.95916748046875, 'logps/ref_chosen': -67.32516479492188, 'logps/ref_rejected': -116.66217041015625, 'logits/chosen': -0.37584325671195984, 'logits/rejected': -0.48060327768325806, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 
0.171875, 'epsilon_dpo/beta': 0.0172418300062418, 'epsilon_dpo/loss_margin_mean': 90.27720642089844, 'epsilon_dpo/beta_margin_mean': 1.5506335496902466, 'epsilon_dpo/beta_margin_std': 1.5275505781173706, 'epsilon_dpo/beta_margin_grad_mean': -0.2541157007217407, 'epsilon_dpo/beta_margin_grad_std': 0.18940328061580658, 'kl/beta': 0.01735399104654789, 'kl/avg_steps': 0.65625, 'epoch': 0.44} 44%|██████████████████████████████████▏ | 298/681 [19:26<17:17, 2.71s/it] 44%|██████████████████████████████████▏ | 299/681 [19:29<17:22, 2.73s/it] {'loss': 0.7716, 'grad_norm': 57.401187896728516, 'learning_rate': 3.462633636266041e-07, 'rewards/chosen': -1.4078408479690552, 'rewards/rejected': -2.9208476543426514, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.5130068063735962, 'logps/chosen': -130.8524627685547, 'logps/rejected': -255.00015258789062, 'logps/ref_chosen': -48.96209716796875, 'logps/ref_rejected': -84.32823944091797, 'logits/chosen': -0.24702782928943634, 'logits/rejected': -0.19395104050636292, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.01714019477367401, 'epsilon_dpo/loss_margin_mean': 88.78156280517578, 'epsilon_dpo/beta_margin_mean': 1.5130068063735962, 'epsilon_dpo/beta_margin_std': 1.702101469039917, 'epsilon_dpo/beta_margin_grad_mean': -0.2694842517375946, 'epsilon_dpo/beta_margin_grad_std': 0.22855404019355774, 'kl/beta': 0.017240848392248154, 'kl/avg_steps': 0.59375, 'epoch': 0.44} 44%|██████████████████████████████████▏ | 299/681 [19:29<17:22, 2.73s/it] 44%|██████████████████████████████████▎ | 300/681 [19:32<17:26, 2.75s/it] {'loss': 0.7955, 'grad_norm': 118.19168853759766, 'learning_rate': 3.4507772230088147e-07, 'rewards/chosen': -1.8646724224090576, 'rewards/rejected': -3.472029685974121, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.6073572635650635, 'logps/chosen': -168.39486694335938, 'logps/rejected': -300.21112060546875, 'logps/ref_chosen': -59.073707580566406, 'logps/ref_rejected': 
-95.9664535522461, 'logits/chosen': -0.23180323839187622, 'logits/rejected': -0.23513799905776978, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.017025606706738472, 'epsilon_dpo/loss_margin_mean': 94.92350769042969, 'epsilon_dpo/beta_margin_mean': 1.6073572635650635, 'epsilon_dpo/beta_margin_std': 1.7735728025436401, 'epsilon_dpo/beta_margin_grad_mean': -0.25651147961616516, 'epsilon_dpo/beta_margin_grad_std': 0.24618221819400787, 'kl/beta': 0.01713908463716507, 'kl/avg_steps': 0.671875, 'epoch': 0.44} 44%|██████████████████████████████████▎ | 300/681 [19:32<17:26, 2.75s/it]
[INFO|trainer.py:4307] 2026-04-18 00:57:41,298 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-18 00:57:41,298 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-18 00:57:41,298 >> Batch size = 8
0%| | 0/73 [00:00<?, ?it/s]
***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-18 01:02:50,983 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-18 01:02:50,983 >> Batch size = 8
0%| | 0/73 [00:00<?, ?it/s]
>> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400
[INFO|configuration_utils.py:419] 2026-04-18 01:03:49,117 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400/config.json
[INFO|configuration_utils.py:911] 2026-04-18 01:03:49,137 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-18 01:04:50,950 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards.
You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-18 01:04:50,972 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-18 01:04:50,991 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400/special_tokens_map.json 59%|████████████████████████████████████████████▏ | 401/681 [30:37<8:26:20, 108.50s/it] {'loss': 1.035, 'grad_norm': 50.62467575073242, 'learning_rate': 2.1800473436235136e-07, 'rewards/chosen': -1.1949478387832642, 'rewards/rejected': -2.018589973449707, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.8236421346664429, 'logps/chosen': -199.46241760253906, 'logps/rejected': -325.35943603515625, 'logps/ref_chosen': -57.16303253173828, 'logps/ref_rejected': -83.79249572753906, 'logits/chosen': 0.2227509617805481, 'logits/rejected': 0.32873934507369995, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.008373704738914967, 'epsilon_dpo/loss_margin_mean': 99.26753997802734, 'epsilon_dpo/beta_margin_mean': 0.8236421346664429, 'epsilon_dpo/beta_margin_std': 1.309921145439148, 'epsilon_dpo/beta_margin_grad_mean': -0.35600748658180237, 'epsilon_dpo/beta_margin_grad_std': 0.218048557639122, 'kl/beta': 0.008412300609052181, 'kl/avg_steps': 0.46875, 'epoch': 0.59} 59%|████████████████████████████████████████████▏ | 401/681 [30:37<8:26:20, 108.50s/it] 59%|████████████████████████████████████████████▊ | 402/681 [30:39<5:56:29, 76.66s/it] {'loss': 0.5727, 'grad_norm': 27.302982330322266, 'learning_rate': 2.1673238449588665e-07, 
'rewards/chosen': -0.8723675012588501, 'rewards/rejected': -2.2928786277770996, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.42051100730896, 'logps/chosen': -155.729248046875, 'logps/rejected': -357.44720458984375, 'logps/ref_chosen': -50.74037170410156, 'logps/ref_rejected': -81.0460433959961, 'logits/chosen': 0.1841059774160385, 'logits/rejected': 0.4696100056171417, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.008303234353661537, 'epsilon_dpo/loss_margin_mean': 171.4123077392578, 'epsilon_dpo/beta_margin_mean': 1.42051100730896, 'epsilon_dpo/beta_margin_std': 0.9418905377388, 'epsilon_dpo/beta_margin_grad_mean': -0.23055776953697205, 'epsilon_dpo/beta_margin_grad_std': 0.1520979404449463, 'kl/beta': 0.008373051881790161, 'kl/avg_steps': 0.84375, 'epoch': 0.59} 59%|████████████████████████████████████████████▊ | 402/681 [30:39<5:56:29, 76.66s/it] 59%|████████████████████████████████████████████▉ | 403/681 [30:42<4:12:09, 54.42s/it] {'loss': 0.7888, 'grad_norm': 37.2651252746582, 'learning_rate': 2.154609112620295e-07, 'rewards/chosen': -1.0179142951965332, 'rewards/rejected': -2.0947608947753906, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.0768463611602783, 'logps/chosen': -170.4490509033203, 'logps/rejected': -331.73577880859375, 'logps/ref_chosen': -47.14731216430664, 'logps/ref_rejected': -77.2666015625, 'logits/chosen': 0.3366158604621887, 'logits/rejected': 0.5971213579177856, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.00824414286762476, 'epsilon_dpo/loss_margin_mean': 131.1674041748047, 'epsilon_dpo/beta_margin_mean': 1.0768463611602783, 'epsilon_dpo/beta_margin_std': 1.0528734922409058, 'epsilon_dpo/beta_margin_grad_mean': -0.29532960057258606, 'epsilon_dpo/beta_margin_grad_std': 0.18958072364330292, 'kl/beta': 0.008302995935082436, 'kl/avg_steps': 0.71875, 'epoch': 0.59} 59%|████████████████████████████████████████████▉ | 403/681 [30:42<4:12:09, 54.42s/it] 
59%|█████████████████████████████████████████████ | 404/681 [30:44<2:59:24, 38.86s/it] {'loss': 0.7747, 'grad_norm': 50.897254943847656, 'learning_rate': 2.1419034816528218e-07, 'rewards/chosen': -1.1005568504333496, 'rewards/rejected': -2.2049245834350586, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.104367971420288, 'logps/chosen': -182.01962280273438, 'logps/rejected': -346.76849365234375, 'logps/ref_chosen': -47.875274658203125, 'logps/ref_rejected': -77.15499877929688, 'logits/chosen': 0.2518424093723297, 'logits/rejected': 0.5586760640144348, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.008187886327505112, 'epsilon_dpo/loss_margin_mean': 135.46914672851562, 'epsilon_dpo/beta_margin_mean': 1.1043678522109985, 'epsilon_dpo/beta_margin_std': 1.079614281654358, 'epsilon_dpo/beta_margin_grad_mean': -0.2925785481929779, 'epsilon_dpo/beta_margin_grad_std': 0.18369947373867035, 'kl/beta': 0.00824374333024025, 'kl/avg_steps': 0.6875, 'epoch': 0.59} 59%|█████████████████████████████████████████████ | 404/681 [30:44<2:59:24, 38.86s/it] 59%|█████████████████████████████████████████████▏ | 405/681 [30:47<2:08:38, 27.97s/it] {'loss': 0.9057, 'grad_norm': 48.55431365966797, 'learning_rate': 2.129207286861638e-07, 'rewards/chosen': -1.3364191055297852, 'rewards/rejected': -2.2470414638519287, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.9106223583221436, 'logps/chosen': -228.88922119140625, 'logps/rejected': -363.45953369140625, 'logps/ref_chosen': -65.16290283203125, 'logps/ref_rejected': -87.18678283691406, 'logits/chosen': 0.21004986763000488, 'logits/rejected': 0.4672671854496002, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.008149891160428524, 'epsilon_dpo/loss_margin_mean': 112.54642486572266, 'epsilon_dpo/beta_margin_mean': 0.9106223583221436, 'epsilon_dpo/beta_margin_std': 1.1335408687591553, 'epsilon_dpo/beta_margin_grad_mean': -0.3321351110935211, 
'epsilon_dpo/beta_margin_grad_std': 0.19370493292808533, 'kl/beta': 0.008187455125153065, 'kl/avg_steps': 0.46875, 'epoch': 0.59} 59%|█████████████████████████████████████████████▏ | 405/681 [30:47<2:08:38, 27.97s/it] 60%|█████████████████████████████████████████████▎ | 406/681 [30:50<1:33:17, 20.36s/it] {'loss': 0.8218, 'grad_norm': 38.041080474853516, 'learning_rate': 2.1165208628032861e-07, 'rewards/chosen': -1.23288893699646, 'rewards/rejected': -2.300886392593384, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.0679974555969238, 'logps/chosen': -201.58392333984375, 'logps/rejected': -376.50213623046875, 'logps/ref_chosen': -49.740814208984375, 'logps/ref_rejected': -92.07862854003906, 'logits/chosen': 0.43436455726623535, 'logits/rejected': 0.49039581418037415, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.008104225620627403, 'epsilon_dpo/loss_margin_mean': 132.58038330078125, 'epsilon_dpo/beta_margin_mean': 1.0679973363876343, 'epsilon_dpo/beta_margin_std': 1.1687884330749512, 'epsilon_dpo/beta_margin_grad_mean': -0.3036099374294281, 'epsilon_dpo/beta_margin_grad_std': 0.19255390763282776, 'kl/beta': 0.008149255067110062, 'kl/avg_steps': 0.5625, 'epoch': 0.6} 60%|█████████████████████████████████████████████▎ | 406/681 [30:50<1:33:17, 20.36s/it] 60%|█████████████████████████████████████████████▍ | 407/681 [30:52<1:08:38, 15.03s/it] {'loss': 0.9052, 'grad_norm': 80.59246826171875, 'learning_rate': 2.1038445437768375e-07, 'rewards/chosen': -1.4123256206512451, 'rewards/rejected': -2.2966251373291016, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8842993974685669, 'logps/chosen': -231.44268798828125, 'logps/rejected': -363.0897521972656, 'logps/ref_chosen': -56.33069610595703, 'logps/ref_rejected': -77.5120849609375, 'logits/chosen': 0.39705413579940796, 'logits/rejected': 0.7918181419372559, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.008051296696066856, 
'epsilon_dpo/loss_margin_mean': 110.4656753540039, 'epsilon_dpo/beta_margin_mean': 0.8842993974685669, 'epsilon_dpo/beta_margin_std': 1.0626513957977295, 'epsilon_dpo/beta_margin_grad_mean': -0.32882165908813477, 'epsilon_dpo/beta_margin_grad_std': 0.19124306738376617, 'kl/beta': 0.008103672415018082, 'kl/avg_steps': 0.65625, 'epoch': 0.6} 60%|█████████████████████████████████████████████▍ | 407/681 [30:52<1:08:38, 15.03s/it] 60%|██████████████████████████████████████████████▋ | 408/681 [30:55<51:57, 11.42s/it] {'loss': 0.9228, 'grad_norm': 45.580322265625, 'learning_rate': 2.0911786638150872e-07, 'rewards/chosen': -1.4534653425216675, 'rewards/rejected': -2.249375820159912, 'rewards/accuracies': 0.75, 'rewards/margins': 0.7959102988243103, 'logps/chosen': -250.8351593017578, 'logps/rejected': -371.2132568359375, 'logps/ref_chosen': -69.789306640625, 'logps/ref_rejected': -90.09693908691406, 'logits/chosen': 0.09460186958312988, 'logits/rejected': 0.6105620265007019, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.008008869364857674, 'epsilon_dpo/loss_margin_mean': 100.0704574584961, 'epsilon_dpo/beta_margin_mean': 0.7959102988243103, 'epsilon_dpo/beta_margin_std': 0.9429380893707275, 'epsilon_dpo/beta_margin_grad_mean': -0.34042710065841675, 'epsilon_dpo/beta_margin_grad_std': 0.1824430227279663, 'kl/beta': 0.008050838485360146, 'kl/avg_steps': 0.53125, 'epoch': 0.6} 60%|██████████████████████████████████████████████▋ | 408/681 [30:55<51:57, 11.42s/it] 60%|██████████████████████████████████████████████▊ | 409/681 [30:58<39:59, 8.82s/it] {'loss': 0.7636, 'grad_norm': 56.11100387573242, 'learning_rate': 2.0785235566757517e-07, 'rewards/chosen': -1.3948218822479248, 'rewards/rejected': -2.4308199882507324, 'rewards/accuracies': 0.875, 'rewards/margins': 1.0359981060028076, 'logps/chosen': -242.51821899414062, 'logps/rejected': -390.8831481933594, 'logps/ref_chosen': -67.31744384765625, 'logps/ref_rejected': -84.904296875, 
'logits/chosen': 0.3741447925567627, 'logits/rejected': 0.6117522716522217, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.007951530627906322, 'epsilon_dpo/loss_margin_mean': 130.778076171875, 'epsilon_dpo/beta_margin_mean': 1.0359981060028076, 'epsilon_dpo/beta_margin_std': 0.9457764029502869, 'epsilon_dpo/beta_margin_grad_mean': -0.2959141135215759, 'epsilon_dpo/beta_margin_grad_std': 0.16134285926818848, 'kl/beta': 0.008008294738829136, 'kl/avg_steps': 0.71875, 'epoch': 0.6} 60%|██████████████████████████████████████████████▊ | 409/681 [30:58<39:59, 8.82s/it] 60%|██████████████████████████████████████████████▉ | 410/681 [31:01<31:30, 6.98s/it] {'loss': 0.8466, 'grad_norm': 44.16848373413086, 'learning_rate': 2.065879555832674e-07, 'rewards/chosen': -1.4710114002227783, 'rewards/rejected': -2.3844408988952637, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9134294986724854, 'logps/chosen': -237.3287353515625, 'logps/rejected': -385.25921630859375, 'logps/ref_chosen': -51.465354919433594, 'logps/ref_rejected': -83.198974609375, 'logits/chosen': 0.4321461319923401, 'logits/rejected': 0.5176758766174316, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.007902241311967373, 'epsilon_dpo/loss_margin_mean': 116.19683837890625, 'epsilon_dpo/beta_margin_mean': 0.9134294986724854, 'epsilon_dpo/beta_margin_std': 0.9596654176712036, 'epsilon_dpo/beta_margin_grad_mean': -0.3191192150115967, 'epsilon_dpo/beta_margin_grad_std': 0.17301428318023682, 'kl/beta': 0.007951145060360432, 'kl/avg_steps': 0.625, 'epoch': 0.6} 60%|██████████████████████████████████████████████▉ | 410/681 [31:01<31:30, 6.98s/it] 60%|███████████████████████████████████████████████ | 411/681 [31:03<25:04, 5.57s/it] {'loss': 0.8034, 'grad_norm': 51.007633209228516, 'learning_rate': 2.0532469944670343e-07, 'rewards/chosen': -1.5672738552093506, 'rewards/rejected': -2.579244613647461, 'rewards/accuracies': 0.859375, 'rewards/margins': 
1.0119706392288208, 'logps/chosen': -251.80374145507812, 'logps/rejected': -409.6463623046875, 'logps/ref_chosen': -52.30727005004883, 'logps/ref_rejected': -80.69495391845703, 'logits/chosen': 0.5662134885787964, 'logits/rejected': 0.8120511770248413, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.007848219946026802, 'epsilon_dpo/loss_margin_mean': 129.45492553710938, 'epsilon_dpo/beta_margin_mean': 1.0119707584381104, 'epsilon_dpo/beta_margin_std': 1.0637365579605103, 'epsilon_dpo/beta_margin_grad_mean': -0.3073974549770355, 'epsilon_dpo/beta_margin_grad_std': 0.17039382457733154, 'kl/beta': 0.007901759818196297, 'kl/avg_steps': 0.6875, 'epoch': 0.6} 60%|███████████████████████████████████████████████ | 411/681 [31:03<25:04, 5.57s/it] 60%|███████████████████████████████████████████████▏ | 412/681 [31:05<20:43, 4.62s/it] {'loss': 0.8768, 'grad_norm': 55.60865783691406, 'learning_rate': 2.0406262054585738e-07, 'rewards/chosen': -1.5729925632476807, 'rewards/rejected': -2.4537506103515625, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8807581067085266, 'logps/chosen': -254.76193237304688, 'logps/rejected': -415.2494201660156, 'logps/ref_chosen': -53.144126892089844, 'logps/ref_rejected': -100.06080627441406, 'logits/chosen': 0.4706282317638397, 'logits/rejected': 0.5481432676315308, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.00779463117942214, 'epsilon_dpo/loss_margin_mean': 113.57080841064453, 'epsilon_dpo/beta_margin_mean': 0.8807581067085266, 'epsilon_dpo/beta_margin_std': 0.9668706059455872, 'epsilon_dpo/beta_margin_grad_mean': -0.3244841396808624, 'epsilon_dpo/beta_margin_grad_std': 0.18310624361038208, 'kl/beta': 0.007847805507481098, 'kl/avg_steps': 0.6875, 'epoch': 0.6} 60%|███████████████████████████████████████████████▏ | 412/681 [31:05<20:43, 4.62s/it] 61%|███████████████████████████████████████████████▎ | 413/681 [31:08<18:11, 4.07s/it] {'loss': 0.8106, 'grad_norm': 
51.80922317504883, 'learning_rate': 2.0280175213768205e-07, 'rewards/chosen': -1.6561181545257568, 'rewards/rejected': -2.5851690769195557, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.9290508031845093, 'logps/chosen': -275.64434814453125, 'logps/rejected': -434.048095703125, 'logps/ref_chosen': -61.58196258544922, 'logps/ref_rejected': -99.47340393066406, 'logits/chosen': 0.41288986802101135, 'logits/rejected': 0.551357626914978, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.007729229982942343, 'epsilon_dpo/loss_margin_mean': 120.51229858398438, 'epsilon_dpo/beta_margin_mean': 0.929050862789154, 'epsilon_dpo/beta_margin_std': 0.8711987733840942, 'epsilon_dpo/beta_margin_grad_mean': -0.3099435269832611, 'epsilon_dpo/beta_margin_grad_std': 0.16134311258792877, 'kl/beta': 0.007794220466166735, 'kl/avg_steps': 0.84375, 'epoch': 0.61} 61%|███████████████████████████████████████████████▎ | 413/681 [31:08<18:11, 4.07s/it] 61%|███████████████████████████████████████████████▍ | 414/681 [31:11<16:13, 3.64s/it] {'loss': 0.7247, 'grad_norm': 51.93803024291992, 'learning_rate': 2.0154212744723247e-07, 'rewards/chosen': -1.4886877536773682, 'rewards/rejected': -2.5978167057037354, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.1091289520263672, 'logps/chosen': -240.62554931640625, 'logps/rejected': -426.708251953125, 'logps/ref_chosen': -46.63148880004883, 'logps/ref_rejected': -87.64652252197266, 'logits/chosen': 0.5738648176193237, 'logits/rejected': 0.8426915407180786, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.007666975259780884, 'epsilon_dpo/loss_margin_mean': 145.06768798828125, 'epsilon_dpo/beta_margin_mean': 1.1091288328170776, 'epsilon_dpo/beta_margin_std': 0.9259233474731445, 'epsilon_dpo/beta_margin_grad_mean': -0.27944430708885193, 'epsilon_dpo/beta_margin_grad_std': 0.16479705274105072, 'kl/beta': 0.007729006931185722, 'kl/avg_steps': 0.8125, 'epoch': 0.61} 
61%|███████████████████████████████████████████████▍ | 414/681 [31:11<16:13, 3.64s/it] 61%|███████████████████████████████████████████████▌ | 415/681 [31:14<15:07, 3.41s/it] {'loss': 0.8712, 'grad_norm': 43.91434860229492, 'learning_rate': 2.002837796667909e-07, 'rewards/chosen': -1.6471409797668457, 'rewards/rejected': -2.5031731128692627, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8560322523117065, 'logps/chosen': -294.5069580078125, 'logps/rejected': -429.3173522949219, 'logps/ref_chosen': -78.6182861328125, 'logps/ref_rejected': -100.47752380371094, 'logits/chosen': 0.23712509870529175, 'logits/rejected': 0.4227851331233978, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.007614767644554377, 'epsilon_dpo/loss_margin_mean': 112.95114135742188, 'epsilon_dpo/beta_margin_mean': 0.8560322523117065, 'epsilon_dpo/beta_margin_std': 0.9299007058143616, 'epsilon_dpo/beta_margin_grad_mean': -0.32948988676071167, 'epsilon_dpo/beta_margin_grad_std': 0.16906966269016266, 'kl/beta': 0.007666714955121279, 'kl/avg_steps': 0.6875, 'epoch': 0.61} 61%|███████████████████████████████████████████████▌ | 415/681 [31:14<15:07, 3.41s/it] 61%|███████████████████████████████████████████████▋ | 416/681 [31:16<14:03, 3.18s/it] {'loss': 0.6837, 'grad_norm': 66.8626708984375, 'learning_rate': 1.990267419549914e-07, 'rewards/chosen': -1.5257494449615479, 'rewards/rejected': -2.7238240242004395, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.1980746984481812, 'logps/chosen': -260.194091796875, 'logps/rejected': -451.4803466796875, 'logps/ref_chosen': -58.27912521362305, 'logps/ref_rejected': -90.56871795654297, 'logits/chosen': 0.32131990790367126, 'logits/rejected': 0.6265676021575928, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.007555634714663029, 'epsilon_dpo/loss_margin_mean': 158.9966583251953, 'epsilon_dpo/beta_margin_mean': 1.1980746984481812, 'epsilon_dpo/beta_margin_std': 0.9696047306060791, 
'epsilon_dpo/beta_margin_grad_mean': -0.26847735047340393, 'epsilon_dpo/beta_margin_grad_std': 0.16211014986038208, 'kl/beta': 0.007614366244524717, 'kl/avg_steps': 0.78125, 'epoch': 0.61} 61%|███████████████████████████████████████████████▋ | 416/681 [31:16<14:03, 3.18s/it] 61%|███████████████████████████████████████████████▊ | 417/681 [31:19<13:00, 2.96s/it] {'loss': 0.6667, 'grad_norm': 47.27936553955078, 'learning_rate': 1.9777104743594686e-07, 'rewards/chosen': -1.6028650999069214, 'rewards/rejected': -2.7524828910827637, 'rewards/accuracies': 0.9375, 'rewards/margins': 1.1496177911758423, 'logps/chosen': -264.1747131347656, 'logps/rejected': -435.85498046875, 'logps/ref_chosen': -50.1987190246582, 'logps/ref_rejected': -68.15184020996094, 'logits/chosen': 0.5093731880187988, 'logits/rejected': 1.3290538787841797, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.007487618364393711, 'epsilon_dpo/loss_margin_mean': 153.72714233398438, 'epsilon_dpo/beta_margin_mean': 1.1496177911758423, 'epsilon_dpo/beta_margin_std': 0.8150914907455444, 'epsilon_dpo/beta_margin_grad_mean': -0.2677600085735321, 'epsilon_dpo/beta_margin_grad_std': 0.141451895236969, 'kl/beta': 0.007555339951068163, 'kl/avg_steps': 0.90625, 'epoch': 0.61} 61%|███████████████████████████████████████████████▊ | 417/681 [31:19<13:00, 2.96s/it] 61%|███████████████████████████████████████████████▉ | 418/681 [31:22<12:46, 2.91s/it] {'loss': 0.8347, 'grad_norm': 49.26523971557617, 'learning_rate': 1.965167291983757e-07, 'rewards/chosen': -1.627633810043335, 'rewards/rejected': -2.6028285026550293, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.9751946926116943, 'logps/chosen': -300.4624328613281, 'logps/rejected': -454.90142822265625, 'logps/ref_chosen': -81.97846984863281, 'logps/ref_rejected': -104.69148254394531, 'logits/chosen': 0.11700999736785889, 'logits/rejected': 0.5254498720169067, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 
'epsilon_dpo/beta': 0.00744143221527338, 'epsilon_dpo/loss_margin_mean': 131.72601318359375, 'epsilon_dpo/beta_margin_mean': 0.9751946926116943, 'epsilon_dpo/beta_margin_std': 1.0679258108139038, 'epsilon_dpo/beta_margin_grad_mean': -0.31553393602371216, 'epsilon_dpo/beta_margin_grad_std': 0.17717210948467255, 'kl/beta': 0.007487484719604254, 'kl/avg_steps': 0.625, 'epoch': 0.61} 61%|███████████████████████████████████████████████▉ | 418/681 [31:22<12:46, 2.91s/it] 62%|███████████████████████████████████████████████▉ | 419/681 [31:24<12:43, 2.92s/it] {'loss': 0.7559, 'grad_norm': 46.66392135620117, 'learning_rate': 1.9526382029472988e-07, 'rewards/chosen': -1.427504062652588, 'rewards/rejected': -2.5086851119995117, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0811809301376343, 'logps/chosen': -245.95651245117188, 'logps/rejected': -431.462890625, 'logps/ref_chosen': -52.948646545410156, 'logps/ref_rejected': -91.58309936523438, 'logits/chosen': 0.3658443093299866, 'logits/rejected': 0.5743440389633179, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.007385909557342529, 'epsilon_dpo/loss_margin_mean': 146.87191772460938, 'epsilon_dpo/beta_margin_mean': 1.0811809301376343, 'epsilon_dpo/beta_margin_std': 0.9915341734886169, 'epsilon_dpo/beta_margin_grad_mean': -0.2904692590236664, 'epsilon_dpo/beta_margin_grad_std': 0.17036795616149902, 'kl/beta': 0.0074409786611795425, 'kl/avg_steps': 0.75, 'epoch': 0.62} 62%|███████████████████████████████████████████████▉ | 419/681 [31:24<12:43, 2.92s/it] 62%|████████████████████████████████████████████████ | 420/681 [31:27<12:17, 2.83s/it] {'loss': 0.8551, 'grad_norm': 55.739376068115234, 'learning_rate': 1.9401235374032425e-07, 'rewards/chosen': -1.4773731231689453, 'rewards/rejected': -2.4127635955810547, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9353904724121094, 'logps/chosen': -278.3549499511719, 'logps/rejected': -397.9765319824219, 'logps/ref_chosen': -77.7699203491211, 
'logps/ref_rejected': -69.31985473632812, 'logits/chosen': 0.14379703998565674, 'logits/rejected': 0.9165039658546448, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.007344777230173349, 'epsilon_dpo/loss_margin_mean': 128.07164001464844, 'epsilon_dpo/beta_margin_mean': 0.9353904724121094, 'epsilon_dpo/beta_margin_std': 1.032126545906067, 'epsilon_dpo/beta_margin_grad_mean': -0.31808242201805115, 'epsilon_dpo/beta_margin_grad_std': 0.17818176746368408, 'kl/beta': 0.007385586854070425, 'kl/avg_steps': 0.5625, 'epoch': 0.62} 62%|████████████████████████████████████████████████ | 420/681 [31:27<12:17, 2.83s/it] 62%|████████████████████████████████████████████████▏ | 421/681 [31:30<12:05, 2.79s/it] {'loss': 0.937, 'grad_norm': 60.28358840942383, 'learning_rate': 1.9276236251246653e-07, 'rewards/chosen': -1.3415601253509521, 'rewards/rejected': -2.1143455505371094, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.7727855443954468, 'logps/chosen': -237.02256774902344, 'logps/rejected': -379.0960998535156, 'logps/ref_chosen': -53.765865325927734, 'logps/ref_rejected': -89.28144836425781, 'logits/chosen': 0.2799881100654602, 'logits/rejected': 0.49353882670402527, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.007305989041924477, 'epsilon_dpo/loss_margin_mean': 106.55794525146484, 'epsilon_dpo/beta_margin_mean': 0.7727855443954468, 'epsilon_dpo/beta_margin_std': 0.9297990798950195, 'epsilon_dpo/beta_margin_grad_mean': -0.34329816699028015, 'epsilon_dpo/beta_margin_grad_std': 0.1848863959312439, 'kl/beta': 0.007344275247305632, 'kl/avg_steps': 0.53125, 'epoch': 0.62} 62%|████████████████████████████████████████████████▏ | 421/681 [31:30<12:05, 2.79s/it] 62%|████████████████████████████████████████████████▎ | 422/681 [31:33<12:12, 2.83s/it] {'loss': 0.8759, 'grad_norm': 49.03010559082031, 'learning_rate': 1.9151387954958792e-07, 'rewards/chosen': -1.4108562469482422, 'rewards/rejected': 
-2.269425392150879, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8585691452026367, 'logps/chosen': -262.7305603027344, 'logps/rejected': -400.9112548828125, 'logps/ref_chosen': -68.6337661743164, 'logps/ref_rejected': -87.86351013183594, 'logits/chosen': 0.20186124742031097, 'logits/rejected': 0.5440545082092285, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.00725596584379673, 'epsilon_dpo/loss_margin_mean': 118.95094299316406, 'epsilon_dpo/beta_margin_mean': 0.8585691452026367, 'epsilon_dpo/beta_margin_std': 0.9232885837554932, 'epsilon_dpo/beta_margin_grad_mean': -0.32523876428604126, 'epsilon_dpo/beta_margin_grad_std': 0.17791172862052917, 'kl/beta': 0.007305465172976255, 'kl/avg_steps': 0.6875, 'epoch': 0.62} 62%|████████████████████████████████████████████████▎ | 422/681 [31:33<12:12, 2.83s/it] 62%|████████████████████████████████████████████████▍ | 423/681 [31:35<11:42, 2.72s/it] {'loss': 0.8532, 'grad_norm': 70.25402069091797, 'learning_rate': 1.902669377503756e-07, 'rewards/chosen': -1.307551622390747, 'rewards/rejected': -2.202369213104248, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.894817590713501, 'logps/chosen': -236.0745849609375, 'logps/rejected': -392.15887451171875, 'logps/ref_chosen': -54.99030303955078, 'logps/ref_rejected': -86.30654907226562, 'logits/chosen': 0.34702229499816895, 'logits/rejected': 0.37363946437835693, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0072086891159415245, 'epsilon_dpo/loss_margin_mean': 124.76802062988281, 'epsilon_dpo/beta_margin_mean': 0.894817590713501, 'epsilon_dpo/beta_margin_std': 0.9295312166213989, 'epsilon_dpo/beta_margin_grad_mean': -0.32009923458099365, 'epsilon_dpo/beta_margin_grad_std': 0.1750791072845459, 'kl/beta': 0.007255583070218563, 'kl/avg_steps': 0.65625, 'epoch': 0.62} 62%|████████████████████████████████████████████████▍ | 423/681 [31:35<11:42, 2.72s/it] 
62%|████████████████████████████████████████████████▌ | 424/681 [31:38<11:42, 2.73s/it] {'loss': 0.8175, 'grad_norm': 69.93016815185547, 'learning_rate': 1.890215699729057e-07, 'rewards/chosen': -1.1121392250061035, 'rewards/rejected': -2.0356974601745605, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9235584139823914, 'logps/chosen': -211.23085021972656, 'logps/rejected': -351.2352294921875, 'logps/ref_chosen': -56.01191711425781, 'logps/ref_rejected': -66.47896575927734, 'logits/chosen': 0.17603448033332825, 'logits/rejected': 0.7972570657730103, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.007159437518566847, 'epsilon_dpo/loss_margin_mean': 129.53733825683594, 'epsilon_dpo/beta_margin_mean': 0.9235584139823914, 'epsilon_dpo/beta_margin_std': 0.9231213331222534, 'epsilon_dpo/beta_margin_grad_mean': -0.31579458713531494, 'epsilon_dpo/beta_margin_grad_std': 0.15702416002750397, 'kl/beta': 0.007208278402686119, 'kl/avg_steps': 0.6875, 'epoch': 0.62} 62%|████████████████████████████████████████████████▌ | 424/681 [31:38<11:42, 2.73s/it] 62%|████████████████████████████████████████████████▋ | 425/681 [31:41<11:31, 2.70s/it] {'loss': 0.884, 'grad_norm': 50.22777557373047, 'learning_rate': 1.8777780903377732e-07, 'rewards/chosen': -1.2333381175994873, 'rewards/rejected': -2.0357167720794678, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8023786544799805, 'logps/chosen': -219.85743713378906, 'logps/rejected': -382.3094482421875, 'logps/ref_chosen': -46.868995666503906, 'logps/ref_rejected': -95.92545318603516, 'logits/chosen': 0.22520506381988525, 'logits/rejected': 0.28665509819984436, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.007112789899110794, 'epsilon_dpo/loss_margin_mean': 113.39553833007812, 'epsilon_dpo/beta_margin_mean': 0.8023787140846252, 'epsilon_dpo/beta_margin_std': 0.8346377015113831, 'epsilon_dpo/beta_margin_grad_mean': -0.3333708941936493, 
'epsilon_dpo/beta_margin_grad_std': 0.16707682609558105, 'kl/beta': 0.00715905986726284, 'kl/avg_steps': 0.65625, 'epoch': 0.62} 62%|████████████████████████████████████████████████▋ | 425/681 [31:41<11:31, 2.70s/it] 63%|████████████████████████████████████████████████▊ | 426/681 [31:43<11:37, 2.73s/it] {'loss': 0.7755, 'grad_norm': 60.108436584472656, 'learning_rate': 1.8653568770724803e-07, 'rewards/chosen': -1.1039516925811768, 'rewards/rejected': -2.1171505451202393, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.013198733329773, 'logps/chosen': -232.7984619140625, 'logps/rejected': -381.5323486328125, 'logps/ref_chosen': -76.58354187011719, 'logps/ref_rejected': -81.26658630371094, 'logits/chosen': -0.021541226655244827, 'logits/rejected': 0.46378132700920105, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.007057525217533112, 'epsilon_dpo/loss_margin_mean': 144.0508575439453, 'epsilon_dpo/beta_margin_mean': 1.013198733329773, 'epsilon_dpo/beta_margin_std': 0.9114711284637451, 'epsilon_dpo/beta_margin_grad_mean': -0.2971186637878418, 'epsilon_dpo/beta_margin_grad_std': 0.16744133830070496, 'kl/beta': 0.007112384773790836, 'kl/avg_steps': 0.78125, 'epoch': 0.63} 63%|████████████████████████████████████████████████▊ | 426/681 [31:43<11:37, 2.73s/it] 63%|████████████████████████████████████████████████▉ | 427/681 [31:46<11:33, 2.73s/it] {'loss': 0.919, 'grad_norm': 97.97515869140625, 'learning_rate': 1.8529523872436977e-07, 'rewards/chosen': -1.1492910385131836, 'rewards/rejected': -1.8683075904846191, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.7190166711807251, 'logps/chosen': -228.42538452148438, 'logps/rejected': -345.14276123046875, 'logps/ref_chosen': -64.8538818359375, 'logps/ref_rejected': -78.56600952148438, 'logits/chosen': 0.024022696539759636, 'logits/rejected': 0.4438282549381256, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.007016048766672611, 
'epsilon_dpo/loss_margin_mean': 103.0052261352539, 'epsilon_dpo/beta_margin_mean': 0.7190166711807251, 'epsilon_dpo/beta_margin_std': 0.790506899356842, 'epsilon_dpo/beta_margin_grad_mean': -0.3493276834487915, 'epsilon_dpo/beta_margin_grad_std': 0.15186835825443268, 'kl/beta': 0.007057250011712313, 'kl/avg_steps': 0.59375, 'epoch': 0.63} 63%|████████████████████████████████████████████████▉ | 427/681 [31:46<11:33, 2.73s/it] 63%|█████████████████████████████████████████████████ | 428/681 [31:49<11:37, 2.76s/it] {'loss': 0.8578, 'grad_norm': 56.44048309326172, 'learning_rate': 1.8405649477212697e-07, 'rewards/chosen': -1.3960375785827637, 'rewards/rejected': -2.2945446968078613, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8985069990158081, 'logps/chosen': -262.54144287109375, 'logps/rejected': -432.73614501953125, 'logps/ref_chosen': -62.63666534423828, 'logps/ref_rejected': -103.28182220458984, 'logits/chosen': 0.20027443766593933, 'logits/rejected': 0.22859197854995728, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.006972445175051689, 'epsilon_dpo/loss_margin_mean': 129.54954528808594, 'epsilon_dpo/beta_margin_mean': 0.8985069394111633, 'epsilon_dpo/beta_margin_std': 0.9505627751350403, 'epsilon_dpo/beta_margin_grad_mean': -0.3201913833618164, 'epsilon_dpo/beta_margin_grad_std': 0.1774386167526245, 'kl/beta': 0.0070155952125787735, 'kl/avg_steps': 0.625, 'epoch': 0.63} 63%|█████████████████████████████████████████████████ | 428/681 [31:49<11:37, 2.76s/it] 63%|█████████████████████████████████████████████████▏ | 429/681 [31:52<11:33, 2.75s/it] {'loss': 0.8753, 'grad_norm': 57.67830276489258, 'learning_rate': 1.828194884925749e-07, 'rewards/chosen': -1.4090235233306885, 'rewards/rejected': -2.2722792625427246, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.8632558584213257, 'logps/chosen': -284.10809326171875, 'logps/rejected': -419.96881103515625, 'logps/ref_chosen': -81.23401641845703, 'logps/ref_rejected': 
-91.79493713378906, 'logits/chosen': 0.008652932941913605, 'logits/rejected': 0.5683068037033081, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.006933495402336121, 'epsilon_dpo/loss_margin_mean': 125.29979705810547, 'epsilon_dpo/beta_margin_mean': 0.8632559180259705, 'epsilon_dpo/beta_margin_std': 0.9289405941963196, 'epsilon_dpo/beta_margin_grad_mean': -0.32588937878608704, 'epsilon_dpo/beta_margin_grad_std': 0.17860379815101624, 'kl/beta': 0.006972020026296377, 'kl/avg_steps': 0.5625, 'epoch': 0.63} 63%|█████████████████████████████████████████████████▏ | 429/681 [31:52<11:33, 2.75s/it] 63%|█████████████████████████████████████████████████▎ | 430/681 [31:55<12:04, 2.89s/it] {'loss': 0.8641, 'grad_norm': 52.97359848022461, 'learning_rate': 1.8158425248197928e-07, 'rewards/chosen': -1.3209152221679688, 'rewards/rejected': -2.146092414855957, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8251773118972778, 'logps/chosen': -252.5347442626953, 'logps/rejected': -416.38153076171875, 'logps/ref_chosen': -60.92032241821289, 'logps/ref_rejected': -104.42280578613281, 'logits/chosen': 0.1370442807674408, 'logits/rejected': 0.24206362664699554, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.006888212636113167, 'epsilon_dpo/loss_margin_mean': 120.34428405761719, 'epsilon_dpo/beta_margin_mean': 0.8251773118972778, 'epsilon_dpo/beta_margin_std': 0.8462843298912048, 'epsilon_dpo/beta_margin_grad_mean': -0.33034777641296387, 'epsilon_dpo/beta_margin_grad_std': 0.1568160504102707, 'kl/beta': 0.006933021824806929, 'kl/avg_steps': 0.65625, 'epoch': 0.63} 63%|█████████████████████████████████████████████████▎ | 430/681 [31:55<12:04, 2.89s/it] 63%|█████████████████████████████████████████████████▎ | 431/681 [31:58<12:09, 2.92s/it] {'loss': 0.7734, 'grad_norm': 39.05802536010742, 'learning_rate': 1.8035081928995788e-07, 'rewards/chosen': -1.3892830610275269, 'rewards/rejected': -2.426692008972168, 
'rewards/accuracies': 0.890625, 'rewards/margins': 1.0374089479446411, 'logps/chosen': -260.46075439453125, 'logps/rejected': -448.2281799316406, 'logps/ref_chosen': -57.348751068115234, 'logps/ref_rejected': -92.84022521972656, 'logits/chosen': 0.4554019570350647, 'logits/rejected': 0.5327665209770203, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.006834692787379026, 'epsilon_dpo/loss_margin_mean': 152.2759552001953, 'epsilon_dpo/beta_margin_mean': 1.0374089479446411, 'epsilon_dpo/beta_margin_std': 0.9661343693733215, 'epsilon_dpo/beta_margin_grad_mean': -0.29644519090652466, 'epsilon_dpo/beta_margin_grad_std': 0.16910652816295624, 'kl/beta': 0.006887820549309254, 'kl/avg_steps': 0.78125, 'epoch': 0.63}
 63%| 431/681 [31:58<12:09, 2.92s/it]
 63%| 432/681 [32:01<12:04, 2.91s/it] {'loss': 0.7803, 'grad_norm': 32.576263427734375, 'learning_rate': 1.791192214186223e-07, 'rewards/chosen': -1.2381863594055176, 'rewards/rejected': -2.2144782543182373, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9762920141220093, 'logps/chosen': -253.63150024414062, 'logps/rejected': -425.50799560546875, 'logps/ref_chosen': -71.07479095458984, 'logps/ref_rejected': -98.57952880859375, 'logits/chosen': 0.02793058566749096, 'logits/rejected': 0.38804006576538086, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.006779574789106846, 'epsilon_dpo/loss_margin_mean': 144.37176513671875, 'epsilon_dpo/beta_margin_mean': 0.9762919545173645, 'epsilon_dpo/beta_margin_std': 0.9004828333854675, 'epsilon_dpo/beta_margin_grad_mean': -0.30400457978248596, 'epsilon_dpo/beta_margin_grad_std': 0.15248362720012665, 'kl/beta': 0.006834426429122686, 'kl/avg_steps': 0.8125, 'epoch': 0.63}
 64%| 433/681 [32:03<11:34, 2.80s/it] {'loss': 0.9669, 'grad_norm': 44.58323287963867, 'learning_rate': 1.7788949132172193e-07, 'rewards/chosen': -1.487854242324829, 'rewards/rejected': -2.19893217086792, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.7110780477523804, 'logps/chosen': -278.4117431640625, 'logps/rejected': -422.3763122558594, 'logps/ref_chosen': -58.273193359375, 'logps/ref_rejected': -95.95089721679688, 'logits/chosen': 0.4049626588821411, 'logits/rejected': 0.5291376113891602, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.006746122147887945, 'epsilon_dpo/loss_margin_mean': 106.28688049316406, 'epsilon_dpo/beta_margin_mean': 0.7110780477523804, 'epsilon_dpo/beta_margin_std': 0.8855108618736267, 'epsilon_dpo/beta_margin_grad_mean': -0.35162287950515747, 'epsilon_dpo/beta_margin_grad_std': 0.183163121342659, 'kl/beta': 0.0067793442867696285, 'kl/avg_steps': 0.5, 'epoch': 0.64}
 64%| 434/681 [32:06<11:24, 2.77s/it] {'loss': 0.8566, 'grad_norm': 35.67780303955078, 'learning_rate': 1.7666166140378853e-07, 'rewards/chosen': -1.28713059425354, 'rewards/rejected': -2.1034226417541504, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8162918090820312, 'logps/chosen': -254.0223388671875, 'logps/rejected': -392.8670349121094, 'logps/ref_chosen': -61.97370147705078, 'logps/ref_rejected': -78.49861145019531, 'logits/chosen': 0.21431401371955872, 'logits/rejected': 0.6656895875930786, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0066956933587789536, 'epsilon_dpo/loss_margin_mean': 122.31977844238281, 'epsilon_dpo/beta_margin_mean': 0.816291868686676, 'epsilon_dpo/beta_margin_std': 0.822716474533081, 'epsilon_dpo/beta_margin_grad_mean': -0.3319990634918213, 'epsilon_dpo/beta_margin_grad_std': 0.14507560431957245, 'kl/beta': 0.006745615974068642, 'kl/avg_steps': 0.75, 'epoch': 0.64}
 64%| 435/681 [32:08<10:49, 2.64s/it] {'loss': 0.8221, 'grad_norm': 33.875038146972656, 'learning_rate': 1.7543576401928218e-07, 'rewards/chosen': -1.1853218078613281, 'rewards/rejected': -2.125612735748291, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9402910470962524, 'logps/chosen': -229.49069213867188, 'logps/rejected': -407.580322265625, 'logps/ref_chosen': -51.502052307128906, 'logps/ref_rejected': -87.56689453125, 'logits/chosen': 0.3405795097351074, 'logits/rejected': 0.5714382529258728, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.006650034803897142, 'epsilon_dpo/loss_margin_mean': 142.02479553222656, 'epsilon_dpo/beta_margin_mean': 0.9402909874916077, 'epsilon_dpo/beta_margin_std': 0.9170975685119629, 'epsilon_dpo/beta_margin_grad_mean': -0.3107792139053345, 'epsilon_dpo/beta_margin_grad_std': 0.1730233132839203, 'kl/beta': 0.006695400923490524, 'kl/avg_steps': 0.6875, 'epoch': 0.64}
 64%| 436/681 [32:11<10:55, 2.67s/it] {'loss': 0.7874, 'grad_norm': 34.9669189453125, 'learning_rate': 1.742118314717391e-07, 'rewards/chosen': -1.181276798248291, 'rewards/rejected': -2.11148738861084, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9302107095718384, 'logps/chosen': -250.2595672607422, 'logps/rejected': -402.91571044921875, 'logps/ref_chosen': -71.40371704101562, 'logps/ref_rejected': -82.72775268554688, 'logits/chosen': 0.09514103829860687, 'logits/rejected': 0.6070296168327332, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.006600471679121256, 'epsilon_dpo/loss_margin_mean': 141.33209228515625, 'epsilon_dpo/beta_margin_mean': 0.9302107095718384, 'epsilon_dpo/beta_margin_std': 0.8247028589248657, 'epsilon_dpo/beta_margin_grad_mean': -0.30987975001335144, 'epsilon_dpo/beta_margin_grad_std': 0.1436336487531662, 'kl/beta': 0.0066496841609478, 'kl/avg_steps': 0.75, 'epoch': 0.64}
 64%| 437/681 [32:14<11:07, 2.73s/it] {'loss': 0.8502, 'grad_norm': 41.57432174682617, 'learning_rate': 1.7298989601292036e-07, 'rewards/chosen': -1.1652759313583374, 'rewards/rejected': -2.0105206966400146, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8452447652816772, 'logps/chosen': -242.2954864501953, 'logps/rejected': -389.12548828125, 'logps/ref_chosen': -64.7442626953125, 'logps/ref_rejected': -82.04356384277344, 'logits/chosen': 0.26234185695648193, 'logits/rejected': 0.6462544798851013, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.006555462256073952, 'epsilon_dpo/loss_margin_mean': 129.53070068359375, 'epsilon_dpo/beta_margin_mean': 0.8452447652816772, 'epsilon_dpo/beta_margin_std': 0.8253268599510193, 'epsilon_dpo/beta_margin_grad_mean': -0.32474982738494873, 'epsilon_dpo/beta_margin_grad_std': 0.15997685492038727, 'kl/beta': 0.006600182969123125, 'kl/avg_steps': 0.6875, 'epoch': 0.64}
 64%| 438/681 [32:16<10:48, 2.67s/it] {'loss': 0.7415, 'grad_norm': 32.169349670410156, 'learning_rate': 1.7176998984196144e-07, 'rewards/chosen': -1.136362075805664, 'rewards/rejected': -2.136415958404541, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.0000540018081665, 'logps/chosen': -233.89158630371094, 'logps/rejected': -412.0855712890625, 'logps/ref_chosen': -59.0186653137207, 'logps/ref_rejected': -83.07682037353516, 'logits/chosen': 0.2299380898475647, 'logits/rejected': 0.6087510585784912,
'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.006496360059827566, 'epsilon_dpo/loss_margin_mean': 154.13580322265625, 'epsilon_dpo/beta_margin_mean': 1.0000540018081665, 'epsilon_dpo/beta_margin_std': 0.804446816444397, 'epsilon_dpo/beta_margin_grad_mean': -0.2956306040287018, 'epsilon_dpo/beta_margin_grad_std': 0.1379927545785904, 'kl/beta': 0.006555116269737482, 'kl/avg_steps': 0.90625, 'epoch': 0.64}
 64%| 439/681 [32:19<10:36, 2.63s/it] {'loss': 0.8931, 'grad_norm': 40.5095100402832, 'learning_rate': 1.7055214510452458e-07, 'rewards/chosen': -1.1882553100585938, 'rewards/rejected': -1.961783528327942, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7735282182693481, 'logps/chosen': -237.7843475341797, 'logps/rejected': -388.39080810546875, 'logps/ref_chosen': -53.784080505371094, 'logps/ref_rejected': -83.98545837402344, 'logits/chosen': 0.3964412212371826, 'logits/rejected': 0.430361270904541, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.00645222794264555, 'epsilon_dpo/loss_margin_mean': 120.40505981445312, 'epsilon_dpo/beta_margin_mean': 0.7735282182693481, 'epsilon_dpo/beta_margin_std': 0.8291470408439636, 'epsilon_dpo/beta_margin_grad_mean': -0.34073659777641296, 'epsilon_dpo/beta_margin_grad_std': 0.15491709113121033, 'kl/beta': 0.0064962441101670265, 'kl/avg_steps': 0.6875, 'epoch': 0.64}
 65%| 440/681 [32:21<10:22, 2.58s/it] {'loss': 0.9764, 'grad_norm': 48.05801010131836, 'learning_rate': 1.6933639389195134e-07, 'rewards/chosen': -1.2052370309829712, 'rewards/rejected': -1.878598928451538, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6733620166778564, 'logps/chosen': -265.9714050292969, 'logps/rejected': -389.5462341308594, 'logps/ref_chosen': -78.56671905517578, 'logps/ref_rejected': -96.49775695800781, 'logits/chosen': 0.07137221843004227, 'logits/rejected': 0.40858763456344604, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.006414221134036779, 'epsilon_dpo/loss_margin_mean': 105.64379119873047, 'epsilon_dpo/beta_margin_mean': 0.6733620166778564, 'epsilon_dpo/beta_margin_std': 0.8658587336540222, 'epsilon_dpo/beta_margin_grad_mean': -0.36195108294487, 'epsilon_dpo/beta_margin_grad_std': 0.16643108427524567, 'kl/beta': 0.006451887544244528, 'kl/avg_steps': 0.59375, 'epoch': 0.65}
 65%| 441/681 [32:24<10:43, 2.68s/it] {'loss': 0.8885, 'grad_norm': 43.710899353027344, 'learning_rate': 1.681227682404166e-07, 'rewards/chosen': -1.2992668151855469, 'rewards/rejected': -2.1324520111083984, 'rewards/accuracies': 0.875, 'rewards/margins': 0.833185076713562, 'logps/chosen': -264.5113830566406, 'logps/rejected': -431.5632629394531, 'logps/ref_chosen': -60.824440002441406, 'logps/ref_rejected': -96.47080993652344, 'logits/chosen': 0.35447198152542114, 'logits/rejected': 0.7814577221870422, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.006367330439388752, 'epsilon_dpo/loss_margin_mean': 131.405517578125, 'epsilon_dpo/beta_margin_mean': 0.833185076713562, 'epsilon_dpo/beta_margin_std': 0.9349945187568665, 'epsilon_dpo/beta_margin_grad_mean': -0.33204370737075806, 'epsilon_dpo/beta_margin_grad_std': 0.16942457854747772, 'kl/beta': 0.006413805298507214, 'kl/avg_steps': 0.734375, 'epoch': 0.65}
 65%| 442/681 [32:27<10:54, 2.74s/it] {'loss': 0.7663, 'grad_norm': 36.07387161254883, 'learning_rate': 1.669113001300851e-07, 'rewards/chosen': -1.111487865447998, 'rewards/rejected': -2.1179301738739014, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0064421892166138, 'logps/chosen': -222.94937133789062, 'logps/rejected': -412.21435546875, 'logps/ref_chosen': -47.01121520996094, 'logps/ref_rejected': -76.53926086425781, 'logits/chosen': 0.4353085160255432, 'logits/rejected': 0.7528847455978394, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.006313956808298826, 'epsilon_dpo/loss_margin_mean': 159.73695373535156, 'epsilon_dpo/beta_margin_mean': 1.0064421892166138, 'epsilon_dpo/beta_margin_std': 0.8888764977455139, 'epsilon_dpo/beta_margin_grad_mean': -0.2994399964809418, 'epsilon_dpo/beta_margin_grad_std': 0.1570899337530136, 'kl/beta': 0.006367047317326069, 'kl/avg_steps': 0.84375, 'epoch': 0.65}
 65%| 443/681 [32:30<10:54, 2.75s/it] {'loss': 0.9717, 'grad_norm': 50.00312805175781, 'learning_rate': 1.6570202148426815e-07, 'rewards/chosen': -1.3025455474853516, 'rewards/rejected': -2.0130648612976074, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.7105194330215454, 'logps/chosen': -278.45794677734375, 'logps/rejected': -407.8765869140625, 'logps/ref_chosen': -71.27301788330078, 'logps/ref_rejected': -86.679931640625, 'logits/chosen': 0.2745404839515686, 'logits/rejected': 0.6156367063522339, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.00627494091168046, 'epsilon_dpo/loss_margin_mean': 114.0117416381836, 'epsilon_dpo/beta_margin_mean': 0.7105193734169006, 'epsilon_dpo/beta_margin_std': 0.9240910410881042, 'epsilon_dpo/beta_margin_grad_mean': -0.354687362909317, 'epsilon_dpo/beta_margin_grad_std': 0.17831788957118988, 'kl/beta': 0.006313774734735489, 'kl/avg_steps': 0.625, 'epoch': 0.65}
 65%| 444/681 [32:33<10:40, 2.70s/it] {'loss': 0.8153, 'grad_norm': 41.11380386352539, 'learning_rate': 1.6449496416858282e-07, 'rewards/chosen': -1.282365322113037, 'rewards/rejected': -2.2575414180755615, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9751760959625244, 'logps/chosen': -262.89581298828125, 'logps/rejected': -460.0967712402344, 'logps/ref_chosen': -57.213706970214844, 'logps/ref_rejected': -97.25489044189453, 'logits/chosen': 0.35377830266952515, 'logits/rejected': 0.5459920167922974, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0062300837598741055, 'epsilon_dpo/loss_margin_mean': 157.15977478027344, 'epsilon_dpo/beta_margin_mean': 0.9751760959625244, 'epsilon_dpo/beta_margin_std': 0.9998669028282166, 'epsilon_dpo/beta_margin_grad_mean': -0.30968645215034485, 'epsilon_dpo/beta_margin_grad_std': 0.16983701288700104, 'kl/beta': 0.006274559069424868, 'kl/avg_steps': 0.71875, 'epoch': 0.65}
 65%| 445/681 [32:35<10:38, 2.70s/it] {'loss': 0.7993, 'grad_norm': 45.95329666137695, 'learning_rate': 1.6329015999011182e-07, 'rewards/chosen': -1.2117838859558105, 'rewards/rejected': -2.1650891304016113, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9533052444458008, 'logps/chosen': -262.902587890625, 'logps/rejected': -442.9150390625, 'logps/ref_chosen': -67.29979705810547, 'logps/ref_rejected': -92.68267822265625, 'logits/chosen': 0.3452759087085724, 'logits/rejected': 0.6788507699966431, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.006189518608152866, 'epsilon_dpo/loss_margin_mean': 154.62954711914062, 'epsilon_dpo/beta_margin_mean': 0.9533052444458008, 'epsilon_dpo/beta_margin_std':
0.9005807042121887, 'epsilon_dpo/beta_margin_grad_mean': -0.31064438819885254, 'epsilon_dpo/beta_margin_grad_std': 0.1580774337053299, 'kl/beta': 0.006229782477021217, 'kl/avg_steps': 0.65625, 'epoch': 0.65}
 65%| 446/681 [32:38<10:38, 2.72s/it] {'loss': 0.7814, 'grad_norm': 55.357357025146484, 'learning_rate': 1.6208764069656578e-07, 'rewards/chosen': -1.2111321687698364, 'rewards/rejected': -2.144577980041504, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.9334458112716675, 'logps/chosen': -256.1627197265625, 'logps/rejected': -450.7152404785156, 'logps/ref_chosen': -59.098487854003906, 'logps/ref_rejected': -101.26419067382812, 'logits/chosen': 0.3471897542476654, 'logits/rejected': 0.3918067216873169, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.006143361795693636, 'epsilon_dpo/loss_margin_mean': 152.38681030273438, 'epsilon_dpo/beta_margin_mean': 0.9334458112716675, 'epsilon_dpo/beta_margin_std': 0.7915002107620239, 'epsilon_dpo/beta_margin_grad_mean': -0.30731505155563354, 'epsilon_dpo/beta_margin_grad_std': 0.14560139179229736, 'kl/beta': 0.006189166102558374, 'kl/avg_steps': 0.75, 'epoch': 0.65}
 66%| 447/681 [32:40<10:14, 2.63s/it] {'loss': 0.7784, 'grad_norm': 48.96641540527344, 'learning_rate': 1.608874379754465e-07, 'rewards/chosen': -1.2925333976745605, 'rewards/rejected': -2.267184257507324, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9746509790420532, 'logps/chosen': -268.1436767578125, 'logps/rejected': -471.0999450683594, 'logps/ref_chosen': -56.07533264160156, 'logps/ref_rejected': -98.69475555419922, 'logits/chosen': 0.4716407060623169, 'logits/rejected': 0.4547463059425354, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.006093789357692003, 'epsilon_dpo/loss_margin_mean': 160.3368377685547, 'epsilon_dpo/beta_margin_mean': 0.9746509194374084, 'epsilon_dpo/beta_margin_std': 0.9327902793884277, 'epsilon_dpo/beta_margin_grad_mean': -0.3054165840148926, 'epsilon_dpo/beta_margin_grad_std': 0.1465393453836441, 'kl/beta': 0.006143092643469572, 'kl/avg_steps': 0.8125, 'epoch': 0.66}
 66%| 448/681 [32:43<10:15, 2.64s/it] {'loss': 0.7752, 'grad_norm': 49.381710052490234, 'learning_rate': 1.5968958345321177e-07, 'rewards/chosen': -1.4422485828399658, 'rewards/rejected': -2.452152729034424, 'rewards/accuracies': 0.875, 'rewards/margins': 1.009904146194458, 'logps/chosen': -298.080322265625, 'logps/rejected': -507.822021484375, 'logps/ref_chosen': -60.00384521484375, 'logps/ref_rejected': -102.26465606689453, 'logits/chosen': 0.47561150789260864, 'logits/rejected': 0.6322281360626221, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.006052294280380011, 'epsilon_dpo/loss_margin_mean': 167.4809112548828, 'epsilon_dpo/beta_margin_mean': 1.009904146194458, 'epsilon_dpo/beta_margin_std': 0.9370668530464172, 'epsilon_dpo/beta_margin_grad_mean': -0.30193132162094116, 'epsilon_dpo/beta_margin_grad_std': 0.15906290709972382, 'kl/beta': 0.006093582604080439, 'kl/avg_steps': 0.6875, 'epoch': 0.66}
 66%| 449/681 [32:46<10:10, 2.63s/it] {'loss': 0.8983, 'grad_norm': 60.586238861083984, 'learning_rate': 1.584941086944423e-07, 'rewards/chosen': -1.452686071395874, 'rewards/rejected': -2.3414292335510254, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8887430429458618, 'logps/chosen': -308.541259765625, 'logps/rejected': -478.1945495605469, 'logps/ref_chosen': -67.52661895751953, 'logps/ref_rejected': -88.59690856933594, 'logits/chosen': 0.2962471544742584, 'logits/rejected': 0.6988736391067505, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.006014752201735973, 'epsilon_dpo/loss_margin_mean': 148.5830078125, 'epsilon_dpo/beta_margin_mean': 0.888742983341217, 'epsilon_dpo/beta_margin_std': 1.0735763311386108, 'epsilon_dpo/beta_margin_grad_mean': -0.3279159367084503, 'epsilon_dpo/beta_margin_grad_std': 0.18189726769924164, 'kl/beta': 0.006051975302398205, 'kl/avg_steps': 0.625, 'epoch': 0.66}
 66%| 450/681 [32:48<10:06, 2.63s/it] {'loss': 0.7534, 'grad_norm': 50.2854118347168, 'learning_rate': 1.573010452010098e-07, 'rewards/chosen': -1.2739126682281494, 'rewards/rejected': -2.295565605163574, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0216530561447144, 'logps/chosen': -270.3196105957031, 'logps/rejected': -487.6568603515625, 'logps/ref_chosen': -57.108116149902344, 'logps/ref_rejected': -102.75494384765625, 'logits/chosen': 0.4698118567466736, 'logits/rejected': 0.6411651372909546, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.005967994686216116, 'epsilon_dpo/loss_margin_mean': 171.69039916992188, 'epsilon_dpo/beta_margin_mean': 1.0216530561447144, 'epsilon_dpo/beta_margin_std': 0.8822335600852966, 'epsilon_dpo/beta_margin_grad_mean': -0.29214900732040405, 'epsilon_dpo/beta_margin_grad_std': 0.1508917659521103, 'kl/beta': 0.006014385260641575, 'kl/avg_steps': 0.78125, 'epoch': 0.66}
 66%| 451/681 [32:51<09:45, 2.55s/it] {'loss': 0.9089, 'grad_norm': 50.47534942626953, 'learning_rate': 1.5611042441124687e-07, 'rewards/chosen': -1.556694507598877, 'rewards/rejected': -2.3582100868225098, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8015156388282776, 'logps/chosen': -320.57232666015625, 'logps/rejected': -470.96759033203125, 'logps/ref_chosen': -58.46883010864258, 'logps/ref_rejected': -72.92941284179688, 'logits/chosen': 0.7546664476394653, 'logits/rejected': 1.2638273239135742, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.005929191131144762, 'epsilon_dpo/loss_margin_mean': 135.93467712402344, 'epsilon_dpo/beta_margin_mean': 0.8015156388282776, 'epsilon_dpo/beta_margin_std': 0.9112508893013, 'epsilon_dpo/beta_margin_grad_mean': -0.3303794860839844, 'epsilon_dpo/beta_margin_grad_std': 0.1712116152048111, 'kl/beta': 0.005967761855572462, 'kl/avg_steps': 0.65625, 'epoch': 0.66}
 66%| 452/681 [32:53<09:51, 2.58s/it] {'loss': 0.8214, 'grad_norm': 34.57151794433594, 'learning_rate': 1.549222776991186e-07, 'rewards/chosen': -1.2716758251190186, 'rewards/rejected': -2.1425392627716064, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.8708633780479431, 'logps/chosen': -266.37939453125, 'logps/rejected': -462.17816162109375, 'logps/ref_chosen': -50.39055252075195, 'logps/ref_rejected': -97.77143096923828, 'logits/chosen': 0.7242465019226074, 'logits/rejected': 0.6068210601806641, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.00588312279433012, 'epsilon_dpo/loss_margin_mean': 148.41787719726562, 'epsilon_dpo/beta_margin_mean': 0.8708633780479431, 'epsilon_dpo/beta_margin_std': 0.8176453113555908, 'epsilon_dpo/beta_margin_grad_mean': -0.32057738304138184, 'epsilon_dpo/beta_margin_grad_std': 0.14430400729179382, 'kl/beta': 0.005928853992372751, 'kl/avg_steps': 0.78125, 'epoch': 0.66}
 67%| 453/681 [32:56<09:46, 2.57s/it] {'loss': 0.8671, 'grad_norm': 33.32931137084961, 'learning_rate': 1.5373663637339584e-07, 'rewards/chosen': -1.368526816368103, 'rewards/rejected': -2.193310260772705, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8247835636138916, 'logps/chosen': -291.4721984863281, 'logps/rejected': -457.70806884765625, 'logps/ref_chosen': -57.71485137939453, 'logps/ref_rejected': -82.20741271972656, 'logits/chosen': 0.35818058252334595, 'logits/rejected': 0.8482306003570557, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0058448719792068005, 'epsilon_dpo/loss_margin_mean': 141.74331665039062, 'epsilon_dpo/beta_margin_mean': 0.8247835636138916, 'epsilon_dpo/beta_margin_std': 0.8746103644371033, 'epsilon_dpo/beta_margin_grad_mean': -0.3328855633735657, 'epsilon_dpo/beta_margin_grad_std': 0.15490888059139252, 'kl/beta': 0.005882893688976765, 'kl/avg_steps': 0.65625, 'epoch': 0.67}
 67%| 454/681 [32:59<09:54, 2.62s/it] {'loss': 0.8274, 'grad_norm': 43.49067306518555, 'learning_rate': 1.5255353167683017e-07, 'rewards/chosen': -1.4860866069793701, 'rewards/rejected': -2.4427573680877686, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9566707611083984, 'logps/chosen': -316.734375, 'logps/rejected': -506.279052734375, 'logps/ref_chosen': -60.945648193359375, 'logps/ref_rejected': -84.9507827758789, 'logits/chosen': 0.5292907357215881, 'logits/rejected': 1.0574098825454712, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0058012851513922215, 'epsilon_dpo/loss_margin_mean': 165.53953552246094, 'epsilon_dpo/beta_margin_mean': 0.9566707611083984, 'epsilon_dpo/beta_margin_std': 1.0051079988479614, 'epsilon_dpo/beta_margin_grad_mean': -0.3133637309074402, 'epsilon_dpo/beta_margin_grad_std': 0.1705373227596283, 'kl/beta': 0.005844539031386375, 'kl/avg_steps': 0.75, 'epoch': 0.67}
 67%| 455/681 [33:01<09:42, 2.58s/it] {'loss': 0.7524, 'grad_norm': 34.411067962646484, 'learning_rate': 1.5137299478533064e-07, 'rewards/chosen': -1.282523512840271, 'rewards/rejected': -2.3502144813537598, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.0676908493041992, 'logps/chosen': -267.48406982421875, 'logps/rejected': -523.9266357421875, 'logps/ref_chosen': -44.88671112060547, 'logps/ref_rejected': -115.30147552490234, 'logits/chosen': 0.43719807267189026, 'logits/rejected': 0.41231751441955566, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.005758099257946014, 'epsilon_dpo/loss_margin_mean': 186.02780151367188, 'epsilon_dpo/beta_margin_mean': 1.0676908493041992, 'epsilon_dpo/beta_margin_std': 0.952242374420166, 'epsilon_dpo/beta_margin_grad_mean': -0.2917003035545349, 'epsilon_dpo/beta_margin_grad_std': 0.16674098372459412, 'kl/beta': 0.005801031365990639, 'kl/avg_steps': 0.75, 'epoch': 0.67}
 67%| 456/681 [33:04<09:51, 2.63s/it] {'loss': 0.71, 'grad_norm': 32.09308624267578, 'learning_rate': 1.5019505680714232e-07, 'rewards/chosen': -1.247272253036499, 'rewards/rejected': -2.313566207885742, 'rewards/accuracies': 0.984375, 'rewards/margins': 1.066293716430664, 'logps/chosen': -275.6275939941406, 'logps/rejected': -510.848876953125, 'logps/ref_chosen': -57.036781311035156, 'logps/ref_rejected': -105.21783447265625, 'logits/chosen': 0.39520812034606934, 'logits/rejected': 0.4162057042121887, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'epsilon_dpo/beta': 0.0057062371633946896, 'epsilon_dpo/loss_margin_mean': 187.0402374267578, 'epsilon_dpo/beta_margin_mean': 1.066293716430664, 'epsilon_dpo/beta_margin_std': 0.8393720984458923, 'epsilon_dpo/beta_margin_grad_mean': -0.2859190106391907, 'epsilon_dpo/beta_margin_grad_std': 0.13566403090953827, 'kl/beta': 0.005757847335189581, 'kl/avg_steps': 0.90625, 'epoch': 0.67}
 67%| 457/681 [33:07<09:56, 2.66s/it] {'loss': 0.8047, 'grad_norm': 44.855472564697266, 'learning_rate': 1.4901974878202627e-07, 'rewards/chosen': -1.2122774124145508, 'rewards/rejected': -2.0989460945129395, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8866687417030334, 'logps/chosen': -268.1856689453125, 'logps/rejected': -456.13555908203125, 'logps/ref_chosen': -54.24253845214844, 'logps/ref_rejected': -85.10956573486328, 'logits/chosen': 0.3958442211151123, 'logits/rejected': 0.794312596321106, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0056639062240719795, 'epsilon_dpo/loss_margin_mean': 157.08285522460938, 'epsilon_dpo/beta_margin_mean': 0.8866687417030334, 'epsilon_dpo/beta_margin_std': 0.7621463537216187, 'epsilon_dpo/beta_margin_grad_mean': -0.312649130821228, 'epsilon_dpo/beta_margin_grad_std': 0.14824675023555756, 'kl/beta': 0.0057061356492340565, 'kl/avg_steps': 0.75, 'epoch': 0.67}
 67%| 458/681 [33:09<09:42, 2.61s/it] {'loss': 0.7622, 'grad_norm': 38.89377212524414, 'learning_rate': 1.4784710168044212e-07, 'rewards/chosen': -1.2583105564117432, 'rewards/rejected': -2.28153133392334, 'rewards/accuracies': 0.84375, 'rewards/margins': 1.0232208967208862, 'logps/chosen': -278.69854736328125, 'logps/rejected': -503.6078796386719, 'logps/ref_chosen': -55.40888214111328, 'logps/ref_rejected': -97.68325805664062, 'logits/chosen': 0.4888191819190979, 'logits/rejected': 0.6527252793312073, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.005625282879918814, 'epsilon_dpo/loss_margin_mean': 182.63494873046875, 'epsilon_dpo/beta_margin_mean': 1.0232208967208862, 'epsilon_dpo/beta_margin_std': 0.9355968832969666, 'epsilon_dpo/beta_margin_grad_mean': -0.29639893770217896, 'epsilon_dpo/beta_margin_grad_std': 0.15821236371994019, 'kl/beta': 0.0056636580266058445, 'kl/avg_steps': 0.6875, 'epoch': 0.67}
 67%| 459/681 [33:12<09:46, 2.64s/it] {'loss': 0.8372, 'grad_norm': 44.090911865234375, 'learning_rate': 1.466771464027316e-07, 'rewards/chosen': -1.3366007804870605, 'rewards/rejected': -2.2146730422973633, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8780723810195923, 'logps/chosen': -285.59722900390625, 'logps/rejected': -483.0376281738281, 'logps/ref_chosen': -46.55748748779297, 'logps/ref_rejected': -86.16854095458984, 'logits/chosen': 0.7341614961624146, 'logits/rejected': 0.9279061555862427, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.005585114937275648, 'epsilon_dpo/loss_margin_mean': 157.82933044433594, 'epsilon_dpo/beta_margin_mean': 0.8780723810195923, 'epsilon_dpo/beta_margin_std': 0.8582127690315247, 'epsilon_dpo/beta_margin_grad_mean': -0.3197575509548187, 'epsilon_dpo/beta_margin_grad_std': 0.15985752642154694, 'kl/beta': 0.005624986253678799, 'kl/avg_steps': 0.71875, 'epoch': 0.67}
 68%| 460/681 [33:15<09:53, 2.69s/it] {'loss': 0.8001, 'grad_norm': 38.01408386230469, 'learning_rate': 1.4550991377830423e-07, 'rewards/chosen': -1.403839349746704, 'rewards/rejected':
-2.3716347217559814, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9677952527999878, 'logps/chosen': -304.29058837890625, 'logps/rejected': -532.0263061523438, 'logps/ref_chosen': -51.63489532470703, 'logps/ref_rejected': -104.11935424804688, 'logits/chosen': 0.7233595848083496, 'logits/rejected': 0.6271342039108276, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.005545258987694979, 'epsilon_dpo/loss_margin_mean': 175.251220703125, 'epsilon_dpo/beta_margin_mean': 0.9677952527999878, 'epsilon_dpo/beta_margin_std': 0.9130998253822327, 'epsilon_dpo/beta_margin_grad_mean': -0.30482274293899536, 'epsilon_dpo/beta_margin_grad_std': 0.1670825034379959, 'kl/beta': 0.005584845319390297, 'kl/avg_steps': 0.71875, 'epoch': 0.68}
 68%| 461/681 [33:17<09:57, 2.72s/it] {'loss': 0.9471, 'grad_norm': 38.21791076660156, 'learning_rate': 1.4434543456482518e-07, 'rewards/chosen': -1.4901607036590576, 'rewards/rejected': -2.2050065994262695, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.7148459553718567, 'logps/chosen': -325.3001708984375, 'logps/rejected': -487.05438232421875, 'logps/ref_chosen': -55.18195343017578, 'logps/ref_rejected': -86.47689819335938, 'logits/chosen': 0.7113257646560669, 'logits/rejected': 0.8079010248184204, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.005510885734111071, 'epsilon_dpo/loss_margin_mean': 130.45925903320312, 'epsilon_dpo/beta_margin_mean': 0.7148459553718567, 'epsilon_dpo/beta_margin_std': 0.8677749037742615, 'epsilon_dpo/beta_margin_grad_mean': -0.3535759449005127, 'epsilon_dpo/beta_margin_grad_std': 0.16537989675998688, 'kl/beta': 0.0055449907667934895, 'kl/avg_steps': 0.625, 'epoch': 0.68}
 68%| 462/681 [33:20<09:48, 2.69s/it] {'loss': 0.9378, 'grad_norm': 42.94804763793945, 'learning_rate': 1.4318373944740484e-07, 'rewards/chosen': -1.4143660068511963, 'rewards/rejected': -2.0833849906921387, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.6690188646316528, 'logps/chosen': -327.9981689453125, 'logps/rejected': -459.68743896484375, 'logps/ref_chosen': -69.92803955078125, 'logps/ref_rejected': -78.84111785888672, 'logits/chosen': 0.5266987681388855, 'logits/rejected': 0.8050484657287598, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.00547578651458025, 'epsilon_dpo/loss_margin_mean': 122.77618408203125, 'epsilon_dpo/beta_margin_mean': 0.6690189242362976, 'epsilon_dpo/beta_margin_std': 0.7433232069015503, 'epsilon_dpo/beta_margin_grad_mean': -0.35789817571640015, 'epsilon_dpo/beta_margin_grad_std': 0.14245951175689697, 'kl/beta': 0.0055105495266616344, 'kl/avg_steps': 0.640625, 'epoch': 0.68}
 68%| 463/681 [33:23<09:44, 2.68s/it] {'loss': 0.9069, 'grad_norm': 50.6060905456543, 'learning_rate': 1.4202485903778976e-07, 'rewards/chosen': -1.4034281969070435, 'rewards/rejected': -2.192054271697998, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.788625955581665, 'logps/chosen': -312.5292663574219, 'logps/rejected': -492.0359191894531, 'logps/ref_chosen': -55.27437210083008, 'logps/ref_rejected': -89.02497863769531, 'logits/chosen': 0.5106614828109741, 'logits/rejected': 0.7114510536193848, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.00544350640848279, 'epsilon_dpo/loss_margin_mean': 145.75604248046875, 'epsilon_dpo/beta_margin_mean': 0.788625955581665, 'epsilon_dpo/beta_margin_std': 0.8829845786094666, 'epsilon_dpo/beta_margin_grad_mean': -0.33866074681282043, 'epsilon_dpo/beta_margin_grad_std': 0.17297625541687012, 'kl/beta': 0.005475472658872604, 'kl/avg_steps': 0.59375, 'epoch': 0.68}
 68%| 464/681 [33:25<09:32, 2.64s/it] {'loss': 0.7761, 'grad_norm': 41.503604888916016, 'learning_rate': 1.4086882387355658e-07, 'rewards/chosen': -1.3622095584869385, 'rewards/rejected': -2.404740571975708, 'rewards/accuracies': 0.875, 'rewards/margins': 1.0425310134887695, 'logps/chosen': -302.8582763671875, 'logps/rejected': -548.061767578125, 'logps/ref_chosen': -50.91230010986328, 'logps/ref_rejected': -102.4893798828125, 'logits/chosen': 0.6274369955062866, 'logits/rejected': 0.5430445671081543, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.00540287047624588, 'epsilon_dpo/loss_margin_mean': 193.62644958496094, 'epsilon_dpo/beta_margin_mean': 1.0425310134887695, 'epsilon_dpo/beta_margin_std': 0.9792246222496033, 'epsilon_dpo/beta_margin_grad_mean': -0.2959914207458496, 'epsilon_dpo/beta_margin_grad_std': 0.17414125800132751, 'kl/beta': 0.005443153902888298, 'kl/avg_steps': 0.75, 'epoch': 0.68}
 68%| 465/681 [33:28<09:34, 2.66s/it] {'loss': 0.7489, 'grad_norm': 51.4481201171875, 'learning_rate': 1.3971566441730714e-07, 'rewards/chosen': -1.2803826332092285, 'rewards/rejected': -2.294574737548828, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.0141922235488892, 'logps/chosen': -298.53082275390625, 'logps/rejected': -542.0888061523438, 'logps/ref_chosen': -60.116851806640625, 'logps/ref_rejected': -113.94602966308594, 'logits/chosen': 0.38318532705307007, 'logits/rejected': 0.6318896412849426, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.005362650379538536, 'epsilon_dpo/loss_margin_mean': 189.72882080078125, 'epsilon_dpo/beta_margin_mean': 1.0141922235488892, 'epsilon_dpo/beta_margin_std': 0.838722825050354, 'epsilon_dpo/beta_margin_grad_mean': -0.29354214668273926, 'epsilon_dpo/beta_margin_grad_std': 0.1528787612915039, 'kl/beta': 0.005402633920311928, 'kl/avg_steps': 0.75, 'epoch': 0.68}
 68%| 466/681 [33:31<09:35, 2.68s/it] {'loss': 0.8586, 'grad_norm': 54.184085845947266, 'learning_rate': 1.3856541105586545e-07, 'rewards/chosen': -1.3661298751831055, 'rewards/rejected': -2.2118778228759766, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8457480669021606, 'logps/chosen': -308.79168701171875, 'logps/rejected': -505.747314453125, 'logps/ref_chosen': -52.920921325683594, 'logps/ref_rejected': -90.3154296875, 'logits/chosen': 0.6274415254592896, 'logits/rejected': 0.779162585735321, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.005329433362931013, 'epsilon_dpo/loss_margin_mean': 159.5610809326172, 'epsilon_dpo/beta_margin_mean': 0.8457480072975159, 'epsilon_dpo/beta_margin_std': 0.8515949249267578, 'epsilon_dpo/beta_margin_grad_mean': -0.324601948261261, 'epsilon_dpo/beta_margin_grad_std': 0.1643323004245758, 'kl/beta': 0.005362415686249733, 'kl/avg_steps': 0.625, 'epoch': 0.68}
 69%| 467/681 [33:33<09:37, 2.70s/it] {'loss': 0.8859, 'grad_norm': 60.39503479003906, 'learning_rate': 1.3741809409947729e-07, 'rewards/chosen': -1.5679855346679688, 'rewards/rejected': -2.445469379425049, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.8774838447570801, 'logps/chosen': -374.0175476074219, 'logps/rejected': -564.8558349609375, 'logps/ref_chosen': -78.7158203125, 'logps/ref_rejected':
-102.86019897460938, 'logits/chosen': 0.34069177508354187, 'logits/rejected': 0.7749538421630859, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.005297997035086155, 'epsilon_dpo/loss_margin_mean': 166.6938934326172, 'epsilon_dpo/beta_margin_mean': 0.8774838447570801, 'epsilon_dpo/beta_margin_std': 0.9864553809165955, 'epsilon_dpo/beta_margin_grad_mean': -0.32604339718818665, 'epsilon_dpo/beta_margin_grad_std': 0.18723338842391968, 'kl/beta': 0.00532910879701376, 'kl/avg_steps': 0.59375, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▍ | 467/681 [33:33<09:37, 2.70s/it] 69%|█████████████████████████████████████████████████████▌ | 468/681 [33:36<09:41, 2.73s/it] {'loss': 0.8196, 'grad_norm': 44.3262939453125, 'learning_rate': 1.362737437810114e-07, 'rewards/chosen': -1.2665220499038696, 'rewards/rejected': -2.1838462352752686, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9173241853713989, 'logps/chosen': -310.4765625, 'logps/rejected': -516.6727905273438, 'logps/ref_chosen': -69.93536376953125, 'logps/ref_rejected': -101.02881622314453, 'logits/chosen': 0.3684792220592499, 'logits/rejected': 0.5974057912826538, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.005258447490632534, 'epsilon_dpo/loss_margin_mean': 175.10275268554688, 'epsilon_dpo/beta_margin_mean': 0.9173241853713989, 'epsilon_dpo/beta_margin_std': 0.8768501281738281, 'epsilon_dpo/beta_margin_grad_mean': -0.3130618631839752, 'epsilon_dpo/beta_margin_grad_std': 0.16425372660160065, 'kl/beta': 0.0052976543083786964, 'kl/avg_steps': 0.75, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▌ | 468/681 [33:36<09:41, 2.73s/it] 69%|█████████████████████████████████████████████████████▋ | 469/681 [33:39<09:49, 2.78s/it] {'loss': 0.7934, 'grad_norm': 38.74735641479492, 'learning_rate': 1.351323902551631e-07, 'rewards/chosen': -1.4098811149597168, 'rewards/rejected': -2.343319892883301, 
'rewards/accuracies': 0.875, 'rewards/margins': 0.9334390163421631, 'logps/chosen': -338.1322021484375, 'logps/rejected': -554.191162109375, 'logps/ref_chosen': -68.12469482421875, 'logps/ref_rejected': -104.78640747070312, 'logits/chosen': 0.49617844820022583, 'logits/rejected': 0.7309253811836243, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.005218472797423601, 'epsilon_dpo/loss_margin_mean': 179.397216796875, 'epsilon_dpo/beta_margin_mean': 0.9334389567375183, 'epsilon_dpo/beta_margin_std': 0.8322674632072449, 'epsilon_dpo/beta_margin_grad_mean': -0.30890974402427673, 'epsilon_dpo/beta_margin_grad_std': 0.15223819017410278, 'kl/beta': 0.005258217453956604, 'kl/avg_steps': 0.765625, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▋ | 469/681 [33:39<09:49, 2.78s/it] 69%|█████████████████████████████████████████████████████▊ | 470/681 [33:42<09:27, 2.69s/it] {'loss': 0.7491, 'grad_norm': 48.59160232543945, 'learning_rate': 1.339940635976592e-07, 'rewards/chosen': -1.136615514755249, 'rewards/rejected': -2.141813278198242, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0051977634429932, 'logps/chosen': -263.25250244140625, 'logps/rejected': -496.8262939453125, 'logps/ref_chosen': -43.79193115234375, 'logps/ref_rejected': -82.70285034179688, 'logits/chosen': 0.5892876982688904, 'logits/rejected': 0.8713769316673279, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.0051747532561421394, 'epsilon_dpo/loss_margin_mean': 194.66285705566406, 'epsilon_dpo/beta_margin_mean': 1.0051977634429932, 'epsilon_dpo/beta_margin_std': 0.8227752447128296, 'epsilon_dpo/beta_margin_grad_mean': -0.29332658648490906, 'epsilon_dpo/beta_margin_grad_std': 0.14796683192253113, 'kl/beta': 0.005218265112489462, 'kl/avg_steps': 0.84375, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▊ | 470/681 [33:42<09:27, 2.69s/it] 
69%|█████████████████████████████████████████████████████▉ | 471/681 [33:44<09:08, 2.61s/it] {'loss': 0.8468, 'grad_norm': 35.29491424560547, 'learning_rate': 1.3285879380446563e-07, 'rewards/chosen': -1.2650556564331055, 'rewards/rejected': -2.064793825149536, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.7997382879257202, 'logps/chosen': -309.1363525390625, 'logps/rejected': -485.63812255859375, 'logps/ref_chosen': -63.33952331542969, 'logps/ref_rejected': -83.61048126220703, 'logits/chosen': 0.4337306022644043, 'logits/rejected': 0.9008173942565918, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.005142777226865292, 'epsilon_dpo/loss_margin_mean': 156.2308349609375, 'epsilon_dpo/beta_margin_mean': 0.7997382283210754, 'epsilon_dpo/beta_margin_std': 0.7150464057922363, 'epsilon_dpo/beta_margin_grad_mean': -0.3280228078365326, 'epsilon_dpo/beta_margin_grad_std': 0.14275549352169037, 'kl/beta': 0.005174604244530201, 'kl/avg_steps': 0.625, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████▉ | 471/681 [33:44<09:08, 2.61s/it] 69%|██████████████████████████████████████████████████████ | 472/681 [33:47<09:29, 2.73s/it] {'loss': 0.8026, 'grad_norm': 32.15871047973633, 'learning_rate': 1.317266107909975e-07, 'rewards/chosen': -1.1983230113983154, 'rewards/rejected': -2.137685537338257, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.9393625259399414, 'logps/chosen': -318.1406555175781, 'logps/rejected': -536.2745361328125, 'logps/ref_chosen': -83.66609954833984, 'logps/ref_rejected': -117.20919799804688, 'logits/chosen': -0.0728834792971611, 'logits/rejected': 0.19875231385231018, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.005106013268232346, 'epsilon_dpo/loss_margin_mean': 184.59075927734375, 'epsilon_dpo/beta_margin_mean': 0.9393625259399414, 'epsilon_dpo/beta_margin_std': 0.907642126083374, 'epsilon_dpo/beta_margin_grad_mean': -0.3135577142238617, 
'epsilon_dpo/beta_margin_grad_std': 0.15046267211437225, 'kl/beta': 0.0051424638368189335, 'kl/avg_steps': 0.71875, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████ | 472/681 [33:47<09:29, 2.73s/it] 69%|██████████████████████████████████████████████████████▏ | 473/681 [33:50<09:32, 2.75s/it] {'loss': 1.0059, 'grad_norm': 61.05051803588867, 'learning_rate': 1.3059754439133002e-07, 'rewards/chosen': -1.3861662149429321, 'rewards/rejected': -2.025066614151001, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.6389002799987793, 'logps/chosen': -335.890625, 'logps/rejected': -480.41339111328125, 'logps/ref_chosen': -63.49696731567383, 'logps/ref_rejected': -81.14657592773438, 'logits/chosen': 0.35071080923080444, 'logits/rejected': 0.8694913387298584, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.005079149734228849, 'epsilon_dpo/loss_margin_mean': 126.8731918334961, 'epsilon_dpo/beta_margin_mean': 0.6389003396034241, 'epsilon_dpo/beta_margin_std': 0.8650482892990112, 'epsilon_dpo/beta_margin_grad_mean': -0.3632872700691223, 'epsilon_dpo/beta_margin_grad_std': 0.1751829981803894, 'kl/beta': 0.0051057664677500725, 'kl/avg_steps': 0.53125, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████▏ | 473/681 [33:50<09:32, 2.75s/it] 70%|██████████████████████████████████████████████████████▎ | 474/681 [33:53<09:36, 2.79s/it] {'loss': 0.9708, 'grad_norm': 47.492401123046875, 'learning_rate': 1.2947162435741277e-07, 'rewards/chosen': -1.2170591354370117, 'rewards/rejected': -1.8874934911727905, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.6704343557357788, 'logps/chosen': -292.8822021484375, 'logps/rejected': -464.0338439941406, 'logps/ref_chosen': -52.6119384765625, 'logps/ref_rejected': -90.08041381835938, 'logits/chosen': 0.5560064315795898, 'logits/rejected': 0.6434218287467957, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.005053896456956863, 
'epsilon_dpo/loss_margin_mean': 133.68316650390625, 'epsilon_dpo/beta_margin_mean': 0.670434296131134, 'epsilon_dpo/beta_margin_std': 0.8294155597686768, 'epsilon_dpo/beta_margin_grad_mean': -0.3590001165866852, 'epsilon_dpo/beta_margin_grad_std': 0.16798219084739685, 'kl/beta': 0.0050787851214408875, 'kl/avg_steps': 0.5, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▎ | 474/681 [33:53<09:36, 2.79s/it] 70%|██████████████████████████████████████████████████████▍ | 475/681 [33:55<09:27, 2.75s/it] {'loss': 0.7992, 'grad_norm': 61.19434356689453, 'learning_rate': 1.2834888035828596e-07, 'rewards/chosen': -1.0072057247161865, 'rewards/rejected': -1.9279804229736328, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.9207746982574463, 'logps/chosen': -243.25057983398438, 'logps/rejected': -474.9560546875, 'logps/ref_chosen': -42.49519348144531, 'logps/ref_rejected': -90.06295013427734, 'logits/chosen': 0.6205512881278992, 'logits/rejected': 0.6674166321754456, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.005014538764953613, 'epsilon_dpo/loss_margin_mean': 184.13772583007812, 'epsilon_dpo/beta_margin_mean': 0.9207746982574463, 'epsilon_dpo/beta_margin_std': 0.8331746459007263, 'epsilon_dpo/beta_margin_grad_mean': -0.3117504417896271, 'epsilon_dpo/beta_margin_grad_std': 0.15037527680397034, 'kl/beta': 0.005053517874330282, 'kl/avg_steps': 0.78125, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▍ | 475/681 [33:55<09:27, 2.75s/it] 70%|██████████████████████████████████████████████████████▌ | 476/681 [33:58<09:26, 2.76s/it] {'loss': 0.8502, 'grad_norm': 40.11614990234375, 'learning_rate': 1.2722934197929802e-07, 'rewards/chosen': -1.072633981704712, 'rewards/rejected': -1.8573660850524902, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.7847321033477783, 'logps/chosen': -258.5040283203125, 'logps/rejected': -447.39532470703125, 'logps/ref_chosen': -42.949378967285156, 
'logps/ref_rejected': -73.71023559570312, 'logits/chosen': 0.5991215705871582, 'logits/rejected': 0.9088248610496521, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.004972531925886869, 'epsilon_dpo/loss_margin_mean': 158.13043212890625, 'epsilon_dpo/beta_margin_mean': 0.7847320437431335, 'epsilon_dpo/beta_margin_std': 0.702809751033783, 'epsilon_dpo/beta_margin_grad_mean': -0.33235964179039, 'epsilon_dpo/beta_margin_grad_std': 0.1353040486574173, 'kl/beta': 0.005014343187212944, 'kl/avg_steps': 0.84375, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▌ | 476/681 [33:58<09:26, 2.76s/it] 70%|██████████████████████████████████████████████████████▋ | 477/681 [34:01<09:17, 2.73s/it] {'loss': 0.8885, 'grad_norm': 36.1607551574707, 'learning_rate': 1.2611303872132631e-07, 'rewards/chosen': -1.149595022201538, 'rewards/rejected': -1.9216588735580444, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7720637321472168, 'logps/chosen': -303.25787353515625, 'logps/rejected': -465.7208557128906, 'logps/ref_chosen': -70.77261352539062, 'logps/ref_rejected': -76.13737487792969, 'logits/chosen': 0.23146602511405945, 'logits/rejected': 0.8222414255142212, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0049379123374819756, 'epsilon_dpo/loss_margin_mean': 157.0982208251953, 'epsilon_dpo/beta_margin_mean': 0.7720637321472168, 'epsilon_dpo/beta_margin_std': 0.7845483422279358, 'epsilon_dpo/beta_margin_grad_mean': -0.334118515253067, 'epsilon_dpo/beta_margin_grad_std': 0.1581631898880005, 'kl/beta': 0.004972388502210379, 'kl/avg_steps': 0.703125, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▋ | 477/681 [34:01<09:17, 2.73s/it] 70%|██████████████████████████████████████████████████████▋ | 478/681 [34:04<09:22, 2.77s/it] {'loss': 0.8134, 'grad_norm': 36.4519157409668, 'learning_rate': 1.2500000000000005e-07, 'rewards/chosen': -0.956280529499054, 
'rewards/rejected': -1.8194578886032104, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8631774187088013, 'logps/chosen': -236.5076904296875, 'logps/rejected': -457.0362548828125, 'logps/ref_chosen': -41.440513610839844, 'logps/ref_rejected': -85.36196899414062, 'logits/chosen': 0.5271429419517517, 'logits/rejected': 0.6856623888015747, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'epsilon_dpo/beta': 0.004896498750895262, 'epsilon_dpo/loss_margin_mean': 176.6071014404297, 'epsilon_dpo/beta_margin_mean': 0.8631773591041565, 'epsilon_dpo/beta_margin_std': 0.7796393036842346, 'epsilon_dpo/beta_margin_grad_mean': -0.3210771977901459, 'epsilon_dpo/beta_margin_grad_std': 0.1335819661617279, 'kl/beta': 0.004937670659273863, 'kl/avg_steps': 0.84375, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▋ | 478/681 [34:04<09:22, 2.77s/it] 70%|██████████████████████████████████████████████████████▊ | 479/681 [34:06<09:07, 2.71s/it] {'loss': 0.8927, 'grad_norm': 50.982177734375, 'learning_rate': 1.2389025514492456e-07, 'rewards/chosen': -1.1844937801361084, 'rewards/rejected': -1.9222524166107178, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7377583980560303, 'logps/chosen': -297.18328857421875, 'logps/rejected': -490.7677001953125, 'logps/ref_chosen': -53.907920837402344, 'logps/ref_rejected': -95.1163330078125, 'logits/chosen': 0.5171575546264648, 'logits/rejected': 0.5576011538505554, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.004861651454120874, 'epsilon_dpo/loss_margin_mean': 152.3760223388672, 'epsilon_dpo/beta_margin_mean': 0.7377583980560303, 'epsilon_dpo/beta_margin_std': 0.7295092344284058, 'epsilon_dpo/beta_margin_grad_mean': -0.3410286605358124, 'epsilon_dpo/beta_margin_grad_std': 0.14748983085155487, 'kl/beta': 0.0048963576555252075, 'kl/avg_steps': 0.71875, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▊ | 479/681 [34:06<09:07, 2.71s/it] 
70%|██████████████████████████████████████████████████████▉ | 480/681 [34:09<08:59, 2.69s/it] {'loss': 1.0101, 'grad_norm': 57.463008880615234, 'learning_rate': 1.227838333989088e-07, 'rewards/chosen': -1.2613255977630615, 'rewards/rejected': -1.8945307731628418, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.633205235004425, 'logps/chosen': -318.82342529296875, 'logps/rejected': -475.0904541015625, 'logps/ref_chosen': -58.682701110839844, 'logps/ref_rejected': -82.93248748779297, 'logits/chosen': 0.33993053436279297, 'logits/rejected': 0.8602235913276672, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.004836073610931635, 'epsilon_dpo/loss_margin_mean': 132.01724243164062, 'epsilon_dpo/beta_margin_mean': 0.6332051753997803, 'epsilon_dpo/beta_margin_std': 0.8665231466293335, 'epsilon_dpo/beta_margin_grad_mean': -0.36628028750419617, 'epsilon_dpo/beta_margin_grad_std': 0.17329415678977966, 'kl/beta': 0.004861416295170784, 'kl/avg_steps': 0.53125, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████▉ | 480/681 [34:09<08:59, 2.69s/it] 71%|███████████████████████████████████████████████████████ | 481/681 [34:11<08:50, 2.65s/it] {'loss': 0.8738, 'grad_norm': 33.654972076416016, 'learning_rate': 1.2168076391719489e-07, 'rewards/chosen': -1.1127684116363525, 'rewards/rejected': -1.8897743225097656, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.7770059108734131, 'logps/chosen': -286.5435791015625, 'logps/rejected': -486.5083312988281, 'logps/ref_chosen': -54.964271545410156, 'logps/ref_rejected': -92.42044067382812, 'logits/chosen': 0.38666173815727234, 'logits/rejected': 0.6381037831306458, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.004801449831575155, 'epsilon_dpo/loss_margin_mean': 162.50860595703125, 'epsilon_dpo/beta_margin_mean': 0.7770059108734131, 'epsilon_dpo/beta_margin_std': 0.7589020729064941, 'epsilon_dpo/beta_margin_grad_mean': 
-0.33503931760787964, 'epsilon_dpo/beta_margin_grad_std': 0.15022431313991547, 'kl/beta': 0.004835726227611303, 'kl/avg_steps': 0.71875, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████ | 481/681 [34:12<08:50, 2.65s/it] 71%|███████████████████████████████████████████████████████▏ | 482/681 [34:14<08:55, 2.69s/it] {'loss': 0.9868, 'grad_norm': 37.491363525390625, 'learning_rate': 1.2058107576668938e-07, 'rewards/chosen': -1.1996984481811523, 'rewards/rejected': -1.8349155187606812, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.6352172493934631, 'logps/chosen': -318.55609130859375, 'logps/rejected': -472.59002685546875, 'logps/ref_chosen': -67.55347442626953, 'logps/ref_rejected': -87.58953857421875, 'logits/chosen': 0.3698510527610779, 'logits/rejected': 0.6439001560211182, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.004770186729729176, 'epsilon_dpo/loss_margin_mean': 133.9978485107422, 'epsilon_dpo/beta_margin_mean': 0.6352172493934631, 'epsilon_dpo/beta_margin_std': 0.8041839599609375, 'epsilon_dpo/beta_margin_grad_mean': -0.3630225360393524, 'epsilon_dpo/beta_margin_grad_std': 0.16115719079971313, 'kl/beta': 0.004801217466592789, 'kl/avg_steps': 0.65625, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▏ | 482/681 [34:14<08:55, 2.69s/it] 71%|███████████████████████████████████████████████████████▎ | 483/681 [34:17<08:46, 2.66s/it] {'loss': 0.8153, 'grad_norm': 32.81733322143555, 'learning_rate': 1.194847979251979e-07, 'rewards/chosen': -1.15290105342865, 'rewards/rejected': -2.0452911853790283, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8923901319503784, 'logps/chosen': -306.38372802734375, 'logps/rejected': -527.9395751953125, 'logps/ref_chosen': -63.32981872558594, 'logps/ref_rejected': -95.78697204589844, 'logits/chosen': 0.19232258200645447, 'logits/rejected': 0.6765154600143433, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 
'epsilon_dpo/beta': 0.004736104980111122, 'epsilon_dpo/loss_margin_mean': 189.09866333007812, 'epsilon_dpo/beta_margin_mean': 0.8923901319503784, 'epsilon_dpo/beta_margin_std': 0.8510040640830994, 'epsilon_dpo/beta_margin_grad_mean': -0.31758615374565125, 'epsilon_dpo/beta_margin_grad_std': 0.1468149721622467, 'kl/beta': 0.004769915249198675, 'kl/avg_steps': 0.71875, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▎ | 483/681 [34:17<08:46, 2.66s/it] 71%|███████████████████████████████████████████████████████▍ | 484/681 [34:19<08:28, 2.58s/it] {'loss': 0.8527, 'grad_norm': 38.810420989990234, 'learning_rate': 1.1839195928066101e-07, 'rewards/chosen': -1.0752531290054321, 'rewards/rejected': -1.857426404953003, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.7821731567382812, 'logps/chosen': -287.86785888671875, 'logps/rejected': -479.97686767578125, 'logps/ref_chosen': -59.13812255859375, 'logps/ref_rejected': -84.37144470214844, 'logits/chosen': 0.4545806348323822, 'logits/rejected': 0.8561559319496155, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.004699346609413624, 'epsilon_dpo/loss_margin_mean': 166.8756866455078, 'epsilon_dpo/beta_margin_mean': 0.7821731567382812, 'epsilon_dpo/beta_margin_std': 0.7183220386505127, 'epsilon_dpo/beta_margin_grad_mean': -0.3339075446128845, 'epsilon_dpo/beta_margin_grad_std': 0.13252882659435272, 'kl/beta': 0.004735875874757767, 'kl/avg_steps': 0.78125, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▍ | 484/681 [34:19<08:28, 2.58s/it] 71%|███████████████████████████████████████████████████████▌ | 485/681 [34:22<08:26, 2.58s/it] {'loss': 0.8269, 'grad_norm': 33.34450149536133, 'learning_rate': 1.1730258863039347e-07, 'rewards/chosen': -0.9522556066513062, 'rewards/rejected': -1.8350727558135986, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8828170299530029, 'logps/chosen': -262.7619934082031, 'logps/rejected': 
-497.1927490234375, 'logps/ref_chosen': -58.849571228027344, 'logps/ref_rejected': -103.36408996582031, 'logits/chosen': 0.3416202664375305, 'logits/rejected': 0.5913619995117188, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.004664386156946421, 'epsilon_dpo/loss_margin_mean': 189.91624450683594, 'epsilon_dpo/beta_margin_mean': 0.8828170299530029, 'epsilon_dpo/beta_margin_std': 0.8285797238349915, 'epsilon_dpo/beta_margin_grad_mean': -0.3178929388523102, 'epsilon_dpo/beta_margin_grad_std': 0.1579822152853012, 'kl/beta': 0.004699163604527712, 'kl/avg_steps': 0.75, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▌ | 485/681 [34:22<08:26, 2.58s/it] 71%|███████████████████████████████████████████████████████▋ | 486/681 [34:24<07:57, 2.45s/it] {'loss': 0.9184, 'grad_norm': 42.30367660522461, 'learning_rate': 1.1621671468032493e-07, 'rewards/chosen': -1.1533944606781006, 'rewards/rejected': -1.9386515617370605, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.78525710105896, 'logps/chosen': -303.60137939453125, 'logps/rejected': -510.9896240234375, 'logps/ref_chosen': -55.25966262817383, 'logps/ref_rejected': -92.13936614990234, 'logits/chosen': 0.42141079902648926, 'logits/rejected': 0.6843166351318359, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.004635494668036699, 'epsilon_dpo/loss_margin_mean': 170.508544921875, 'epsilon_dpo/beta_margin_mean': 0.78525710105896, 'epsilon_dpo/beta_margin_std': 0.9002077579498291, 'epsilon_dpo/beta_margin_grad_mean': -0.33765965700149536, 'epsilon_dpo/beta_margin_grad_std': 0.1793326586484909, 'kl/beta': 0.00466418219730258, 'kl/avg_steps': 0.625, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████▋ | 486/681 [34:24<07:57, 2.45s/it] 72%|███████████████████████████████████████████████████████▊ | 487/681 [34:27<08:19, 2.58s/it] {'loss': 0.856, 'grad_norm': 39.9567756652832, 'learning_rate': 1.1513436604424378e-07, 
'rewards/chosen': -1.1482787132263184, 'rewards/rejected': -1.908555269241333, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7602766156196594, 'logps/chosen': -302.55706787109375, 'logps/rejected': -507.72369384765625, 'logps/ref_chosen': -53.06330871582031, 'logps/ref_rejected': -92.4188232421875, 'logits/chosen': 0.5476217269897461, 'logits/rejected': 0.7340171337127686, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.004598011262714863, 'epsilon_dpo/loss_margin_mean': 165.81109619140625, 'epsilon_dpo/beta_margin_mean': 0.7602766156196594, 'epsilon_dpo/beta_margin_std': 0.6450428366661072, 'epsilon_dpo/beta_margin_grad_mean': -0.33197563886642456, 'epsilon_dpo/beta_margin_grad_std': 0.1348419189453125, 'kl/beta': 0.004635212477296591, 'kl/avg_steps': 0.8125, 'epoch': 0.72} 72%|███████████████████████████████████████████████████████▊ | 487/681 [34:27<08:19, 2.58s/it] 72%|███████████████████████████████████████████████████████▉ | 488/681 [34:30<08:33, 2.66s/it] {'loss': 0.8617, 'grad_norm': 36.45921325683594, 'learning_rate': 1.1405557124304335e-07, 'rewards/chosen': -1.0542188882827759, 'rewards/rejected': -1.7825837135314941, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7283648252487183, 'logps/chosen': -283.22467041015625, 'logps/rejected': -475.06427001953125, 'logps/ref_chosen': -52.228153228759766, 'logps/ref_rejected': -84.00656127929688, 'logits/chosen': 0.4700571894645691, 'logits/rejected': 0.8733597993850708, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.00456239003688097, 'epsilon_dpo/loss_margin_mean': 160.0612335205078, 'epsilon_dpo/beta_margin_mean': 0.7283648252487183, 'epsilon_dpo/beta_margin_std': 0.597629964351654, 'epsilon_dpo/beta_margin_grad_mean': -0.33869460225105286, 'epsilon_dpo/beta_margin_grad_std': 0.12046254426240921, 'kl/beta': 0.004597854800522327, 'kl/avg_steps': 0.78125, 'epoch': 0.72} 
72%|███████████████████████████████████████████████████████▉ | 488/681 [34:30<08:33, 2.66s/it] 72%|████████████████████████████████████████████████████████ | 489/681 [34:32<08:26, 2.64s/it] {'loss': 0.9453, 'grad_norm': 58.11296081542969, 'learning_rate': 1.1298035870396985e-07, 'rewards/chosen': -1.0114222764968872, 'rewards/rejected': -1.6499505043029785, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.6385283470153809, 'logps/chosen': -279.34716796875, 'logps/rejected': -444.14215087890625, 'logps/ref_chosen': -55.989627838134766, 'logps/ref_rejected': -79.39813232421875, 'logits/chosen': 0.3628634214401245, 'logits/rejected': 0.7215201258659363, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.004525597207248211, 'epsilon_dpo/loss_margin_mean': 141.3865203857422, 'epsilon_dpo/beta_margin_mean': 0.6385282874107361, 'epsilon_dpo/beta_margin_std': 0.6957613825798035, 'epsilon_dpo/beta_margin_grad_mean': -0.36369070410728455, 'epsilon_dpo/beta_margin_grad_std': 0.13181878626346588, 'kl/beta': 0.004562212619930506, 'kl/avg_steps': 0.8125, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████ | 489/681 [34:32<08:26, 2.64s/it] 72%|████████████████████████████████████████████████████████ | 490/681 [34:35<08:36, 2.71s/it] {'loss': 0.9136, 'grad_norm': 31.600603103637695, 'learning_rate': 1.1190875675987355e-07, 'rewards/chosen': -1.039729118347168, 'rewards/rejected': -1.7700657844543457, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7303366661071777, 'logps/chosen': -283.4697265625, 'logps/rejected': -504.7529602050781, 'logps/ref_chosen': -52.36639404296875, 'logps/ref_rejected': -110.40904998779297, 'logits/chosen': 0.5707692503929138, 'logits/rejected': 0.488492488861084, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.004493365529924631, 'epsilon_dpo/loss_margin_mean': 163.24058532714844, 'epsilon_dpo/beta_margin_mean': 0.7303367257118225, 
'epsilon_dpo/beta_margin_std': 0.7860758304595947, 'epsilon_dpo/beta_margin_grad_mean': -0.3462271988391876, 'epsilon_dpo/beta_margin_grad_std': 0.15604344010353088, 'kl/beta': 0.004525443073362112, 'kl/avg_steps': 0.71875, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████ | 490/681 [34:35<08:36, 2.71s/it] 72%|████████████████████████████████████████████████████████▏ | 491/681 [34:38<08:24, 2.65s/it] {'loss': 1.0265, 'grad_norm': 62.742305755615234, 'learning_rate': 1.1084079364846241e-07, 'rewards/chosen': -1.1519925594329834, 'rewards/rejected': -1.6978408098220825, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.5458483099937439, 'logps/chosen': -317.3819580078125, 'logps/rejected': -453.54693603515625, 'logps/ref_chosen': -60.11626434326172, 'logps/ref_rejected': -73.27278900146484, 'logits/chosen': 0.44595614075660706, 'logits/rejected': 0.9600294828414917, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.004472534172236919, 'epsilon_dpo/loss_margin_mean': 123.00843048095703, 'epsilon_dpo/beta_margin_mean': 0.5458483099937439, 'epsilon_dpo/beta_margin_std': 0.7266311645507812, 'epsilon_dpo/beta_margin_grad_mean': -0.38271310925483704, 'epsilon_dpo/beta_margin_grad_std': 0.14819550514221191, 'kl/beta': 0.004493148531764746, 'kl/avg_steps': 0.46875, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████▏ | 491/681 [34:38<08:24, 2.65s/it] 72%|████████████████████████████████████████████████████████▎ | 492/681 [34:40<08:23, 2.66s/it] {'loss': 1.1302, 'grad_norm': 45.599552154541016, 'learning_rate': 1.097764975115576e-07, 'rewards/chosen': -1.2058026790618896, 'rewards/rejected': -1.6747174263000488, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.46891486644744873, 'logps/chosen': -324.37713623046875, 'logps/rejected': -449.5838623046875, 'logps/ref_chosen': -53.99418258666992, 'logps/ref_rejected': -72.65962219238281, 'logits/chosen': 0.6272152662277222, 
'logits/rejected': 1.0693143606185913, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.004450269043445587, 'epsilon_dpo/loss_margin_mean': 106.54131317138672, 'epsilon_dpo/beta_margin_mean': 0.46891486644744873, 'epsilon_dpo/beta_margin_std': 0.8474681973457336, 'epsilon_dpo/beta_margin_grad_mean': -0.39979806542396545, 'epsilon_dpo/beta_margin_grad_std': 0.17615996301174164, 'kl/beta': 0.004472185391932726, 'kl/avg_steps': 0.5, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████▎ | 492/681 [34:40<08:23, 2.66s/it] 72%|████████████████████████████████████████████████████████▍ | 493/681 [34:43<08:37, 2.75s/it] {'loss': 0.9628, 'grad_norm': 57.87881088256836, 'learning_rate': 1.0871589639435203e-07, 'rewards/chosen': -1.1892423629760742, 'rewards/rejected': -1.7950613498687744, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6058189868927002, 'logps/chosen': -344.1697082519531, 'logps/rejected': -493.72991943359375, 'logps/ref_chosen': -75.49723815917969, 'logps/ref_rejected': -87.32301330566406, 'logits/chosen': 0.34055036306381226, 'logits/rejected': 0.8597230911254883, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.004423956852406263, 'epsilon_dpo/loss_margin_mean': 137.73443603515625, 'epsilon_dpo/beta_margin_mean': 0.605819046497345, 'epsilon_dpo/beta_margin_std': 0.6427890658378601, 'epsilon_dpo/beta_margin_grad_mean': -0.3644724488258362, 'epsilon_dpo/beta_margin_grad_std': 0.139420285820961, 'kl/beta': 0.004449935629963875, 'kl/avg_steps': 0.59375, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████▍ | 493/681 [34:43<08:37, 2.75s/it] 73%|████████████████████████████████████████████████████████▌ | 494/681 [34:46<08:27, 2.71s/it] {'loss': 0.8221, 'grad_norm': 57.35707473754883, 'learning_rate': 1.0765901824467166e-07, 'rewards/chosen': -1.0579063892364502, 'rewards/rejected': -1.8778282403945923, 'rewards/accuracies': 0.90625, 
'rewards/margins': 0.8199218511581421, 'logps/chosen': -282.2862548828125, 'logps/rejected': -514.359130859375, 'logps/ref_chosen': -41.35926818847656, 'logps/ref_rejected': -86.09136962890625, 'logits/chosen': 0.807897686958313, 'logits/rejected': 0.8155907392501831, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.004388166591525078, 'epsilon_dpo/loss_margin_mean': 187.34080505371094, 'epsilon_dpo/beta_margin_mean': 0.8199219107627869, 'epsilon_dpo/beta_margin_std': 0.6674307584762573, 'epsilon_dpo/beta_margin_grad_mean': -0.320593923330307, 'epsilon_dpo/beta_margin_grad_std': 0.13530012965202332, 'kl/beta': 0.004423670005053282, 'kl/avg_steps': 0.8125, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▌ | 494/681 [34:46<08:27, 2.71s/it] 73%|████████████████████████████████████████████████████████▋ | 495/681 [34:49<08:25, 2.72s/it] {'loss': 0.9158, 'grad_norm': 43.34193420410156, 'learning_rate': 1.0660589091223854e-07, 'rewards/chosen': -1.2132760286331177, 'rewards/rejected': -1.9149343967437744, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7016583681106567, 'logps/chosen': -341.39044189453125, 'logps/rejected': -531.0192260742188, 'logps/ref_chosen': -63.53507995605469, 'logps/ref_rejected': -91.42443084716797, 'logits/chosen': 0.3953931927680969, 'logits/rejected': 0.7432557940483093, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.004359656944870949, 'epsilon_dpo/loss_margin_mean': 161.7394256591797, 'epsilon_dpo/beta_margin_mean': 0.7016583681106567, 'epsilon_dpo/beta_margin_std': 0.729976236820221, 'epsilon_dpo/beta_margin_grad_mean': -0.3493143618106842, 'epsilon_dpo/beta_margin_grad_std': 0.14684849977493286, 'kl/beta': 0.004388017579913139, 'kl/avg_steps': 0.65625, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▋ | 495/681 [34:49<08:25, 2.72s/it] 73%|████████████████████████████████████████████████████████▊ | 496/681 
[34:51<08:24, 2.73s/it] {'loss': 0.9678, 'grad_norm': 47.66548156738281, 'learning_rate': 1.0555654214793722e-07, 'rewards/chosen': -1.277395486831665, 'rewards/rejected': -1.8552603721618652, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.577864944934845, 'logps/chosen': -366.8771057128906, 'logps/rejected': -512.6502075195312, 'logps/ref_chosen': -72.59192657470703, 'logps/ref_rejected': -84.32933807373047, 'logits/chosen': 0.2811706066131592, 'logits/rejected': 0.7893705368041992, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.004335320554673672, 'epsilon_dpo/loss_margin_mean': 134.03567504882812, 'epsilon_dpo/beta_margin_mean': 0.5778650045394897, 'epsilon_dpo/beta_margin_std': 0.5912160873413086, 'epsilon_dpo/beta_margin_grad_mean': -0.3704771101474762, 'epsilon_dpo/beta_margin_grad_std': 0.12676607072353363, 'kl/beta': 0.004359408747404814, 'kl/avg_steps': 0.5625, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▊ | 496/681 [34:51<08:24, 2.73s/it] 73%|████████████████████████████████████████████████████████▉ | 497/681 [34:54<08:22, 2.73s/it] {'loss': 1.0195, 'grad_norm': 44.63701248168945, 'learning_rate': 1.0451099960308374e-07, 'rewards/chosen': -1.3173977136611938, 'rewards/rejected': -1.8486762046813965, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.5312784314155579, 'logps/chosen': -363.6978454589844, 'logps/rejected': -505.45977783203125, 'logps/ref_chosen': -58.593971252441406, 'logps/ref_rejected': -76.28836822509766, 'logits/chosen': 0.5063110589981079, 'logits/rejected': 0.9023481607437134, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0043083615601062775, 'epsilon_dpo/loss_margin_mean': 124.06751251220703, 'epsilon_dpo/beta_margin_mean': 0.5312784314155579, 'epsilon_dpo/beta_margin_std': 0.6518020033836365, 'epsilon_dpo/beta_margin_grad_mean': -0.38184332847595215, 'epsilon_dpo/beta_margin_grad_std': 0.14038600027561188, 'kl/beta': 
0.004335024394094944, 'kl/avg_steps': 0.625, 'epoch': 0.73} 73%|████████████████████████████████████████████████████████▉ | 497/681 [34:54<08:22, 2.73s/it] 73%|█████████████████████████████████████████████████████████ | 498/681 [34:57<08:25, 2.76s/it] {'loss': 0.9804, 'grad_norm': 52.54181671142578, 'learning_rate': 1.0346929082869641e-07, 'rewards/chosen': -1.2692654132843018, 'rewards/rejected': -1.8775451183319092, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.6082795858383179, 'logps/chosen': -367.08575439453125, 'logps/rejected': -522.775390625, 'logps/ref_chosen': -71.20565795898438, 'logps/ref_rejected': -83.95803833007812, 'logits/chosen': 0.5228114724159241, 'logits/rejected': 0.9948470592498779, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.004284294322133064, 'epsilon_dpo/loss_margin_mean': 142.9373016357422, 'epsilon_dpo/beta_margin_mean': 0.6082795858383179, 'epsilon_dpo/beta_margin_std': 0.7224895358085632, 'epsilon_dpo/beta_margin_grad_mean': -0.36809319257736206, 'epsilon_dpo/beta_margin_grad_std': 0.14858005940914154, 'kl/beta': 0.004308098927140236, 'kl/avg_steps': 0.5625, 'epoch': 0.73} 73%|█████████████████████████████████████████████████████████ | 498/681 [34:57<08:25, 2.76s/it] 73%|█████████████████████████████████████████████████████████▏ | 499/681 [35:00<08:14, 2.72s/it] {'loss': 0.9273, 'grad_norm': 46.36258316040039, 'learning_rate': 1.0243144327477013e-07, 'rewards/chosen': -1.2557456493377686, 'rewards/rejected': -1.949074149131775, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6933284997940063, 'logps/chosen': -345.83740234375, 'logps/rejected': -559.477783203125, 'logps/ref_chosen': -51.25519561767578, 'logps/ref_rejected': -101.07870483398438, 'logits/chosen': 0.7224333882331848, 'logits/rejected': 0.7317450046539307, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.00425497442483902, 'epsilon_dpo/loss_margin_mean': 163.8168182373047, 
'epsilon_dpo/beta_margin_mean': 0.6933284997940063, 'epsilon_dpo/beta_margin_std': 0.732176661491394, 'epsilon_dpo/beta_margin_grad_mean': -0.3490248918533325, 'epsilon_dpo/beta_margin_grad_std': 0.15517489612102509, 'kl/beta': 0.004284001421183348, 'kl/avg_steps': 0.6875, 'epoch': 0.73}
 73%| 499/681 [35:00<08:14, 2.72s/it]
 73%| 500/681 [35:02<08:08, 2.70s/it]
{'loss': 0.9214, 'grad_norm': 33.654090881347656, 'learning_rate': 1.0139748428955333e-07, 'rewards/chosen': -1.3463799953460693, 'rewards/rejected': -2.02809476852417, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6817148923873901, 'logps/chosen': -375.20428466796875, 'logps/rejected': -574.1239013671875, 'logps/ref_chosen': -57.027442932128906, 'logps/ref_rejected': -93.93421173095703, 'logits/chosen': 0.6236047744750977, 'logits/rejected': 0.7619317173957825, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0042272512800991535, 'epsilon_dpo/loss_margin_mean': 162.0128173828125, 'epsilon_dpo/beta_margin_mean': 0.6817148327827454, 'epsilon_dpo/beta_margin_std': 0.7146762013435364, 'epsilon_dpo/beta_margin_grad_mean': -0.3537231981754303, 'epsilon_dpo/beta_margin_grad_std': 0.13802658021450043, 'kl/beta': 0.0042547499760985374, 'kl/avg_steps': 0.65625, 'epoch': 0.73}
[INFO|trainer.py:4307] 2026-04-18 01:13:11,981 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-18 01:13:11,981 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-18 01:13:11,981 >> Batch size = 8
  0%| 0/73 [00:00<?, ?it/s]
[INFO|trainer.py:4307] 2026-04-18 01:18:23,773 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-18 01:18:23,773 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-18 01:18:23,773 >> Batch size = 8
  0%| 0/73 [00:00<?, ?it/s]
>> Saving model checkpoint to
/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-600
[INFO|configuration_utils.py:419] 2026-04-18 01:19:21,982 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-600/config.json
[INFO|configuration_utils.py:911] 2026-04-18 01:19:21,997 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-600/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-18 01:20:09,590 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split into 6 checkpoint shards. You can find where each parameter has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-600/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-18 01:20:09,624 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-600/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-18 01:20:09,648 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-600/special_tokens_map.json
[INFO|trainer.py:4083] 2026-04-18 01:23:42,026 >> Deleting older checkpoint [/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-200] due to args.save_total_limit
 88%| 601/681 [45:37<2:11:56, 98.95s/it]
{'loss': 1.0689, 'grad_norm': 25.276933670043945, 'learning_rate': 2.1301532877994742e-08, 'rewards/chosen': -0.886534571647644, 'rewards/rejected': -1.300047755241394,
'rewards/accuracies': 0.8125, 'rewards/margins': 0.4135131239891052, 'logps/chosen': -459.1831970214844, 'logps/rejected': -682.9464111328125, 'logps/ref_chosen': -59.13360595703125, 'logps/ref_rejected': -94.69093322753906, 'logits/chosen': 1.3206804990768433, 'logits/rejected': 1.483794093132019, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.002211540238931775, 'epsilon_dpo/loss_margin_mean': 188.20594787597656, 'epsilon_dpo/beta_margin_mean': 0.4135131537914276, 'epsilon_dpo/beta_margin_std': 0.48056861758232117, 'epsilon_dpo/beta_margin_grad_mean': -0.40351107716560364, 'epsilon_dpo/beta_margin_grad_std': 0.10921073704957962, 'kl/beta': 0.0022238281089812517, 'kl/avg_steps': 0.5625, 'epoch': 0.88} 88%|███████████████████████████████████████████████████████████████████ | 601/681 [45:37<2:11:56, 98.95s/it] 88%|███████████████████████████████████████████████████████████████████▏ | 602/681 [45:40<1:32:15, 70.07s/it] {'loss': 1.005, 'grad_norm': 31.37278175354004, 'learning_rate': 2.0786184285784298e-08, 'rewards/chosen': -0.7031533122062683, 'rewards/rejected': -1.1950111389160156, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.4918578863143921, 'logps/chosen': -368.17822265625, 'logps/rejected': -632.10400390625, 'logps/ref_chosen': -48.59352111816406, 'logps/ref_rejected': -87.6685562133789, 'logits/chosen': 1.2354438304901123, 'logits/rejected': 1.4991683959960938, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.002197787631303072, 'epsilon_dpo/loss_margin_mean': 224.8507843017578, 'epsilon_dpo/beta_margin_mean': 0.4918578863143921, 'epsilon_dpo/beta_margin_std': 0.47307488322257996, 'epsilon_dpo/beta_margin_grad_mean': -0.3858841061592102, 'epsilon_dpo/beta_margin_grad_std': 0.10464838147163391, 'kl/beta': 0.0022113891318440437, 'kl/avg_steps': 0.625, 'epoch': 0.88} 88%|███████████████████████████████████████████████████████████████████▏ | 602/681 [45:40<1:32:15, 70.07s/it] 
89%|███████████████████████████████████████████████████████████████████▎ | 603/681 [45:43<1:04:44, 49.81s/it] {'loss': 1.0967, 'grad_norm': 38.248104095458984, 'learning_rate': 2.0276875690788204e-08, 'rewards/chosen': -0.7941126823425293, 'rewards/rejected': -1.1842149496078491, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.3901022970676422, 'logps/chosen': -432.961181640625, 'logps/rejected': -642.5441284179688, 'logps/ref_chosen': -70.41461944580078, 'logps/ref_rejected': -100.32560729980469, 'logits/chosen': 0.6783925294876099, 'logits/rejected': 1.2061731815338135, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0021861973218619823, 'epsilon_dpo/loss_margin_mean': 179.67196655273438, 'epsilon_dpo/beta_margin_mean': 0.3901022970676422, 'epsilon_dpo/beta_margin_std': 0.5260531902313232, 'epsilon_dpo/beta_margin_grad_mean': -0.4108930826187134, 'epsilon_dpo/beta_margin_grad_std': 0.11496561765670776, 'kl/beta': 0.002197653753682971, 'kl/avg_steps': 0.53125, 'epoch': 0.89} 89%|███████████████████████████████████████████████████████████████████▎ | 603/681 [45:43<1:04:44, 49.81s/it] 89%|█████████████████████████████████████████████████████████████████████▏ | 604/681 [45:45<45:48, 35.69s/it] {'loss': 1.0241, 'grad_norm': 33.354339599609375, 'learning_rate': 1.977362051376158e-08, 'rewards/chosen': -0.7053238153457642, 'rewards/rejected': -1.1758427619934082, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.47051894664764404, 'logps/chosen': -370.6407470703125, 'logps/rejected': -633.7752075195312, 'logps/ref_chosen': -46.45808029174805, 'logps/ref_rejected': -91.8544921875, 'logits/chosen': 1.2945600748062134, 'logits/rejected': 1.3630659580230713, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0021719117648899555, 'epsilon_dpo/loss_margin_mean': 217.738037109375, 'epsilon_dpo/beta_margin_mean': 0.47051891684532166, 'epsilon_dpo/beta_margin_std': 0.48104289174079895, 
'epsilon_dpo/beta_margin_grad_mean': -0.39027997851371765, 'epsilon_dpo/beta_margin_grad_std': 0.10892639309167862, 'kl/beta': 0.0021860403940081596, 'kl/avg_steps': 0.65625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▏ | 604/681 [45:45<45:48, 35.69s/it] 89%|█████████████████████████████████████████████████████████████████████▎ | 605/681 [45:48<32:42, 25.82s/it] {'loss': 1.0539, 'grad_norm': 30.262615203857422, 'learning_rate': 1.9276432015946446e-08, 'rewards/chosen': -0.8125105500221252, 'rewards/rejected': -1.2484371662139893, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.4359266459941864, 'logps/chosen': -442.30731201171875, 'logps/rejected': -681.3118896484375, 'logps/ref_chosen': -66.24933624267578, 'logps/ref_rejected': -102.30496978759766, 'logits/chosen': 0.8817493319511414, 'logits/rejected': 1.3649321794509888, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0021577514708042145, 'epsilon_dpo/loss_margin_mean': 202.94888305664062, 'epsilon_dpo/beta_margin_mean': 0.4359266459941864, 'epsilon_dpo/beta_margin_std': 0.5075231194496155, 'epsilon_dpo/beta_margin_grad_mean': -0.40043067932128906, 'epsilon_dpo/beta_margin_grad_std': 0.10629180818796158, 'kl/beta': 0.0021717881318181753, 'kl/avg_steps': 0.65625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▎ | 605/681 [45:48<32:42, 25.82s/it] 89%|█████████████████████████████████████████████████████████████████████▍ | 606/681 [45:51<23:32, 18.84s/it] {'loss': 1.0297, 'grad_norm': 30.18663215637207, 'learning_rate': 1.8785323298722093e-08, 'rewards/chosen': -0.7678847312927246, 'rewards/rejected': -1.2263665199279785, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.4584817886352539, 'logps/chosen': -412.505126953125, 'logps/rejected': -670.90869140625, 'logps/ref_chosen': -54.819122314453125, 'logps/ref_rejected': -98.37147521972656, 'logits/chosen': 1.1324771642684937, 
'logits/rejected': 1.3250871896743774, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.002144357655197382, 'epsilon_dpo/loss_margin_mean': 214.8511962890625, 'epsilon_dpo/beta_margin_mean': 0.4584818184375763, 'epsilon_dpo/beta_margin_std': 0.4664647877216339, 'epsilon_dpo/beta_margin_grad_mean': -0.3933078348636627, 'epsilon_dpo/beta_margin_grad_std': 0.10445917397737503, 'kl/beta': 0.0021576285362243652, 'kl/avg_steps': 0.625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▍ | 606/681 [45:51<23:32, 18.84s/it] 89%|█████████████████████████████████████████████████████████████████████▌ | 607/681 [45:54<17:23, 14.11s/it] {'loss': 1.0947, 'grad_norm': 27.074480056762695, 'learning_rate': 1.8300307303259904e-08, 'rewards/chosen': -0.7858235836029053, 'rewards/rejected': -1.1412138938903809, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.3553902506828308, 'logps/chosen': -426.2294616699219, 'logps/rejected': -615.494140625, 'logps/ref_chosen': -58.08403778076172, 'logps/ref_rejected': -79.777099609375, 'logits/chosen': 1.0228676795959473, 'logits/rejected': 1.6952282190322876, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0021323792170733213, 'epsilon_dpo/loss_margin_mean': 167.57159423828125, 'epsilon_dpo/beta_margin_mean': 0.3553902506828308, 'epsilon_dpo/beta_margin_std': 0.37019211053848267, 'epsilon_dpo/beta_margin_grad_mean': -0.4152858853340149, 'epsilon_dpo/beta_margin_grad_std': 0.08593456447124481, 'kl/beta': 0.002144227270036936, 'kl/avg_steps': 0.5625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▌ | 607/681 [45:54<17:23, 14.11s/it] 89%|█████████████████████████████████████████████████████████████████████▋ | 608/681 [45:56<12:59, 10.68s/it] {'loss': 1.0573, 'grad_norm': 36.970916748046875, 'learning_rate': 1.7821396810182437e-08, 'rewards/chosen': -0.7769819498062134, 'rewards/rejected': 
-1.1765248775482178, 'rewards/accuracies': 0.875, 'rewards/margins': 0.39954301714897156, 'logps/chosen': -424.0223388671875, 'logps/rejected': -650.75927734375, 'logps/ref_chosen': -57.450836181640625, 'logps/ref_rejected': -94.77339172363281, 'logits/chosen': 0.9039729833602905, 'logits/rejected': 1.3142175674438477, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.002117785857990384, 'epsilon_dpo/loss_margin_mean': 189.41441345214844, 'epsilon_dpo/beta_margin_mean': 0.39954301714897156, 'epsilon_dpo/beta_margin_std': 0.361564576625824, 'epsilon_dpo/beta_margin_grad_mean': -0.4044342637062073, 'epsilon_dpo/beta_margin_grad_std': 0.08454929292201996, 'kl/beta': 0.0021322332322597504, 'kl/avg_steps': 0.6875, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▋ | 608/681 [45:56<12:59, 10.68s/it] 89%|█████████████████████████████████████████████████████████████████████▊ | 609/681 [45:59<09:50, 8.20s/it] {'loss': 1.033, 'grad_norm': 31.127103805541992, 'learning_rate': 1.7348604439226617e-08, 'rewards/chosen': -0.7576903104782104, 'rewards/rejected': -1.2041070461273193, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.4464166760444641, 'logps/chosen': -418.4666442871094, 'logps/rejected': -661.6102294921875, 'logps/ref_chosen': -58.805355072021484, 'logps/ref_rejected': -88.81600952148438, 'logits/chosen': 1.02443528175354, 'logits/rejected': 1.3518232107162476, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.002102663740515709, 'epsilon_dpo/loss_margin_mean': 213.13291931152344, 'epsilon_dpo/beta_margin_mean': 0.4464167058467865, 'epsilon_dpo/beta_margin_std': 0.43577244877815247, 'epsilon_dpo/beta_margin_grad_mean': -0.3947716951370239, 'epsilon_dpo/beta_margin_grad_std': 0.09765525162220001, 'kl/beta': 0.002117674332112074, 'kl/avg_steps': 0.71875, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████▊ | 609/681 
[45:59<09:50, 8.20s/it] 90%|█████████████████████████████████████████████████████████████████████▊ | 610/681 [46:01<07:42, 6.52s/it] {'loss': 1.0964, 'grad_norm': 23.951202392578125, 'learning_rate': 1.6881942648911074e-08, 'rewards/chosen': -0.736303985118866, 'rewards/rejected': -1.1059353351593018, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.36963140964508057, 'logps/chosen': -417.2803955078125, 'logps/rejected': -613.0189819335938, 'logps/ref_chosen': -65.69503784179688, 'logps/ref_rejected': -83.4053955078125, 'logits/chosen': 0.9360833168029785, 'logits/rejected': 1.605763554573059, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.002089630113914609, 'epsilon_dpo/loss_margin_mean': 178.02821350097656, 'epsilon_dpo/beta_margin_mean': 0.36963143944740295, 'epsilon_dpo/beta_margin_std': 0.44020283222198486, 'epsilon_dpo/beta_margin_grad_mean': -0.4126569330692291, 'epsilon_dpo/beta_margin_grad_std': 0.1021689847111702, 'kl/beta': 0.002102562226355076, 'kl/avg_steps': 0.625, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████▊ | 610/681 [46:01<07:42, 6.52s/it] 90%|█████████████████████████████████████████████████████████████████████▉ | 611/681 [46:04<06:09, 5.28s/it] {'loss': 1.0646, 'grad_norm': 23.67366600036621, 'learning_rate': 1.6421423736208e-08, 'rewards/chosen': -0.7785675525665283, 'rewards/rejected': -1.1844526529312134, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.40588513016700745, 'logps/chosen': -427.18963623046875, 'logps/rejected': -657.4437255859375, 'logps/ref_chosen': -52.59947204589844, 'logps/ref_rejected': -86.33099365234375, 'logits/chosen': 1.289412021636963, 'logits/rejected': 1.4389369487762451, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.002075997879728675, 'epsilon_dpo/loss_margin_mean': 196.52256774902344, 'epsilon_dpo/beta_margin_mean': 0.40588513016700745, 'epsilon_dpo/beta_margin_std': 0.4290505051612854, 
'epsilon_dpo/beta_margin_grad_mean': -0.40418821573257446, 'epsilon_dpo/beta_margin_grad_std': 0.09916716814041138, 'kl/beta': 0.0020895027555525303, 'kl/avg_steps': 0.65625, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████▉ | 611/681 [46:04<06:09, 5.28s/it] 90%|██████████████████████████████████████████████████████████████████████ | 612/681 [46:06<05:06, 4.44s/it] {'loss': 1.0247, 'grad_norm': 27.82708740234375, 'learning_rate': 1.5967059836219042e-08, 'rewards/chosen': -0.8178003430366516, 'rewards/rejected': -1.2739319801330566, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.4561316967010498, 'logps/chosen': -454.82489013671875, 'logps/rejected': -706.016357421875, 'logps/ref_chosen': -59.32372283935547, 'logps/ref_rejected': -88.31239318847656, 'logits/chosen': 1.0549695491790771, 'logits/rejected': 1.8483834266662598, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0020631118677556515, 'epsilon_dpo/loss_margin_mean': 222.2028045654297, 'epsilon_dpo/beta_margin_mean': 0.4561316967010498, 'epsilon_dpo/beta_margin_std': 0.4290175139904022, 'epsilon_dpo/beta_margin_grad_mean': -0.3926045596599579, 'epsilon_dpo/beta_margin_grad_std': 0.09865017980337143, 'kl/beta': 0.0020758798345923424, 'kl/avg_steps': 0.625, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████ | 612/681 [46:06<05:06, 4.44s/it] 90%|██████████████████████████████████████████████████████████████████████▏ | 613/681 [46:09<04:25, 3.91s/it] {'loss': 1.0419, 'grad_norm': 23.780893325805664, 'learning_rate': 1.551886292185553e-08, 'rewards/chosen': -0.7217282056808472, 'rewards/rejected': -1.1425280570983887, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.4207998812198639, 'logps/chosen': -412.083740234375, 'logps/rejected': -663.5151977539062, 'logps/ref_chosen': -59.72996520996094, 'logps/ref_rejected': -105.10753631591797, 'logits/chosen': 1.0429415702819824, 'logits/rejected': 
1.2024908065795898, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0020477185025811195, 'epsilon_dpo/loss_margin_mean': 206.05389404296875, 'epsilon_dpo/beta_margin_mean': 0.4207998812198639, 'epsilon_dpo/beta_margin_std': 0.3745913803577423, 'epsilon_dpo/beta_margin_grad_mean': -0.4002331495285034, 'epsilon_dpo/beta_margin_grad_std': 0.08480419963598251, 'kl/beta': 0.0020629861392080784, 'kl/avg_steps': 0.75, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▏ | 613/681 [46:09<04:25, 3.91s/it] 90%|██████████████████████████████████████████████████████████████████████▎ | 614/681 [46:12<03:56, 3.54s/it] {'loss': 1.034, 'grad_norm': 31.241605758666992, 'learning_rate': 1.507684480352292e-08, 'rewards/chosen': -0.7898073196411133, 'rewards/rejected': -1.237969994544983, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.44816261529922485, 'logps/chosen': -439.93914794921875, 'logps/rejected': -713.0341796875, 'logps/ref_chosen': -52.93898010253906, 'logps/ref_rejected': -104.67938232421875, 'logits/chosen': 1.2069741487503052, 'logits/rejected': 1.3134090900421143, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0020369545090943575, 'epsilon_dpo/loss_margin_mean': 221.35459899902344, 'epsilon_dpo/beta_margin_mean': 0.44816258549690247, 'epsilon_dpo/beta_margin_std': 0.4428095519542694, 'epsilon_dpo/beta_margin_grad_mean': -0.39440545439720154, 'epsilon_dpo/beta_margin_grad_std': 0.10275840014219284, 'kl/beta': 0.002047628862783313, 'kl/avg_steps': 0.53125, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▎ | 614/681 [46:12<03:56, 3.54s/it] 90%|██████████████████████████████████████████████████████████████████████▍ | 615/681 [46:14<03:37, 3.30s/it] {'loss': 1.1252, 'grad_norm': 22.429214477539062, 'learning_rate': 1.4641017128809801e-08, 'rewards/chosen': -0.7508677244186401, 'rewards/rejected': -1.0793850421905518, 
'rewards/accuracies': 0.84375, 'rewards/margins': 0.328517347574234, 'logps/chosen': -436.15289306640625, 'logps/rejected': -628.677001953125, 'logps/ref_chosen': -65.81727600097656, 'logps/ref_rejected': -95.17749786376953, 'logits/chosen': 1.0206799507141113, 'logits/rejected': 1.1133084297180176, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0020242806058377028, 'epsilon_dpo/loss_margin_mean': 163.16390991210938, 'epsilon_dpo/beta_margin_mean': 0.328517347574234, 'epsilon_dpo/beta_margin_std': 0.420515775680542, 'epsilon_dpo/beta_margin_grad_mean': -0.42308035492897034, 'epsilon_dpo/beta_margin_grad_std': 0.09278630465269089, 'kl/beta': 0.002036808291450143, 'kl/avg_steps': 0.625, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▍ | 615/681 [46:14<03:37, 3.30s/it] 90%|██████████████████████████████████████████████████████████████████████▌ | 616/681 [46:17<03:24, 3.15s/it] {'loss': 1.1731, 'grad_norm': 30.985132217407227, 'learning_rate': 1.4211391382180637e-08, 'rewards/chosen': -0.874975323677063, 'rewards/rejected': -1.148017406463623, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.27304205298423767, 'logps/chosen': -499.29022216796875, 'logps/rejected': -645.5740966796875, 'logps/ref_chosen': -65.13285827636719, 'logps/ref_rejected': -74.70050048828125, 'logits/chosen': 1.1943162679672241, 'logits/rejected': 1.915705919265747, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.002012972952798009, 'epsilon_dpo/loss_margin_mean': 136.7162322998047, 'epsilon_dpo/beta_margin_mean': 0.27304208278656006, 'epsilon_dpo/beta_margin_std': 0.41767895221710205, 'epsilon_dpo/beta_margin_grad_mean': -0.4355887174606323, 'epsilon_dpo/beta_margin_grad_std': 0.09638901799917221, 'kl/beta': 0.00202415743842721, 'kl/avg_steps': 0.5625, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████▌ | 616/681 [46:17<03:24, 3.15s/it] 
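As a quick sanity check on the metric stream, the logged fields are internally consistent: in each step dict, `rewards/margins` matches `rewards/chosen - rewards/rejected`, and `epsilon_dpo/loss_margin_mean` matches the reference-adjusted log-prob gap `(logps/chosen - logps/ref_chosen) - (logps/rejected - logps/ref_rejected)`. A minimal check against the step-616 values above, with the field semantics inferred from the names:

```python
# Values copied from the step-616 dict in the log above; the relations checked
# below are inferred from the field names, not from the training code itself.
step = {
    'rewards/chosen': -0.874975323677063,
    'rewards/rejected': -1.148017406463623,
    'rewards/margins': 0.27304205298423767,
    'logps/chosen': -499.29022216796875,
    'logps/rejected': -645.5740966796875,
    'logps/ref_chosen': -65.13285827636719,
    'logps/ref_rejected': -74.70050048828125,
    'epsilon_dpo/loss_margin_mean': 136.7162322998047,
}

# rewards/margins is the chosen-minus-rejected reward gap.
reward_margin = step['rewards/chosen'] - step['rewards/rejected']
assert abs(reward_margin - step['rewards/margins']) < 1e-6

# loss_margin_mean is the reference-adjusted log-prob margin, i.e. the
# preference margin before scaling by the dynamic beta.
logp_margin = ((step['logps/chosen'] - step['logps/ref_chosen'])
               - (step['logps/rejected'] - step['logps/ref_rejected']))
assert abs(logp_margin - step['epsilon_dpo/loss_margin_mean']) < 1e-6
print('step-616 metrics are self-consistent')
```

The same identities hold (to float32 print precision) for the other steps in this excerpt, e.g. step 491 gives a log-prob margin of 123.008 against the logged 123.00843.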
91%|██████████████████████████████████████████████████████████████████████▋ | 617/681 [46:20<03:14, 3.05s/it] {'loss': 1.1713, 'grad_norm': 37.784576416015625, 'learning_rate': 1.378797888467345e-08, 'rewards/chosen': -0.8111328482627869, 'rewards/rejected': -1.0775700807571411, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.26643723249435425, 'logps/chosen': -466.6313781738281, 'logps/rejected': -601.8916625976562, 'logps/ref_chosen': -63.005550384521484, 'logps/ref_rejected': -64.234130859375, 'logits/chosen': 1.0707861185073853, 'logits/rejected': 1.959951400756836, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.002005487447604537, 'epsilon_dpo/loss_margin_mean': 134.0316925048828, 'epsilon_dpo/beta_margin_mean': 0.26643723249435425, 'epsilon_dpo/beta_margin_std': 0.3760392665863037, 'epsilon_dpo/beta_margin_grad_mean': -0.43639859557151794, 'epsilon_dpo/beta_margin_grad_std': 0.08785145729780197, 'kl/beta': 0.002012835117056966, 'kl/avg_steps': 0.375, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████▋ | 617/681 [46:20<03:14, 3.05s/it] 91%|██████████████████████████████████████████████████████████████████████▊ | 618/681 [46:23<03:08, 2.99s/it] {'loss': 1.0813, 'grad_norm': 46.317352294921875, 'learning_rate': 1.3370790793601371e-08, 'rewards/chosen': -0.8346935510635376, 'rewards/rejected': -1.2265982627868652, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.39190471172332764, 'logps/chosen': -485.0655822753906, 'logps/rejected': -707.9747314453125, 'logps/ref_chosen': -67.10135650634766, 'logps/ref_rejected': -92.15339660644531, 'logits/chosen': 1.0298012495040894, 'logits/rejected': 1.473135232925415, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.001993608195334673, 'epsilon_dpo/loss_margin_mean': 197.85704040527344, 'epsilon_dpo/beta_margin_mean': 0.39190471172332764, 'epsilon_dpo/beta_margin_std': 0.4546877443790436, 
'epsilon_dpo/beta_margin_grad_mean': -0.40763044357299805, 'epsilon_dpo/beta_margin_grad_std': 0.1053873673081398, 'kl/beta': 0.002005315385758877, 'kl/avg_steps': 0.59375, 'epoch': 0.91}
91% 618/681 [46:23<03:08, 2.99s/it]
91% 619/681 [46:26<03:04, 2.97s/it] {'loss': 1.1228, 'grad_norm': 25.143455505371094, 'learning_rate': 1.2959838102258535e-08, 'rewards/chosen': -0.7980378866195679, 'rewards/rejected': -1.1399478912353516, 'rewards/accuracies': 0.75, 'rewards/margins': 0.3419099450111389, 'logps/chosen': -457.4405517578125, 'logps/rejected': -668.3460693359375, 'logps/ref_chosen': -55.978233337402344, 'logps/ref_rejected': -93.1854019165039, 'logits/chosen': 1.0666018724441528, 'logits/rejected': 1.462690830230713, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.001983709866181016, 'epsilon_dpo/loss_margin_mean': 173.6983642578125, 'epsilon_dpo/beta_margin_mean': 0.3419099450111389, 'epsilon_dpo/beta_margin_std': 0.45738500356674194, 'epsilon_dpo/beta_margin_grad_mean': -0.4199216365814209, 'epsilon_dpo/beta_margin_grad_std': 0.10553835332393646, 'kl/beta': 0.001993478974327445, 'kl/avg_steps': 0.5, 'epoch': 0.91}
91% 620/681 [46:28<02:54, 2.87s/it] {'loss': 1.0958, 'grad_norm': 34.329288482666016, 'learning_rate': 1.2555131639630567e-08, 'rewards/chosen': -0.7542145252227783, 'rewards/rejected': -1.1289727687835693, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.374758243560791, 'logps/chosen': -441.5105895996094, 'logps/rejected': -651.3795166015625, 'logps/ref_chosen': -59.79750061035156, 'logps/ref_rejected': -78.41075134277344, 'logits/chosen': 1.2032023668289185, 'logits/rejected': 1.7684378623962402, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.001972600817680359, 'epsilon_dpo/loss_margin_mean': 191.25567626953125, 'epsilon_dpo/beta_margin_mean': 0.37475821375846863, 'epsilon_dpo/beta_margin_std': 0.4594890773296356, 'epsilon_dpo/beta_margin_grad_mean': -0.4120746850967407, 'epsilon_dpo/beta_margin_grad_std': 0.10483256727457047, 'kl/beta': 0.001983561087399721, 'kl/avg_steps': 0.5625, 'epoch': 0.91}
91% 621/681 [46:31<02:49, 2.83s/it] {'loss': 1.0227, 'grad_norm': 31.05923843383789, 'learning_rate': 1.2156682070109086e-08, 'rewards/chosen': -0.7442139387130737, 'rewards/rejected': -1.2052245140075684, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.46101054549217224, 'logps/chosen': -433.49237060546875, 'logps/rejected': -704.26220703125, 'logps/ref_chosen': -53.933753967285156, 'logps/ref_rejected': -88.36952209472656, 'logits/chosen': 1.324573278427124, 'logits/rejected': 1.5857019424438477, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.001958484761416912, 'epsilon_dpo/loss_margin_mean': 236.33409118652344, 'epsilon_dpo/beta_margin_mean': 0.46101054549217224, 'epsilon_dpo/beta_margin_std': 0.44019511342048645, 'epsilon_dpo/beta_margin_grad_mean': -0.3918313682079315, 'epsilon_dpo/beta_margin_grad_std': 0.09972291439771652, 'kl/beta': 0.0019724660087376833, 'kl/avg_steps': 0.71875, 'epoch': 0.91}
91% 622/681 [46:34<02:43, 2.78s/it] {'loss': 1.0998, 'grad_norm': 23.87201499938965, 'learning_rate': 1.1764499893210878e-08, 'rewards/chosen': -0.7119662761688232, 'rewards/rejected': -1.0652704238891602, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.35330408811569214, 'logps/chosen': -425.43084716796875, 'logps/rejected': -633.0950317382812, 'logps/ref_chosen': -60.28582000732422, 'logps/ref_rejected': -85.51873779296875, 'logits/chosen': 0.9237732887268066, 'logits/rejected': 1.5608757734298706, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0019463448552414775, 'epsilon_dpo/loss_margin_mean': 182.4312744140625, 'epsilon_dpo/beta_margin_mean': 0.35330408811569214, 'epsilon_dpo/beta_margin_std': 0.3922674357891083, 'epsilon_dpo/beta_margin_grad_mean': -0.41644492745399475, 'epsilon_dpo/beta_margin_grad_std': 0.08895543962717056, 'kl/beta': 0.0019583902321755886, 'kl/avg_steps': 0.625, 'epoch': 0.91}
91% 623/681 [46:36<02:34, 2.67s/it] {'loss': 1.1568, 'grad_norm': 26.694618225097656, 'learning_rate': 1.1378595443300998e-08, 'rewards/chosen': -0.7869951725006104, 'rewards/rejected': -1.081409215927124, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.2944139540195465, 'logps/chosen': -470.3033447265625, 'logps/rejected': -644.4393310546875, 'logps/ref_chosen': -64.15696716308594, 'logps/ref_rejected': -85.08304595947266, 'logits/chosen': 1.110864520072937, 'logits/rejected': 1.6331336498260498, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0019348639762029052, 'epsilon_dpo/loss_margin_mean': 153.20993041992188, 'epsilon_dpo/beta_margin_mean': 0.2944139540195465, 'epsilon_dpo/beta_margin_std': 0.4290635883808136, 'epsilon_dpo/beta_margin_grad_mean': -0.4306987226009369, 'epsilon_dpo/beta_margin_grad_std': 0.09799355268478394, 'kl/beta': 0.0019462262280285358, 'kl/avg_steps': 0.59375, 'epoch': 0.91}
92% 624/681 [46:39<02:31, 2.65s/it] {'loss': 1.0441, 'grad_norm': 36.52444839477539, 'learning_rate': 1.0998978889320582e-08, 'rewards/chosen': -0.7600715160369873, 'rewards/rejected': -1.1889166831970215, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.4288451075553894, 'logps/chosen': -467.0516357421875, 'logps/rejected': -716.4813232421875, 'logps/ref_chosen': -71.91862487792969, 'logps/ref_rejected': -97.13203430175781, 'logits/chosen': 0.7286983728408813, 'logits/rejected': 1.4310073852539062, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0019210248719900846, 'epsilon_dpo/loss_margin_mean': 224.21632385253906, 'epsilon_dpo/beta_margin_mean': 0.4288451373577118, 'epsilon_dpo/beta_margin_std': 0.41765204071998596, 'epsilon_dpo/beta_margin_grad_mean': -0.3984293043613434, 'epsilon_dpo/beta_margin_grad_std': 0.09685565531253815, 'kl/beta': 0.0019347387133166194, 'kl/avg_steps': 0.71875, 'epoch': 0.92}
92% 625/681 [46:42<02:28, 2.66s/it] {'loss': 1.0812, 'grad_norm': 29.51067543029785, 'learning_rate': 1.0625660234518913e-08, 'rewards/chosen': -0.7215262651443481, 'rewards/rejected': -1.0983374118804932, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.37681108713150024, 'logps/chosen': -436.4429931640625, 'logps/rejected': -662.4771728515625, 'logps/ref_chosen': -58.342071533203125, 'logps/ref_rejected': -86.09038543701172, 'logits/chosen': 1.1471667289733887, 'logits/rejected': 1.666570782661438, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.001906715682707727, 'epsilon_dpo/loss_margin_mean': 198.28582763671875, 'epsilon_dpo/beta_margin_mean': 0.37681111693382263, 'epsilon_dpo/beta_margin_std': 0.39676207304000854, 'epsilon_dpo/beta_margin_grad_mean': -0.4110877215862274, 'epsilon_dpo/beta_margin_grad_std': 0.08938928693532944, 'kl/beta': 0.0019209319725632668, 'kl/avg_steps': 0.75, 'epoch': 0.92}
92% 626/681 [46:44<02:30, 2.73s/it] {'loss': 1.1565, 'grad_norm': 33.83115005493164, 'learning_rate': 1.0258649316189721e-08, 'rewards/chosen': -0.7822318077087402, 'rewards/rejected': -1.068045973777771, 'rewards/accuracies': 0.75, 'rewards/margins': 0.28581416606903076, 'logps/chosen': -486.8077392578125, 'logps/rejected': -662.682373046875, 'logps/ref_chosen': -75.11260986328125, 'logps/ref_rejected': -99.18872833251953, 'logits/chosen': 0.8864705562591553, 'logits/rejected': 1.213023066520691, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'epsilon_dpo/beta': 0.0018984805792570114, 'epsilon_dpo/loss_margin_mean': 151.79847717285156, 'epsilon_dpo/beta_margin_mean': 0.28581419587135315, 'epsilon_dpo/beta_margin_std': 0.3867412805557251, 'epsilon_dpo/beta_margin_grad_mean': -0.43188056349754333, 'epsilon_dpo/beta_margin_grad_std': 0.09042949974536896, 'kl/beta': 0.0019066323293372989, 'kl/avg_steps': 0.4375, 'epoch': 0.92}
92% 627/681 [46:47<02:28, 2.76s/it] {'loss': 1.0634, 'grad_norm': 30.784543991088867, 'learning_rate': 9.897955805412e-09, 'rewards/chosen': -0.6593598127365112, 'rewards/rejected': -1.101252794265747, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.4418928623199463, 'logps/chosen': -396.13018798828125, 'logps/rejected': -690.677001953125, 'logps/ref_chosen': -47.74314880371094, 'logps/ref_rejected': -106.75448608398438, 'logits/chosen': 1.2974317073822021, 'logits/rejected': 1.250240445137024, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0018884310266003013, 'epsilon_dpo/loss_margin_mean': 235.53546142578125, 'epsilon_dpo/beta_margin_mean': 0.4418928921222687, 'epsilon_dpo/beta_margin_std': 0.5634236335754395, 'epsilon_dpo/beta_margin_grad_mean': -0.40042755007743835, 'epsilon_dpo/beta_margin_grad_std': 0.1201905831694603, 'kl/beta': 0.0018983271438628435, 'kl/avg_steps': 0.53125, 'epoch': 0.92}
92% 628/681 [46:50<02:25, 2.74s/it] {'loss': 1.0624, 'grad_norm': 32.134822845458984, 'learning_rate': 9.543589206795238e-09, 'rewards/chosen': -0.7467862367630005, 'rewards/rejected': -1.1560611724853516, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.4092748761177063, 'logps/chosen': -457.1246032714844, 'logps/rejected': -717.7559814453125, 'logps/ref_chosen': -60.182945251464844, 'logps/ref_rejected': -101.55467224121094, 'logits/chosen': 1.1723408699035645, 'logits/rejected': 1.2950220108032227, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0018784517887979746, 'epsilon_dpo/loss_margin_mean': 219.25965881347656, 'epsilon_dpo/beta_margin_mean': 0.4092748761177063, 'epsilon_dpo/beta_margin_std': 0.4315047264099121, 'epsilon_dpo/beta_margin_grad_mean': -0.4034620523452759, 'epsilon_dpo/beta_margin_grad_std': 0.0999772772192955, 'kl/beta': 0.0018882955191656947, 'kl/avg_steps': 0.53125, 'epoch': 0.92}
92% 629/681 [46:53<02:24, 2.77s/it] {'loss': 1.1026, 'grad_norm': 30.716339111328125, 'learning_rate': 9.19555885822887e-09, 'rewards/chosen': -0.7622989416122437, 'rewards/rejected': -1.1019093990325928, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.3396104574203491, 'logps/chosen': -472.028076171875, 'logps/rejected': -682.2704467773438, 'logps/ref_chosen': -64.21353912353516, 'logps/ref_rejected': -91.65367126464844, 'logits/chosen': 1.0403006076812744, 'logits/rejected': 1.5764471292495728, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0018667641561478376, 'epsilon_dpo/loss_margin_mean': 182.80221557617188, 'epsilon_dpo/beta_margin_mean': 0.3396104574203491, 'epsilon_dpo/beta_margin_std': 0.3389366567134857, 'epsilon_dpo/beta_margin_grad_mean': -0.4184776246547699, 'epsilon_dpo/beta_margin_grad_std': 0.07906016707420349, 'kl/beta': 0.001878316979855299, 'kl/avg_steps': 0.625, 'epoch': 0.92}
93% 630/681 [46:56<02:20, 2.76s/it] {'loss': 1.1782, 'grad_norm': 22.31944465637207, 'learning_rate': 8.85387393063622e-09, 'rewards/chosen': -0.6638652086257935, 'rewards/rejected': -0.9232956171035767, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.259430468082428, 'logps/chosen': -416.2958068847656, 'logps/rejected': -581.345947265625, 'logps/ref_chosen': -59.29100036621094, 'logps/ref_rejected': -83.59829711914062, 'logits/chosen': 0.9522106647491455, 'logits/rejected': 1.4870188236236572, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0018557527801021934, 'epsilon_dpo/loss_margin_mean': 140.74285888671875, 'epsilon_dpo/beta_margin_mean': 0.2594304382801056, 'epsilon_dpo/beta_margin_std': 0.38279202580451965, 'epsilon_dpo/beta_margin_grad_mean': -0.43844300508499146, 'epsilon_dpo/beta_margin_grad_std': 0.08784260600805283, 'kl/beta': 0.0018666504183784127, 'kl/avg_steps': 0.59375, 'epoch': 0.93}
93% 631/681 [46:58<02:19, 2.79s/it] {'loss': 1.1514, 'grad_norm': 26.94605255126953, 'learning_rate': 8.518543427732949e-09, 'rewards/chosen': -0.7797050476074219, 'rewards/rejected': -1.0901647806167603, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.31045979261398315, 'logps/chosen': -480.91925048828125, 'logps/rejected': -672.0191040039062, 'logps/ref_chosen': -59.45360565185547, 'logps/ref_rejected': -80.95157623291016, 'logits/chosen': 1.2326271533966064, 'logits/rejected': 1.7843652963638306, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0018459591083228588, 'epsilon_dpo/loss_margin_mean': 169.60186767578125, 'epsilon_dpo/beta_margin_mean': 0.31045976281166077, 'epsilon_dpo/beta_margin_std': 0.4674583971500397, 'epsilon_dpo/beta_margin_grad_mean': -0.427374005317688, 'epsilon_dpo/beta_margin_grad_std': 0.10752011835575104, 'kl/beta': 0.0018556325230747461, 'kl/avg_steps': 0.53125, 'epoch': 0.93}
93% 632/681 [47:01<02:11, 2.68s/it] {'loss': 1.1232, 'grad_norm': 23.3004207611084, 'learning_rate': 8.189576185789637e-09, 'rewards/chosen': -0.7532949447631836, 'rewards/rejected': -1.0957008600234985, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.34240591526031494, 'logps/chosen': -470.71783447265625, 'logps/rejected': -683.5587158203125, 'logps/ref_chosen': -61.35155487060547, 'logps/ref_rejected': -86.16017150878906, 'logits/chosen': 1.1802589893341064, 'logits/rejected': 1.6396427154541016, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0018362043192610145, 'epsilon_dpo/loss_margin_mean': 188.03225708007812, 'epsilon_dpo/beta_margin_mean': 0.34240588545799255, 'epsilon_dpo/beta_margin_std': 0.45832404494285583, 'epsilon_dpo/beta_margin_grad_mean': -0.4185718297958374, 'epsilon_dpo/beta_margin_grad_std': 0.10718811303377151, 'kl/beta': 0.0018458266276866198, 'kl/avg_steps': 0.53125, 'epoch': 0.93}
93% 633/681 [47:03<02:06, 2.64s/it] {'loss': 1.1998, 'grad_norm': 25.7509765625, 'learning_rate': 7.866980873399015e-09, 'rewards/chosen': -0.7698875665664673, 'rewards/rejected': -1.0084669589996338, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.2385793775320053, 'logps/chosen': -478.14373779296875, 'logps/rejected': -644.4025268554688, 'logps/ref_chosen': -57.278167724609375, 'logps/ref_rejected': -91.58395385742188, 'logits/chosen': 1.1726946830749512, 'logits/rejected': 1.3165276050567627, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0018265009857714176, 'epsilon_dpo/loss_margin_mean': 131.95301818847656, 'epsilon_dpo/beta_margin_mean': 0.2385793775320053, 'epsilon_dpo/beta_margin_std': 0.39600229263305664, 'epsilon_dpo/beta_margin_grad_mean': -0.44286853075027466, 'epsilon_dpo/beta_margin_grad_std': 0.09384477883577347, 'kl/beta': 0.0018360725371167064, 'kl/avg_steps': 0.53125, 'epoch': 0.93}
93% 634/681 [47:06<02:05, 2.67s/it] {'loss': 1.1678, 'grad_norm': 20.374540328979492, 'learning_rate': 7.550765991247654e-09, 'rewards/chosen': -0.7717798948287964, 'rewards/rejected': -1.048460602760315, 'rewards/accuracies': 0.75, 'rewards/margins': 0.2766806483268738, 'logps/chosen': -490.03021240234375, 'logps/rejected': -684.000244140625, 'logps/ref_chosen': -66.61896514892578, 'logps/ref_rejected': -107.12565612792969, 'logits/chosen': 1.1397655010223389, 'logits/rejected': 1.5167808532714844, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.0018197029130533338, 'epsilon_dpo/loss_margin_mean': 153.4633026123047, 'epsilon_dpo/beta_margin_mean': 0.2766806483268738, 'epsilon_dpo/beta_margin_std': 0.4046916365623474, 'epsilon_dpo/beta_margin_grad_mean': -0.43454012274742126, 'epsilon_dpo/beta_margin_grad_std': 0.09465321153402328, 'kl/beta': 0.0018263699021190405, 'kl/avg_steps': 0.375, 'epoch': 0.93}
93% 635/681 [47:09<02:04, 2.70s/it] {'loss': 1.1695, 'grad_norm': 24.014848709106445, 'learning_rate': 7.240939871891699e-09, 'rewards/chosen': -0.7364508509635925, 'rewards/rejected': -1.0062596797943115, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2698087692260742, 'logps/chosen': -480.21636962890625, 'logps/rejected': -638.931396484375, 'logps/ref_chosen': -73.95551300048828, 'logps/ref_rejected': -82.50045776367188, 'logits/chosen': 1.0775097608566284, 'logits/rejected': 1.6850745677947998, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0018100612796843052, 'epsilon_dpo/loss_margin_mean': 150.1700897216797, 'epsilon_dpo/beta_margin_mean': 0.2698087692260742, 'epsilon_dpo/beta_margin_std': 0.3833208382129669, 'epsilon_dpo/beta_margin_grad_mean': -0.43598106503486633, 'epsilon_dpo/beta_margin_grad_std': 0.08887307345867157, 'kl/beta': 0.0018195465672761202, 'kl/avg_steps': 0.53125, 'epoch': 0.93}
93% 636/681 [47:12<02:03, 2.74s/it] {'loss': 1.1013, 'grad_norm': 24.99462127685547, 'learning_rate': 6.937510679537628e-09, 'rewards/chosen': -0.6957628726959229, 'rewards/rejected': -1.0395703315734863, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.3438073992729187, 'logps/chosen': -446.11859130859375, 'logps/rejected': -660.63134765625, 'logps/ref_chosen': -59.628910064697266, 'logps/ref_rejected': -81.97883605957031, 'logits/chosen': 1.1912219524383545, 'logits/rejected': 1.6938612461090088, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0017982334829866886, 'epsilon_dpo/loss_margin_mean': 192.16282653808594, 'epsilon_dpo/beta_margin_mean': 0.3438073992729187, 'epsilon_dpo/beta_margin_std': 0.35238152742385864, 'epsilon_dpo/beta_margin_grad_mean': -0.4176199436187744, 'epsilon_dpo/beta_margin_grad_std': 0.08222676068544388, 'kl/beta': 0.0018099313601851463, 'kl/avg_steps': 0.65625, 'epoch': 0.93}
94% 637/681 [47:14<01:59, 2.71s/it] {'loss': 1.0647, 'grad_norm': 29.209657669067383, 'learning_rate': 6.640486409826785e-09, 'rewards/chosen': -0.735283613204956, 'rewards/rejected': -1.1332693099975586, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.397985577583313, 'logps/chosen': -461.0699462890625, 'logps/rejected': -733.6002197265625, 'logps/ref_chosen': -49.652687072753906, 'logps/ref_rejected': -98.40513610839844, 'logits/chosen': 1.282531976699829, 'logits/rejected': 1.531841516494751, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.001785947591997683, 'epsilon_dpo/loss_margin_mean': 223.77783203125, 'epsilon_dpo/beta_margin_mean': 0.397985577583313, 'epsilon_dpo/beta_margin_std': 0.3974331319332123, 'epsilon_dpo/beta_margin_grad_mean': -0.40572530031204224, 'epsilon_dpo/beta_margin_grad_std': 0.0916060358285904, 'kl/beta': 0.0017981311539188027, 'kl/avg_steps': 0.6875, 'epoch': 0.94}
94% 638/681 [47:17<01:58, 2.75s/it] {'loss': 1.1051, 'grad_norm': 24.57693862915039, 'learning_rate': 6.349874889624962e-09, 'rewards/chosen': -0.6871584057807922, 'rewards/rejected': -1.028141736984253, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.34098342061042786, 'logps/chosen': -444.61712646484375, 'logps/rejected': -658.889404296875, 'logps/ref_chosen': -58.156646728515625, 'logps/ref_rejected': -79.3014907836914, 'logits/chosen': 1.1940290927886963, 'logits/rejected': 1.8498295545578003, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0017759855836629868, 'epsilon_dpo/loss_margin_mean': 193.12738037109375, 'epsilon_dpo/beta_margin_mean': 0.34098342061042786, 'epsilon_dpo/beta_margin_std': 0.3596995174884796, 'epsilon_dpo/beta_margin_grad_mean': -0.418316513299942, 'epsilon_dpo/beta_margin_grad_std': 0.08480395376682281, 'kl/beta': 0.001785853412002325, 'kl/avg_steps': 0.5625, 'epoch': 0.94}
94% 639/681 [47:20<01:54, 2.72s/it] {'loss': 1.205, 'grad_norm': 35.84690856933594, 'learning_rate': 6.065683776815933e-09, 'rewards/chosen': -0.810213029384613, 'rewards/rejected': -1.039353609085083, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.22914057970046997, 'logps/chosen': -530.078369140625, 'logps/rejected': -663.0965576171875, 'logps/ref_chosen': -72.32319641113281, 'logps/ref_rejected': -74.2749252319336, 'logits/chosen': 0.9751791954040527, 'logits/rejected': 2.1251561641693115, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.0017677166033536196, 'epsilon_dpo/loss_margin_mean': 131.0664825439453, 'epsilon_dpo/beta_margin_mean': 0.22914056479930878, 'epsilon_dpo/beta_margin_std': 0.3779585361480713, 'epsilon_dpo/beta_margin_grad_mean': -0.4446796178817749, 'epsilon_dpo/beta_margin_grad_std': 0.09073272347450256, 'kl/beta': 0.0017758641624823213, 'kl/avg_steps': 0.46875, 'epoch': 0.94}
94% 640/681 [47:23<01:53, 2.78s/it] {'loss': 1.0597, 'grad_norm': 22.977806091308594, 'learning_rate': 5.7879205600998296e-09, 'rewards/chosen': -0.7000018358230591, 'rewards/rejected': -1.1090929508209229, 'rewards/accuracies': 0.875, 'rewards/margins': 0.40909114480018616, 'logps/chosen': -454.34246826171875, 'logps/rejected': -740.6744384765625, 'logps/ref_chosen': -56.13436508178711, 'logps/ref_rejected': -108.60014343261719, 'logits/chosen': 1.0889379978179932, 'logits/rejected': 1.4330050945281982, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0017561545828357339, 'epsilon_dpo/loss_margin_mean': 233.86619567871094, 'epsilon_dpo/beta_margin_mean': 0.40909114480018616, 'epsilon_dpo/beta_margin_std': 0.42611533403396606, 'epsilon_dpo/beta_margin_grad_mean': -0.40443187952041626, 'epsilon_dpo/beta_margin_grad_std': 0.09270329028367996, 'kl/beta': 0.0017675786511972547, 'kl/avg_steps': 0.65625, 'epoch': 0.94}
94% 641/681 [47:25<01:49, 2.75s/it] {'loss': 1.1644, 'grad_norm': 22.582042694091797, 'learning_rate': 5.516592558795746e-09, 'rewards/chosen': -0.8040879964828491, 'rewards/rejected': -1.0888439416885376, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.2847559452056885, 'logps/chosen': -524.75146484375, 'logps/rejected': -711.1512451171875, 'logps/ref_chosen': -64.99689483642578, 'logps/ref_rejected': -86.99232482910156, 'logits/chosen': 1.144837737083435, 'logits/rejected': 1.7584877014160156, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0017463513650000095, 'epsilon_dpo/loss_margin_mean': 164.40440368652344, 'epsilon_dpo/beta_margin_mean': 0.2847559452056885, 'epsilon_dpo/beta_margin_std': 0.4316932260990143, 'epsilon_dpo/beta_margin_grad_mean': -0.43330150842666626, 'epsilon_dpo/beta_margin_grad_std': 0.09495817869901657, 'kl/beta': 0.0017560544656589627, 'kl/avg_steps': 0.5625, 'epoch': 0.94}
94% 642/681 [47:28<01:47, 2.75s/it] {'loss': 1.1442, 'grad_norm': 30.074098587036133, 'learning_rate': 5.251706922648868e-09, 'rewards/chosen': -0.7122145891189575, 'rewards/rejected': -1.030344843864441, 'rewards/accuracies': 0.75, 'rewards/margins': 0.318130224943161, 'logps/chosen': -474.4324951171875, 'logps/rejected': -703.515380859375, 'logps/ref_chosen': -65.68924713134766, 'logps/ref_rejected': -110.24205017089844, 'logits/chosen': 1.054837942123413, 'logits/rejected': 1.3688443899154663, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'epsilon_dpo/beta': 0.0017393117304891348, 'epsilon_dpo/loss_margin_mean': 184.53005981445312, 'epsilon_dpo/beta_margin_mean': 0.318130224943161, 'epsilon_dpo/beta_margin_std': 0.47046786546707153, 'epsilon_dpo/beta_margin_grad_mean': -0.42638909816741943, 'epsilon_dpo/beta_margin_grad_std': 0.10415295511484146, 'kl/beta': 0.001746231922879815, 'kl/avg_steps': 0.40625, 'epoch': 0.94}
94% 643/681 [47:31<01:47, 2.82s/it] {'loss': 1.1245, 'grad_norm': 19.92149543762207, 'learning_rate': 4.993270631642038e-09, 'rewards/chosen': -0.7049366235733032, 'rewards/rejected': -1.0159623622894287, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.3110256791114807, 'logps/chosen': -459.1343994140625, 'logps/rejected': -675.4744873046875, 'logps/ref_chosen': -51.94999694824219, 'logps/ref_rejected': -87.46833801269531, 'logits/chosen': 1.3465564250946045, 'logits/rejected': 1.6335021257400513, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0017290131654590368, 'epsilon_dpo/loss_margin_mean': 180.82168579101562, 'epsilon_dpo/beta_margin_mean': 0.3110256791114807, 'epsilon_dpo/beta_margin_std': 0.3235742747783661, 'epsilon_dpo/beta_margin_grad_mean': -0.4249967932701111, 'epsilon_dpo/beta_margin_grad_std': 0.07674811780452728, 'kl/beta': 0.0017391665605828166, 'kl/avg_steps': 0.59375, 'epoch': 0.94}
95% 644/681 [47:34<01:44, 2.83s/it] {'loss': 1.1544, 'grad_norm': 24.23951530456543, 'learning_rate': 4.741290495811873e-09, 'rewards/chosen': -0.704412579536438, 'rewards/rejected': -0.9988888502120972, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.29447638988494873, 'logps/chosen': -467.9019470214844, 'logps/rejected': -668.5538330078125, 'logps/ref_chosen': -59.017662048339844, 'logps/ref_rejected': -87.13668823242188, 'logits/chosen': 1.12656569480896, 'logits/rejected': 1.5934898853302002, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.0017204286996275187, 'epsilon_dpo/loss_margin_mean': 172.5329132080078, 'epsilon_dpo/beta_margin_mean': 0.29447636008262634, 'epsilon_dpo/beta_margin_std': 0.4145286977291107, 'epsilon_dpo/beta_margin_grad_mean': -0.4298701286315918, 'epsilon_dpo/beta_margin_grad_std': 0.09674103558063507, 'kl/beta': 0.0017289011739194393, 'kl/avg_steps': 0.5, 'epoch': 0.95}
95% 645/681 [47:37<01:41, 2.81s/it] {'loss': 1.2067, 'grad_norm': 33.30598068237305, 'learning_rate': 4.495773155069299e-09, 'rewards/chosen': -0.7081266641616821, 'rewards/rejected': -0.9458829164505005, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.23775628209114075, 'logps/chosen': -468.2938232421875, 'logps/rejected': -650.449951171875, 'logps/ref_chosen': -55.87602233886719, 'logps/ref_rejected': -97.78080749511719, 'logits/chosen': 1.328917145729065, 'logits/rejected': 1.2759854793548584, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.001714020036160946, 'epsilon_dpo/loss_margin_mean': 140.25137329101562, 'epsilon_dpo/beta_margin_mean': 0.23775628209114075, 'epsilon_dpo/beta_margin_std': 0.4299026131629944, 'epsilon_dpo/beta_margin_grad_mean': -0.44414132833480835, 'epsilon_dpo/beta_margin_grad_std': 0.10007171332836151, 'kl/beta': 0.0017202997114509344, 'kl/avg_steps': 0.375, 'epoch': 0.95}
95% 646/681 [47:39<01:36, 2.75s/it] {'loss': 1.1482, 'grad_norm': 27.597734451293945, 'learning_rate': 4.256725079024553e-09, 'rewards/chosen': -0.7204253673553467, 'rewards/rejected': -1.01108980178833, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.2906644940376282, 'logps/chosen': -483.1705627441406, 'logps/rejected': -671.0547485351562, 'logps/ref_chosen': -61.275787353515625, 'logps/ref_rejected': -77.50580596923828, 'logits/chosen': 1.1945412158966064, 'logits/rejected': 1.949883222579956, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0017049381276592612, 'epsilon_dpo/loss_margin_mean': 171.6541748046875, 'epsilon_dpo/beta_margin_mean': 0.2906644940376282, 'epsilon_dpo/beta_margin_std': 0.3636190891265869, 'epsilon_dpo/beta_margin_grad_mean': -0.43052205443382263, 'epsilon_dpo/beta_margin_grad_std': 0.08512377738952637, 'kl/beta': 0.001713872654363513, 'kl/avg_steps': 0.53125, 'epoch': 0.95}
95% 647/681 [47:42<01:34, 2.78s/it] {'loss': 1.1495, 'grad_norm': 26.752613067626953, 'learning_rate': 4.024152566816791e-09, 'rewards/chosen': -0.6383639574050903, 'rewards/rejected': -0.9308550953865051, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.2924911379814148, 'logps/chosen': -430.82135009765625, 'logps/rejected': -643.2625732421875, 'logps/ref_chosen': -54.852413177490234, 'logps/ref_rejected': -93.5194091796875, 'logits/chosen': 1.186145544052124, 'logits/rejected': 1.2575244903564453, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0016948629636317492, 'epsilon_dpo/loss_margin_mean': 173.77426147460938, 'epsilon_dpo/beta_margin_mean': 0.2924911379814148, 'epsilon_dpo/beta_margin_std': 0.37851130962371826, 'epsilon_dpo/beta_margin_grad_mean': -0.4299209713935852, 'epsilon_dpo/beta_margin_grad_std': 0.08932579308748245, 'kl/beta': 0.001704815891571343, 'kl/avg_steps': 0.59375, 'epoch': 0.95}
95% 648/681 [47:45<01:28, 2.70s/it] {'loss': 1.0918, 'grad_norm': 22.86119842529297, 'learning_rate': 3.798061746947995e-09, 'rewards/chosen': -0.6989763975143433, 'rewards/rejected': -1.0812969207763672, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.3823206424713135, 'logps/chosen': -468.0401611328125, 'logps/rejected': -740.9541625976562, 'logps/ref_chosen': -54.17146682739258, 'logps/ref_rejected': -98.71279907226562, 'logits/chosen': 1.4940239191055298, 'logits/rejected': 1.5447354316711426, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.001684859162196517, 'epsilon_dpo/loss_margin_mean': 228.3726806640625, 'epsilon_dpo/beta_margin_mean': 0.3823206424713135, 'epsilon_dpo/beta_margin_std': 0.47271671891212463, 'epsilon_dpo/beta_margin_grad_mean': -0.41094687581062317, 'epsilon_dpo/beta_margin_grad_std': 0.10572434961795807, 'kl/beta': 0.0016947533003985882, 'kl/avg_steps': 0.59375, 'epoch': 0.95}
95% 649/681 [47:48<01:27, 2.74s/it] {'loss': 1.1922, 'grad_norm': 20.313772201538086, 'learning_rate': 3.5784585771215235e-09, 'rewards/chosen': -0.7163372039794922, 'rewards/rejected': -0.9623730182647705, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.24603573977947235, 'logps/chosen': -488.5963134765625, 'logps/rejected': -654.1318969726562, 'logps/ref_chosen': -62.4803466796875, 'logps/ref_rejected': -80.07717895507812, 'logits/chosen': 1.1369266510009766, 'logits/rejected': 1.797896146774292, 'kl/p_epsilon_steps': 0.65625, 'kl/n_epsilon_steps': 0.34375, 'epsilon_dpo/beta': 0.0016796531854197383, 'epsilon_dpo/loss_margin_mean': 147.93875122070312, 'epsilon_dpo/beta_margin_mean': 0.24603573977947235, 'epsilon_dpo/beta_margin_std': 0.3954623341560364, 'epsilon_dpo/beta_margin_grad_mean': -0.44180822372436523, 'epsilon_dpo/beta_margin_grad_std': 0.09041650593280792, 'kl/beta': 0.0016847500810399652, 'kl/avg_steps': 0.3125, 'epoch': 0.95}
95% 650/681 [47:50<01:22, 2.67s/it] {'loss': 1.0995, 'grad_norm': 23.547075271606445, 'learning_rate': 3.3653488440851253e-09, 'rewards/chosen': -0.6900794506072998, 'rewards/rejected': -1.0517977476119995, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.36171823740005493, 'logps/chosen': -468.0220947265625, 'logps/rejected': -728.15625, 'logps/ref_chosen': -56.09281921386719, 'logps/ref_rejected': -98.26483917236328, 'logits/chosen': 1.4009277820587158, 'logits/rejected': 1.5325489044189453, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0016707462491467595, 'epsilon_dpo/loss_margin_mean': 217.962158203125, 'epsilon_dpo/beta_margin_mean': 0.36171820759773254, 'epsilon_dpo/beta_margin_std': 0.42442724108695984, 'epsilon_dpo/beta_margin_grad_mean': -0.4147437810897827, 'epsilon_dpo/beta_margin_grad_std': 0.09809747338294983, 'kl/beta': 0.0016795016126707196, 'kl/avg_steps': 0.53125, 'epoch': 0.95}
96% 651/681 [47:53<01:19, 2.67s/it] {'loss': 1.078, 'grad_norm': 26.650527954101562, 'learning_rate': 3.158738163478475e-09, 'rewards/chosen': -0.5953332185745239, 'rewards/rejected': -0.967962920665741, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.37262973189353943, 'logps/chosen': -401.56536865234375, 'logps/rejected': -683.5089111328125, 'logps/ref_chosen': -43.42544937133789, 'logps/ref_rejected': -99.9579086303711, 'logits/chosen': 1.4459567070007324, 'logits/rejected': 1.4230878353118896, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0016598289366811514, 'epsilon_dpo/loss_margin_mean': 225.4111328125, 'epsilon_dpo/beta_margin_mean': 0.37262973189353943, 'epsilon_dpo/beta_margin_std': 0.35590580105781555, 'epsilon_dpo/beta_margin_grad_mean': -0.41106364130973816, 'epsilon_dpo/beta_margin_grad_std': 0.08254638314247131, 'kl/beta': 0.0016706264577805996, 'kl/avg_steps': 0.65625, 'epoch': 0.96}
96% 652/681 [47:56<01:18, 2.71s/it] {'loss': 1.1318, 'grad_norm': 22.533578872680664, 'learning_rate': 2.9586319796851555e-09, 'rewards/chosen': -0.6445059776306152, 'rewards/rejected': -0.9546687602996826, 'rewards/accuracies': 0.875, 'rewards/margins': 0.31016284227371216, 'logps/chosen': -452.76849365234375, 'logps/rejected': -690.8995361328125, 'logps/ref_chosen': -62.57680892944336, 'logps/ref_rejected': -111.76779174804688, 'logits/chosen': 0.9193169474601746, 'logits/rejected': 1.0220489501953125, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0016495260642841458, 'epsilon_dpo/loss_margin_mean': 188.94003295898438, 'epsilon_dpo/beta_margin_mean': 0.31016284227371216, 'epsilon_dpo/beta_margin_std': 0.36632609367370605, 'epsilon_dpo/beta_margin_grad_mean': -0.42614489793777466, 'epsilon_dpo/beta_margin_grad_std': 0.08484771847724915, 'kl/beta': 0.001659734407439828, 'kl/avg_steps': 0.625, 'epoch': 0.96}
96% 653/681 [47:58<01:15, 2.69s/it] {'loss': 1.1119, 'grad_norm': 28.52092933654785, 'learning_rate': 2.7650355656892166e-09, 'rewards/chosen': -0.7029273509979248, 'rewards/rejected': -1.038485050201416, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.335557758808136, 'logps/chosen': -489.69744873046875, 'logps/rejected': -737.5400390625, 'logps/ref_chosen': -61.11295700073242, 'logps/ref_rejected': -103.24960327148438, 'logits/chosen': 1.114861011505127, 'logits/rejected': 1.439771056175232, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0016382494941353798, 'epsilon_dpo/loss_margin_mean': 205.70590209960938, 'epsilon_dpo/beta_margin_mean': 0.335557758808136, 'epsilon_dpo/beta_margin_std': 0.37546736001968384, 'epsilon_dpo/beta_margin_grad_mean': -0.42025431990623474, 'epsilon_dpo/beta_margin_grad_std': 0.08646851778030396, 'kl/beta': 0.0016494254814460874, 'kl/avg_steps': 0.6875, 'epoch': 0.96}
96% 654/681 [48:01<01:12, 2.68s/it] {'loss': 1.1746, 'grad_norm': 23.749624252319336, 'learning_rate': 2.577954022936174e-09, 'rewards/chosen': -0.7043324112892151, 'rewards/rejected': -0.9723464846611023, 'rewards/accuracies': 0.75, 'rewards/margins': 0.2680141031742096, 'logps/chosen': -492.3443908691406, 'logps/rejected': -695.1796875, 'logps/ref_chosen': -61.7281379699707, 'logps/ref_rejected': -98.7738037109375, 'logits/chosen': 1.2121856212615967, 'logits/rejected': 1.547875165939331, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'epsilon_dpo/beta': 0.0016321832081303, 'epsilon_dpo/loss_margin_mean': 165.78965759277344, 'epsilon_dpo/beta_margin_mean': 0.2680140733718872, 'epsilon_dpo/beta_margin_std': 0.4029082953929901, 'epsilon_dpo/beta_margin_grad_mean': -0.4364525079727173, 'epsilon_dpo/beta_margin_grad_std': 0.09288126975297928, 'kl/beta': 0.0016381631139665842, 'kl/avg_steps': 0.375, 'epoch': 0.96}
96% 655/681 [48:03<01:09, 2.66s/it] {'loss': 1.1333, 'grad_norm':
30.1119384765625, 'learning_rate': 2.397392281198729e-09, 'rewards/chosen': -0.6509025692939758, 'rewards/rejected': -0.9643889665603638, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.31348636746406555, 'logps/chosen': -450.25946044921875, 'logps/rejected': -693.3289794921875, 'logps/ref_chosen': -49.576812744140625, 'logps/ref_rejected': -98.29183197021484, 'logits/chosen': 1.3309032917022705, 'logits/rejected': 1.3572744131088257, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0016220049001276493, 'epsilon_dpo/loss_margin_mean': 194.35455322265625, 'epsilon_dpo/beta_margin_mean': 0.31348636746406555, 'epsilon_dpo/beta_margin_std': 0.3893417716026306, 'epsilon_dpo/beta_margin_grad_mean': -0.42558589577674866, 'epsilon_dpo/beta_margin_grad_std': 0.09087540209293365, 'kl/beta': 0.0016320429276674986, 'kl/avg_steps': 0.625, 'epoch': 0.96} 96%|███████████████████████████████████████████████████████████████████████████ | 655/681 [48:03<01:09, 2.66s/it] 96%|███████████████████████████████████████████████████████████████████████████▏ | 656/681 [48:07<01:09, 2.79s/it] {'loss': 0.9972, 'grad_norm': 34.51503372192383, 'learning_rate': 2.223355098446622e-09, 'rewards/chosen': -0.6157395839691162, 'rewards/rejected': -1.0845030546188354, 'rewards/accuracies': 0.9375, 'rewards/margins': 0.46876341104507446, 'logps/chosen': -435.2579345703125, 'logps/rejected': -788.3082275390625, 'logps/ref_chosen': -52.54943084716797, 'logps/ref_rejected': -113.67464447021484, 'logits/chosen': 1.2120656967163086, 'logits/rejected': 1.238845944404602, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.09375, 'epsilon_dpo/beta': 0.0016091398429125547, 'epsilon_dpo/loss_margin_mean': 291.9250793457031, 'epsilon_dpo/beta_margin_mean': 0.46876344084739685, 'epsilon_dpo/beta_margin_std': 0.32952094078063965, 'epsilon_dpo/beta_margin_grad_mean': -0.387954443693161, 'epsilon_dpo/beta_margin_grad_std': 0.07566812634468079, 'kl/beta': 
0.0016219060635194182, 'kl/avg_steps': 0.796875, 'epoch': 0.96} 96%|███████████████████████████████████████████████████████████████████████████▏ | 656/681 [48:07<01:09, 2.79s/it] 96%|███████████████████████████████████████████████████████████████████████████▎ | 657/681 [48:09<01:04, 2.70s/it] {'loss': 1.0724, 'grad_norm': 32.31270980834961, 'learning_rate': 2.055847060721566e-09, 'rewards/chosen': -0.6334355473518372, 'rewards/rejected': -1.0120875835418701, 'rewards/accuracies': 0.875, 'rewards/margins': 0.37865209579467773, 'logps/chosen': -442.75128173828125, 'logps/rejected': -731.791259765625, 'logps/ref_chosen': -46.700538635253906, 'logps/ref_rejected': -97.91487121582031, 'logits/chosen': 1.4408583641052246, 'logits/rejected': 1.516934871673584, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0015976781724020839, 'epsilon_dpo/loss_margin_mean': 237.8256378173828, 'epsilon_dpo/beta_margin_mean': 0.3786521255970001, 'epsilon_dpo/beta_margin_std': 0.35267093777656555, 'epsilon_dpo/beta_margin_grad_mean': -0.40962353348731995, 'epsilon_dpo/beta_margin_grad_std': 0.08101543039083481, 'kl/beta': 0.0016090837307274342, 'kl/avg_steps': 0.71875, 'epoch': 0.96} 96%|███████████████████████████████████████████████████████████████████████████▎ | 657/681 [48:09<01:04, 2.70s/it] 97%|███████████████████████████████████████████████████████████████████████████▎ | 658/681 [48:11<01:00, 2.62s/it] {'loss': 1.1358, 'grad_norm': 29.608266830444336, 'learning_rate': 1.8948725820160662e-09, 'rewards/chosen': -0.6918191909790039, 'rewards/rejected': -0.988710880279541, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.2968916893005371, 'logps/chosen': -495.99505615234375, 'logps/rejected': -719.032470703125, 'logps/ref_chosen': -60.958213806152344, 'logps/ref_rejected': -95.93949127197266, 'logits/chosen': 1.3499019145965576, 'logits/rejected': 1.6195529699325562, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 
'epsilon_dpo/beta': 0.0015887732151895761, 'epsilon_dpo/loss_margin_mean': 188.05613708496094, 'epsilon_dpo/beta_margin_mean': 0.2968916893005371, 'epsilon_dpo/beta_margin_std': 0.3177672028541565, 'epsilon_dpo/beta_margin_grad_mean': -0.428006112575531, 'epsilon_dpo/beta_margin_grad_std': 0.07601499557495117, 'kl/beta': 0.0015976008726283908, 'kl/avg_steps': 0.5625, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▎ | 658/681 [48:11<01:00, 2.62s/it] 97%|███████████████████████████████████████████████████████████████████████████▍ | 659/681 [48:14<00:59, 2.69s/it] {'loss': 1.1164, 'grad_norm': 21.250402450561523, 'learning_rate': 1.7404359041573723e-09, 'rewards/chosen': -0.6578041315078735, 'rewards/rejected': -0.9832782745361328, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.32547420263290405, 'logps/chosen': -492.78631591796875, 'logps/rejected': -710.77294921875, 'logps/ref_chosen': -76.74298095703125, 'logps/ref_rejected': -87.4709701538086, 'logits/chosen': 0.7977874279022217, 'logits/rejected': 1.7436567544937134, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.0015798864187672734, 'epsilon_dpo/loss_margin_mean': 207.2586212158203, 'epsilon_dpo/beta_margin_mean': 0.32547417283058167, 'epsilon_dpo/beta_margin_std': 0.35009095072746277, 'epsilon_dpo/beta_margin_grad_mean': -0.4219341576099396, 'epsilon_dpo/beta_margin_grad_std': 0.08217877149581909, 'kl/beta': 0.0015886647161096334, 'kl/avg_steps': 0.5625, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▍ | 659/681 [48:14<00:59, 2.69s/it] 97%|███████████████████████████████████████████████████████████████████████████▌ | 660/681 [48:17<00:55, 2.64s/it] {'loss': 1.1002, 'grad_norm': 19.845855712890625, 'learning_rate': 1.592541096695571e-09, 'rewards/chosen': -0.6557613611221313, 'rewards/rejected': -0.9964351058006287, 'rewards/accuracies': 0.875, 'rewards/margins': 
0.3406737148761749, 'logps/chosen': -476.62286376953125, 'logps/rejected': -711.6376953125, 'logps/ref_chosen': -59.047882080078125, 'logps/ref_rejected': -75.96005249023438, 'logits/chosen': 1.3539650440216064, 'logits/rejected': 2.1877787113189697, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0015685806283727288, 'epsilon_dpo/loss_margin_mean': 218.10269165039062, 'epsilon_dpo/beta_margin_mean': 0.3406737148761749, 'epsilon_dpo/beta_margin_std': 0.32832127809524536, 'epsilon_dpo/beta_margin_grad_mean': -0.417694091796875, 'epsilon_dpo/beta_margin_grad_std': 0.07718788832426071, 'kl/beta': 0.001579778385348618, 'kl/avg_steps': 0.71875, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▌ | 660/681 [48:17<00:55, 2.64s/it] 97%|███████████████████████████████████████████████████████████████████████████▋ | 661/681 [48:19<00:50, 2.53s/it] {'loss': 1.1278, 'grad_norm': 21.829837799072266, 'learning_rate': 1.4511920567963908e-09, 'rewards/chosen': -0.6036039590835571, 'rewards/rejected': -0.9187537431716919, 'rewards/accuracies': 0.875, 'rewards/margins': 0.31514978408813477, 'logps/chosen': -437.73468017578125, 'logps/rejected': -676.3663330078125, 'logps/ref_chosen': -50.673973083496094, 'logps/ref_rejected': -86.00569152832031, 'logits/chosen': 1.1137815713882446, 'logits/rejected': 1.8122587203979492, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0015568967210128903, 'epsilon_dpo/loss_margin_mean': 203.2999267578125, 'epsilon_dpo/beta_margin_mean': 0.31514978408813477, 'epsilon_dpo/beta_margin_std': 0.3677848279476166, 'epsilon_dpo/beta_margin_grad_mean': -0.4247366786003113, 'epsilon_dpo/beta_margin_grad_std': 0.08477568626403809, 'kl/beta': 0.0015685048419982195, 'kl/avg_steps': 0.75, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▋ | 661/681 [48:19<00:50, 2.53s/it] 
97%|███████████████████████████████████████████████████████████████████████████▊ | 662/681 [48:22<00:49, 2.62s/it] {'loss': 1.1641, 'grad_norm': 19.571136474609375, 'learning_rate': 1.3163925091384532e-09, 'rewards/chosen': -0.6591126918792725, 'rewards/rejected': -0.9289331436157227, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2698204517364502, 'logps/chosen': -493.9742431640625, 'logps/rejected': -689.3358154296875, 'logps/ref_chosen': -69.26106262207031, 'logps/ref_rejected': -89.05593872070312, 'logits/chosen': 1.0110704898834229, 'logits/rejected': 1.527016282081604, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.00154968595597893, 'epsilon_dpo/loss_margin_mean': 175.56666564941406, 'epsilon_dpo/beta_margin_mean': 0.2698204517364502, 'epsilon_dpo/beta_margin_std': 0.3489322364330292, 'epsilon_dpo/beta_margin_grad_mean': -0.4349568486213684, 'epsilon_dpo/beta_margin_grad_std': 0.08338849246501923, 'kl/beta': 0.0015568286180496216, 'kl/avg_steps': 0.46875, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▊ | 662/681 [48:22<00:49, 2.62s/it] 97%|███████████████████████████████████████████████████████████████████████████▉ | 663/681 [48:25<00:48, 2.70s/it] {'loss': 1.1277, 'grad_norm': 21.666109085083008, 'learning_rate': 1.1881460058152382e-09, 'rewards/chosen': -0.6028608083724976, 'rewards/rejected': -0.9170067310333252, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.31414592266082764, 'logps/chosen': -456.033447265625, 'logps/rejected': -710.153564453125, 'logps/ref_chosen': -64.87891387939453, 'logps/ref_rejected': -113.92536926269531, 'logits/chosen': 0.893703043460846, 'logits/rejected': 0.9986363053321838, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0015390656189993024, 'epsilon_dpo/loss_margin_mean': 205.07363891601562, 'epsilon_dpo/beta_margin_mean': 0.31414592266082764, 'epsilon_dpo/beta_margin_std': 
0.36153239011764526, 'epsilon_dpo/beta_margin_grad_mean': -0.42479458451271057, 'epsilon_dpo/beta_margin_grad_std': 0.08391547948122025, 'kl/beta': 0.0015495650004595518, 'kl/avg_steps': 0.6875, 'epoch': 0.97} 97%|███████████████████████████████████████████████████████████████████████████▉ | 663/681 [48:25<00:48, 2.70s/it] 98%|████████████████████████████████████████████████████████████████████████████ | 664/681 [48:28<00:46, 2.74s/it] {'loss': 1.1214, 'grad_norm': 21.28402328491211, 'learning_rate': 1.066455926241383e-09, 'rewards/chosen': -0.6699653267860413, 'rewards/rejected': -0.9948194622993469, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.3248541057109833, 'logps/chosen': -497.7940673828125, 'logps/rejected': -756.0826416015625, 'logps/ref_chosen': -60.88847351074219, 'logps/ref_rejected': -105.521728515625, 'logits/chosen': 1.216430902481079, 'logits/rejected': 1.6328184604644775, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'epsilon_dpo/beta': 0.001530480687506497, 'epsilon_dpo/loss_margin_mean': 213.6553497314453, 'epsilon_dpo/beta_margin_mean': 0.3248541057109833, 'epsilon_dpo/beta_margin_std': 0.37472057342529297, 'epsilon_dpo/beta_margin_grad_mean': -0.4216407239437103, 'epsilon_dpo/beta_margin_grad_std': 0.08829301595687866, 'kl/beta': 0.001538984477519989, 'kl/avg_steps': 0.5625, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████ | 664/681 [48:28<00:46, 2.74s/it] 98%|████████████████████████████████████████████████████████████████████████████▏ | 665/681 [48:30<00:42, 2.66s/it] {'loss': 1.1182, 'grad_norm': 20.151309967041016, 'learning_rate': 9.513254770636137e-10, 'rewards/chosen': -0.645055890083313, 'rewards/rejected': -0.962983250617981, 'rewards/accuracies': 0.875, 'rewards/margins': 0.3179273307323456, 'logps/chosen': -484.7784423828125, 'logps/rejected': -719.0540771484375, 'logps/ref_chosen': -60.56413269042969, 'logps/ref_rejected': -84.8088150024414, 'logits/chosen': 
1.220696210861206, 'logits/rejected': 1.80734384059906, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0015195284504443407, 'epsilon_dpo/loss_margin_mean': 210.031005859375, 'epsilon_dpo/beta_margin_mean': 0.3179273307323456, 'epsilon_dpo/beta_margin_std': 0.32217973470687866, 'epsilon_dpo/beta_margin_grad_mean': -0.4235166311264038, 'epsilon_dpo/beta_margin_grad_std': 0.07510911673307419, 'kl/beta': 0.0015303761465474963, 'kl/avg_steps': 0.71875, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▏ | 665/681 [48:30<00:42, 2.66s/it] 98%|████████████████████████████████████████████████████████████████████████████▎ | 666/681 [48:33<00:41, 2.75s/it] {'loss': 1.1198, 'grad_norm': 21.010101318359375, 'learning_rate': 8.427576920763956e-10, 'rewards/chosen': -0.6052336692810059, 'rewards/rejected': -0.9149041175842285, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.30967044830322266, 'logps/chosen': -465.0032958984375, 'logps/rejected': -702.5185546875, 'logps/ref_chosen': -64.41996002197266, 'logps/ref_rejected': -95.89163208007812, 'logits/chosen': 1.0827593803405762, 'logits/rejected': 1.5170735120773315, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.0015096345450729132, 'epsilon_dpo/loss_margin_mean': 206.04354858398438, 'epsilon_dpo/beta_margin_mean': 0.30967044830322266, 'epsilon_dpo/beta_margin_std': 0.2827000617980957, 'epsilon_dpo/beta_margin_grad_mean': -0.4246806502342224, 'epsilon_dpo/beta_margin_grad_std': 0.06764908134937286, 'kl/beta': 0.001519454992376268, 'kl/avg_steps': 0.65625, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▎ | 666/681 [48:33<00:41, 2.75s/it] 98%|████████████████████████████████████████████████████████████████████████████▍ | 667/681 [48:36<00:38, 2.76s/it] {'loss': 1.1422, 'grad_norm': 27.546092987060547, 'learning_rate': 7.407554321417764e-10, 'rewards/chosen': 
-0.6668318510055542, 'rewards/rejected': -0.9519304037094116, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.28509852290153503, 'logps/chosen': -513.462890625, 'logps/rejected': -723.0411987304688, 'logps/ref_chosen': -69.27703094482422, 'logps/ref_rejected': -87.83549499511719, 'logits/chosen': 1.0864355564117432, 'logits/rejected': 1.7724149227142334, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.15625, 'epsilon_dpo/beta': 0.0014995539095252752, 'epsilon_dpo/loss_margin_mean': 191.0198516845703, 'epsilon_dpo/beta_margin_mean': 0.28509852290153503, 'epsilon_dpo/beta_margin_std': 0.2928607165813446, 'epsilon_dpo/beta_margin_grad_mean': -0.430759996175766, 'epsilon_dpo/beta_margin_grad_std': 0.07021576911211014, 'kl/beta': 0.001509548630565405, 'kl/avg_steps': 0.671875, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▍ | 667/681 [48:36<00:38, 2.76s/it] 98%|████████████████████████████████████████████████████████████████████████████▌ | 668/681 [48:39<00:36, 2.83s/it] {'loss': 1.2058, 'grad_norm': 32.91999816894531, 'learning_rate': 6.453213851142225e-10, 'rewards/chosen': -0.6804643869400024, 'rewards/rejected': -0.903580367565155, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.22311602532863617, 'logps/chosen': -527.62158203125, 'logps/rejected': -709.686279296875, 'logps/ref_chosen': -72.60400390625, 'logps/ref_rejected': -103.73905181884766, 'logits/chosen': 1.0455001592636108, 'logits/rejected': 1.1969373226165771, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.001492125797085464, 'epsilon_dpo/loss_margin_mean': 150.9296875, 'epsilon_dpo/beta_margin_mean': 0.22311602532863617, 'epsilon_dpo/beta_margin_std': 0.3523963391780853, 'epsilon_dpo/beta_margin_grad_mean': -0.44598451256752014, 'epsilon_dpo/beta_margin_grad_std': 0.08418892323970795, 'kl/beta': 0.001499474048614502, 'kl/avg_steps': 0.5, 'epoch': 0.98} 
98%|████████████████████████████████████████████████████████████████████████████▌ | 668/681 [48:39<00:36, 2.83s/it] 98%|████████████████████████████████████████████████████████████████████████████▋ | 669/681 [48:42<00:34, 2.88s/it] {'loss': 1.1146, 'grad_norm': 18.236900329589844, 'learning_rate': 5.564580657695939e-10, 'rewards/chosen': -0.5582731366157532, 'rewards/rejected': -0.8860403299331665, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.3277672529220581, 'logps/chosen': -422.7698974609375, 'logps/rejected': -676.8389892578125, 'logps/ref_chosen': -46.116416931152344, 'logps/ref_rejected': -77.92434692382812, 'logits/chosen': 1.1504340171813965, 'logits/rejected': 1.864037275314331, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'epsilon_dpo/beta': 0.0014805055689066648, 'epsilon_dpo/loss_margin_mean': 222.26109313964844, 'epsilon_dpo/beta_margin_mean': 0.3277672231197357, 'epsilon_dpo/beta_margin_std': 0.34995928406715393, 'epsilon_dpo/beta_margin_grad_mean': -0.42099520564079285, 'epsilon_dpo/beta_margin_grad_std': 0.08258918672800064, 'kl/beta': 0.0014920139219611883, 'kl/avg_steps': 0.78125, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▋ | 669/681 [48:42<00:34, 2.88s/it] 98%|████████████████████████████████████████████████████████████████████████████▋ | 670/681 [48:45<00:31, 2.82s/it] {'loss': 1.0996, 'grad_norm': 17.821779251098633, 'learning_rate': 4.741678157389739e-10, 'rewards/chosen': -0.5237119793891907, 'rewards/rejected': -0.8698786497116089, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.3461666703224182, 'logps/chosen': -417.00958251953125, 'logps/rejected': -688.0401611328125, 'logps/ref_chosen': -62.34575653076172, 'logps/ref_rejected': -96.9405517578125, 'logits/chosen': 1.1266391277313232, 'logits/rejected': 1.5027225017547607, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'epsilon_dpo/beta': 0.0014731930568814278, 'epsilon_dpo/loss_margin_mean': 
236.43572998046875, 'epsilon_dpo/beta_margin_mean': 0.3461667001247406, 'epsilon_dpo/beta_margin_std': 0.35282188653945923, 'epsilon_dpo/beta_margin_grad_mean': -0.4170011579990387, 'epsilon_dpo/beta_margin_grad_std': 0.0830448642373085, 'kl/beta': 0.0014804479433223605, 'kl/avg_steps': 0.5, 'epoch': 0.98} 98%|████████████████████████████████████████████████████████████████████████████▋ | 670/681 [48:45<00:31, 2.82s/it] 99%|████████████████████████████████████████████████████████████████████████████▊ | 671/681 [48:47<00:27, 2.72s/it] {'loss': 1.1511, 'grad_norm': 22.806631088256836, 'learning_rate': 3.9845280344705245e-10, 'rewards/chosen': -0.6287015676498413, 'rewards/rejected': -0.9122143387794495, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.28351277112960815, 'logps/chosen': -476.3839111328125, 'logps/rejected': -707.0560302734375, 'logps/ref_chosen': -48.00010681152344, 'logps/ref_rejected': -83.81932067871094, 'logits/chosen': 1.5543603897094727, 'logits/rejected': 1.8255066871643066, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.001465403358452022, 'epsilon_dpo/loss_margin_mean': 194.85296630859375, 'epsilon_dpo/beta_margin_mean': 0.28351274132728577, 'epsilon_dpo/beta_margin_std': 0.3416160047054291, 'epsilon_dpo/beta_margin_grad_mean': -0.4314085841178894, 'epsilon_dpo/beta_margin_grad_std': 0.08191683143377304, 'kl/beta': 0.001473082578741014, 'kl/avg_steps': 0.53125, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████▊ | 671/681 [48:47<00:27, 2.72s/it] 99%|████████████████████████████████████████████████████████████████████████████▉ | 672/681 [48:50<00:24, 2.69s/it] {'loss': 1.1769, 'grad_norm': 22.33818244934082, 'learning_rate': 3.293150240547549e-10, 'rewards/chosen': -0.6801042556762695, 'rewards/rejected': -0.9327975511550903, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.2526932656764984, 'logps/chosen': -524.9901123046875, 'logps/rejected': 
-734.224365234375, 'logps/ref_chosen': -58.583290100097656, 'logps/ref_rejected': -93.14014434814453, 'logits/chosen': 1.262594223022461, 'logits/rejected': 1.5828099250793457, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.0014567435719072819, 'epsilon_dpo/loss_margin_mean': 174.6773223876953, 'epsilon_dpo/beta_margin_mean': 0.2526932656764984, 'epsilon_dpo/beta_margin_std': 0.336892694234848, 'epsilon_dpo/beta_margin_grad_mean': -0.4389883875846863, 'epsilon_dpo/beta_margin_grad_std': 0.08010012656450272, 'kl/beta': 0.0014652981190010905, 'kl/avg_steps': 0.59375, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████▉ | 672/681 [48:50<00:24, 2.69s/it] 99%|█████████████████████████████████████████████████████████████████████████████ | 673/681 [48:52<00:20, 2.60s/it] {'loss': 1.1214, 'grad_norm': 25.738595962524414, 'learning_rate': 2.6675629940689504e-10, 'rewards/chosen': -0.5871830582618713, 'rewards/rejected': -0.8955413699150085, 'rewards/accuracies': 0.875, 'rewards/margins': 0.30835825204849243, 'logps/chosen': -452.23095703125, 'logps/rejected': -704.8712768554688, 'logps/ref_chosen': -46.72320556640625, 'logps/ref_rejected': -85.29623413085938, 'logits/chosen': 1.4295220375061035, 'logits/rejected': 1.8196861743927002, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'epsilon_dpo/beta': 0.0014463242841884494, 'epsilon_dpo/loss_margin_mean': 214.06727600097656, 'epsilon_dpo/beta_margin_mean': 0.3083582818508148, 'epsilon_dpo/beta_margin_std': 0.2860918939113617, 'epsilon_dpo/beta_margin_grad_mean': -0.42500680685043335, 'epsilon_dpo/beta_margin_grad_std': 0.06854859739542007, 'kl/beta': 0.001456649275496602, 'kl/avg_steps': 0.71875, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████ | 673/681 [48:52<00:20, 2.60s/it] 99%|█████████████████████████████████████████████████████████████████████████████▏| 674/681 
[48:55<00:18, 2.66s/it] {'loss': 1.0994, 'grad_norm': 17.829362869262695, 'learning_rate': 2.1077827798404725e-10, 'rewards/chosen': -0.5376299619674683, 'rewards/rejected': -0.8791646957397461, 'rewards/accuracies': 0.875, 'rewards/margins': 0.34153473377227783, 'logps/chosen': -419.6162109375, 'logps/rejected': -682.877685546875, 'logps/ref_chosen': -45.445526123046875, 'logps/ref_rejected': -70.04593658447266, 'logits/chosen': 1.5478763580322266, 'logits/rejected': 2.0943801403045654, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'epsilon_dpo/beta': 0.0014355509774759412, 'epsilon_dpo/loss_margin_mean': 238.66107177734375, 'epsilon_dpo/beta_margin_mean': 0.34153473377227783, 'epsilon_dpo/beta_margin_std': 0.3308585286140442, 'epsilon_dpo/beta_margin_grad_mean': -0.4180900752544403, 'epsilon_dpo/beta_margin_grad_std': 0.07587840408086777, 'kl/beta': 0.0014462543185800314, 'kl/avg_steps': 0.75, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████▏| 674/681 [48:55<00:18, 2.66s/it] 99%|█████████████████████████████████████████████████████████████████████████████▎| 675/681 [48:57<00:15, 2.59s/it] {'loss': 1.0892, 'grad_norm': 16.485437393188477, 'learning_rate': 1.6138243485910863e-10, 'rewards/chosen': -0.5436429977416992, 'rewards/rejected': -0.8952072858810425, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.35156434774398804, 'logps/chosen': -425.02178955078125, 'logps/rejected': -702.3570556640625, 'logps/ref_chosen': -44.17628479003906, 'logps/ref_rejected': -74.09197998046875, 'logits/chosen': 1.6052007675170898, 'logits/rejected': 2.07883358001709, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.001426659058779478, 'epsilon_dpo/loss_margin_mean': 247.4196014404297, 'epsilon_dpo/beta_margin_mean': 0.35156434774398804, 'epsilon_dpo/beta_margin_std': 0.31687280535697937, 'epsilon_dpo/beta_margin_grad_mean': -0.41543251276016235, 'epsilon_dpo/beta_margin_grad_std': 
0.0737193152308464, 'kl/beta': 0.0014354882296174765, 'kl/avg_steps': 0.625, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████▎| 675/681 [48:57<00:15, 2.59s/it] 99%|█████████████████████████████████████████████████████████████████████████████▍| 676/681 [49:00<00:13, 2.64s/it] {'loss': 1.1331, 'grad_norm': 24.32298469543457, 'learning_rate': 1.1857007165852472e-10, 'rewards/chosen': -0.5800386667251587, 'rewards/rejected': -0.8739074468612671, 'rewards/accuracies': 0.875, 'rewards/margins': 0.2938688397407532, 'logps/chosen': -479.9848327636719, 'logps/rejected': -705.1788330078125, 'logps/ref_chosen': -71.39852142333984, 'logps/ref_rejected': -88.3587646484375, 'logits/chosen': 1.0184485912322998, 'logits/rejected': 1.7093250751495361, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'epsilon_dpo/beta': 0.0014177978737279773, 'epsilon_dpo/loss_margin_mean': 208.23379516601562, 'epsilon_dpo/beta_margin_mean': 0.29386886954307556, 'epsilon_dpo/beta_margin_std': 0.2816160023212433, 'epsilon_dpo/beta_margin_grad_mean': -0.42855557799339294, 'epsilon_dpo/beta_margin_grad_std': 0.06745462119579315, 'kl/beta': 0.0014265720965340734, 'kl/avg_steps': 0.625, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████▍| 676/681 [49:00<00:13, 2.64s/it] 99%|█████████████████████████████████████████████████████████████████████████████▌| 677/681 [49:03<00:10, 2.60s/it] {'loss': 1.1279, 'grad_norm': 15.692850112915039, 'learning_rate': 8.23423165278725e-11, 'rewards/chosen': -0.5956122875213623, 'rewards/rejected': -0.8965206146240234, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.3009083867073059, 'logps/chosen': -478.5683288574219, 'logps/rejected': -714.8244018554688, 'logps/ref_chosen': -56.52743911743164, 'logps/ref_rejected': -78.22654724121094, 'logits/chosen': 1.3182166814804077, 'logits/rejected': 2.0713021755218506, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 
0.203125, 'epsilon_dpo/beta': 0.0014094345970079303, 'epsilon_dpo/loss_margin_mean': 214.5569610595703, 'epsilon_dpo/beta_margin_mean': 0.3009083867073059, 'epsilon_dpo/beta_margin_std': 0.28873759508132935, 'epsilon_dpo/beta_margin_grad_mean': -0.42703139781951904, 'epsilon_dpo/beta_margin_grad_std': 0.06837636977434158, 'kl/beta': 0.0014177113771438599, 'kl/avg_steps': 0.59375, 'epoch': 0.99} 99%|█████████████████████████████████████████████████████████████████████████████▌| 677/681 [49:03<00:10, 2.60s/it] 100%|█████████████████████████████████████████████████████████████████████████████▋| 678/681 [49:05<00:07, 2.57s/it] {'loss': 1.1224, 'grad_norm': 23.71813201904297, 'learning_rate': 5.270012410216185e-11, 'rewards/chosen': -0.5572014451026917, 'rewards/rejected': -0.8709613084793091, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.31375986337661743, 'logps/chosen': -443.7493896484375, 'logps/rejected': -703.380615234375, 'logps/ref_chosen': -46.13447570800781, 'logps/ref_rejected': -80.60462951660156, 'logits/chosen': 1.5611693859100342, 'logits/rejected': 1.884830117225647, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'epsilon_dpo/beta': 0.001400234643369913, 'epsilon_dpo/loss_margin_mean': 225.1610870361328, 'epsilon_dpo/beta_margin_mean': 0.3137598931789398, 'epsilon_dpo/beta_margin_std': 0.32407626509666443, 'epsilon_dpo/beta_margin_grad_mean': -0.4241836965084076, 'epsilon_dpo/beta_margin_grad_std': 0.07738611847162247, 'kl/beta': 0.0014093434438109398, 'kl/avg_steps': 0.65625, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████▋| 678/681 [49:05<00:07, 2.57s/it] 100%|█████████████████████████████████████████████████████████████████████████████▊| 679/681 [49:08<00:05, 2.67s/it] {'loss': 1.169, 'grad_norm': 23.833972930908203, 'learning_rate': 2.9644275480772416e-11, 'rewards/chosen': -0.5649718046188354, 'rewards/rejected': -0.8241901397705078, 'rewards/accuracies': 0.78125, 
'rewards/margins': 0.25921830534935, 'logps/chosen': -455.281494140625, 'logps/rejected': -669.05859375, 'logps/ref_chosen': -50.294921875, 'logps/ref_rejected': -76.59813690185547, 'logits/chosen': 1.2658469676971436, 'logits/rejected': 1.8166303634643555, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'epsilon_dpo/beta': 0.0013928557746112347, 'epsilon_dpo/loss_margin_mean': 187.47393798828125, 'epsilon_dpo/beta_margin_mean': 0.2592182755470276, 'epsilon_dpo/beta_margin_std': 0.32160359621047974, 'epsilon_dpo/beta_margin_grad_mean': -0.4372449219226837, 'epsilon_dpo/beta_margin_grad_std': 0.0773983970284462, 'kl/beta': 0.0014001548988744617, 'kl/avg_steps': 0.53125, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████▊| 679/681 [49:08<00:05, 2.67s/it] 100%|█████████████████████████████████████████████████████████████████████████████▉| 680/681 [49:11<00:02, 2.71s/it] {'loss': 1.1318, 'grad_norm': 21.873979568481445, 'learning_rate': 1.31753782067201e-11, 'rewards/chosen': -0.5775318145751953, 'rewards/rejected': -0.8859383463859558, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.3084065318107605, 'logps/chosen': -493.4278869628906, 'logps/rejected': -752.9612426757812, 'logps/ref_chosen': -76.91569519042969, 'logps/ref_rejected': -112.384765625, 'logits/chosen': 1.0442841053009033, 'logits/rejected': 1.2980847358703613, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'epsilon_dpo/beta': 0.001384624862112105, 'epsilon_dpo/loss_margin_mean': 224.0642852783203, 'epsilon_dpo/beta_margin_mean': 0.3084065318107605, 'epsilon_dpo/beta_margin_std': 0.3552207052707672, 'epsilon_dpo/beta_margin_grad_mean': -0.42590072751045227, 'epsilon_dpo/beta_margin_grad_std': 0.0837513729929924, 'kl/beta': 0.0013927558902651072, 'kl/avg_steps': 0.59375, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████▉| 680/681 [49:11<00:02, 2.71s/it] 
100%|██████████████████████████████████████████████████████████████████████████████| 681/681 [49:13<00:00, 2.73s/it]
{'loss': 1.1631, 'grad_norm': 20.94184684753418, 'learning_rate': 3.2938662507808745e-12, 'rewards/chosen': -0.5930610299110413, 'rewards/rejected': -0.8560398817062378, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.26297879219055176, 'logps/chosen': -490.7020263671875, 'logps/rejected': -710.5421142578125, 'logps/ref_chosen': -60.957279205322266, 'logps/ref_rejected': -88.5579833984375, 'logits/chosen': 1.2411224842071533, 'logits/rejected': 1.6379847526550293, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'epsilon_dpo/beta': 0.0013781830202788115, 'epsilon_dpo/loss_margin_mean': 192.23939514160156, 'epsilon_dpo/beta_margin_mean': 0.26297879219055176, 'epsilon_dpo/beta_margin_std': 0.3052116930484772, 'epsilon_dpo/beta_margin_grad_mean': -0.4362422525882721, 'epsilon_dpo/beta_margin_grad_std': 0.07315977662801743, 'kl/beta': 0.0013845352223142982, 'kl/avg_steps': 0.46875, 'epoch': 1.0}
[INFO|trainer.py:3984] 2026-04-18 01:27:38,141 >> Saving model checkpoint to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-681
[INFO|configuration_utils.py:419] 2026-04-18 01:27:38,147 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-681/config.json
[INFO|configuration_utils.py:911] 2026-04-18 01:27:38,153 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-681/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-18 01:28:26,255 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-681/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-18 01:28:26,271 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-681/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-18 01:28:26,281 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-681/special_tokens_map.json
[INFO|trainer.py:4083] 2026-04-18 01:31:51,939 >> Deleting older checkpoint [/scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/checkpoint-400] due to args.save_total_limit
[INFO|trainer.py:2681] 2026-04-18 01:31:55,585 >> Training completed.
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 3235.1211, 'train_samples_per_second': 13.476, 'train_steps_per_second': 0.211, 'train_loss': 0.8920053139482126, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 681/681 [53:46<00:00, 4.74s/it]
***** train metrics *****
  epoch                    =        1.0
  total_flos               =        0GF
  train_loss               =      0.892
  train_runtime            = 0:53:55.12
  train_samples            =      43598
  train_samples_per_second =     13.476
  train_steps_per_second   =      0.211
2026-04-18 01:31:55 - INFO - __main__ - *** Training complete ***
2026-04-18 01:31:55 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-18 01:32:12,833 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/config.json
[INFO|configuration_utils.py:911] 2026-04-18 01:32:12,907 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-18 01:33:07,044 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-18 01:33:07,177 >> tokenizer config file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-18 01:33:07,341 >> Special tokens file saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/special_tokens_map.json
2026-04-18 01:33:07 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920
[INFO|modelcard.py:450] 2026-04-18 01:33:07,736 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-18 01:33:07,976 >> Configuration saved in /scratch/feng.yulu/dynamic-dpo-v4/outputs/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920/config.json
2026-04-18 01:33:07 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-18 01:33:07,976 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-18 01:33:07,976 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-18 01:33:07,976 >> Batch size = 8
  0%| | 0/73 [00:00