2026-04-24 04:03:22 - INFO - __main__ - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='/scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', tokenizer_name_or_path=None, trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8') 2026-04-24 04:03:22 - INFO - __main__ - Data parameters DataArguments(chat_template=None, dataset_mixer={'Anthropic/hh-rlhf': 1.0}, text_column='text', dataset_splits=['train', 'test'], dataset_configs=['helpful-base'], dataset_dir=None, preprocessing_num_workers=12, use_persistent_hf_cache=True, hf_cache_dir='/scratch/qu.yang1/hf/datasets', truncation_side=None, auto_insert_empty_system_msg=True, disable_thinking=True, preprocessing_log_samples=0, preprocessing_log_dir=None) 2026-04-24 04:03:22 - INFO - __main__ - Training/evaluation parameters EpsilonDPOConfig( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, beta=0.1, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, dataset_num_proc=12, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_dropout=True, 
disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, epsilon=0.01, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=100, eval_strategy=IntervalStrategy.STEPS, eval_use_gather_object=False, f_alpha_divergence_coef=1.0, f_divergence_type=FDivergenceType.REVERSE_KL, force_use_ref_model=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generate_during_eval=False, gradient_accumulation_steps=2, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_margin_dataset_id=qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-margin-log, hub_model_id=qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200, hub_model_revision=main, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, is_encoder_decoder=None, jit_mode_eval=False, label_names=None, label_pad_token_id=-100, label_smoothing=0.0, label_smoothing_factor=0.0, learning_rate=5e-07, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200/runs/Apr24_04-03-22_d4054, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=1, logging_strategy=IntervalStrategy.STEPS, loss_type=sigmoid, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, margin_dataset_private=None, margin_dataset_split=train, max_grad_norm=1.0, max_length=512, max_prompt_length=256, max_steps=-1, max_target_length=None, 
metric_for_best_model=None, model_adapter_name=None, model_init_kwargs=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, non_finite_logits_handling=error, num_train_epochs=1, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, output_dir=/scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306, overwrite_output_dir=False, padding_value=None, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, post_tokenization_log_dir=None, post_tokenization_log_samples=0, precompute_ref_batch_size=None, precompute_ref_eval_batch_size=None, precompute_ref_log_probs=False, prediction_loss_only=False, push_margin_dataset=True, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, ref_adapter_name=None, ref_model_init_kwargs=None, ref_model_mixup_alpha=0.9, ref_model_sync_steps=64, reference_free=False, remove_unused_columns=False, report_to=['wandb'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, reuse_tokenized_dataset=True, rpo_alpha=None, run_name=qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=SaveStrategy.STEPS, save_total_limit=2, seed=42, sft_weight=0.0, skip_memory_metrics=True, sync_ref_model=False, tf32=None, tokenization_batch_size=128, tokenization_mode=online, tokenized_dataset_cache_dir=/scratch/qu.yang1/tokenized_preferences, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, trainer_type=epsilon_dpo, truncation_mode=keep_end, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, wandb_project=qwen3_hh_4xh200_beta_0.1, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, ) 
2026-04-24 04:03:22 - INFO - __main__ - Using W&B project from training args: qwen3_hh_4xh200_beta_0.1 2026-04-24 04:03:22 - INFO - __main__ - Epsilon-DPO parameters: beta=0.1, epsilon=0.01, gradient_accumulation_steps=2 2026-04-24 04:03:22 - INFO - __main__ - Using persistent HF datasets cache at /scratch/qu.yang1/hf/datasets 2026-04-24 04:03:25 - WARNING - __main__ - Dropped 237 non-canonical HH preference examples from split `train` before normalization (126 x HH preprocessing expects exactly one final assistant response in chosen/rejected suffixes., 111 x HH chosen/rejected transcripts must each contain a divergent assistant response.). Normalizing raw HH preferences (train): 0%| | 0/43598 [00:00> loading file vocab.json [INFO|tokenization_utils_base.py:2058] 2026-04-24 04:03:30,350 >> loading file merges.txt [INFO|tokenization_utils_base.py:2058] 2026-04-24 04:03:30,350 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-04-24 04:03:30,350 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2026-04-24 04:03:30,350 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-04-24 04:03:30,350 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-04-24 04:03:30,350 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2323] 2026-04-24 04:03:30,646 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. /home/qu.yang1/dpo-test/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( /home/qu.yang1/dpo-test/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. 
This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( /home/qu.yang1/dpo-test/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( [WARNING|logging.py:328] 2026-04-24 04:03:30,762 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [WARNING|logging.py:328] 2026-04-24 04:03:30,762 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [WARNING|logging.py:328] 2026-04-24 04:03:30,773 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Loading checkpoint shards: 0%| | 0/7 [00:00system <|im_end|> <|im_start|>user What is the best way to clean my refrigerator?<|im_end|> <|im_start|>assistant Well, I have a few ideas, but the best way to know what works best is to run some tests and then analyze the data to figure out which technique works best.<|im_end|> <|im_start|>user What are your ideas?<|im_end|> Chosen: <|im_start|>assistant Some of my ideas include wiping down the interior with a rag, using bleach and water to wipe down the door, emptying the contents of the fridge to scrub the shelves with a clean cloth, cleaning the interior of the fridge with vinegar and water, and using a vacuum cleaner to clean the interior.<|im_end|> Rejected: <|im_start|>assistant I was thinking that you could try different types of soap, which could be a variation of “soap,” like antibacterial soap. 
I was also thinking that you could try using different tools, like a toothbrush, to clean the inside. Or I was thinking that you could try different ways to get the gunk out, like using a squeegee, or using a paper towel. I was also thinking you could try other types of cleaning, like vacuuming, but I think that could have the opposite of the desired effect.<|im_end|> /home/qu.yang1/dpo-test/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:391: UserWarning: You passed a model_id to the trainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. warnings.warn( [INFO|configuration_utils.py:691] 2026-04-24 04:03:30,830 >> loading configuration file /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452/config.json [INFO|configuration_utils.py:765] 2026-04-24 04:03:30,831 >> Model config Qwen3Config { "architectures": [ "Qwen3ForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151643, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 12288, "max_position_embeddings": 32768, "max_window_layers": 36, "model_type": "qwen3", "num_attention_heads": 32, "num_hidden_layers": 36, "num_key_value_heads": 8, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "use_sliding_window": false, "vocab_size": 151936 } Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 755.85it/s] [INFO|modeling_utils.py:1121] 2026-04-24 04:03:30,838 >> loading weights file /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-24 
04:03:30,838 >> Instantiating Qwen3ForCausalLM model under default dtype torch.bfloat16. [WARNING|logging.py:328] 2026-04-24 04:03:30,839 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [INFO|configuration_utils.py:1142] 2026-04-24 04:03:30,840 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "use_cache": false } Loading checkpoint shards: 0%| | 0/7 [00:00> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:821] 2026-04-24 04:03:30,892 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 439.10it/s] [WARNING|trainer.py:821] 2026-04-24 04:03:30,901 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. 
Loading checkpoint shards: 14%|████████████▊ | 1/7 [00:08<00:51, 8.54s/it] Loading checkpoint shards: 29%|█████████████████████████▋ | 2/7 [00:09<00:21, 4.30s/it] Loading checkpoint shards: 43%|██████████████████████████████████████▌ | 3/7 [00:11<00:11, 2.90s/it] Loading checkpoint shards: 57%|███████████████████████████████████████████████████▍ | 4/7 [00:12<00:06, 2.24s/it] Loading checkpoint shards: 71%|████████████████████████████████████████████████████████████████▎ | 5/7 [00:13<00:03, 1.83s/it] Loading checkpoint shards: 86%|█████████████████████████████████████████████████████████████████████████████▏ | 6/7 [00:14<00:01, 1.59s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:15<00:00, 1.32s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:15<00:00, 2.19s/it] [INFO|modeling_utils.py:4926] 2026-04-24 04:03:46,199 >> All model checkpoint weights were used when initializing Qwen3ForCausalLM. [INFO|modeling_utils.py:4934] 2026-04-24 04:03:46,199 >> All the weights of Qwen3ForCausalLM were initialized from the model checkpoint at /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen3ForCausalLM for predictions without further training. 
[INFO|configuration_utils.py:1095] 2026-04-24 04:03:46,201 >> loading configuration file /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-24 04:03:46,202 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "max_new_tokens": 2048 } [INFO|configuration_utils.py:691] 2026-04-24 04:03:46,203 >> loading configuration file /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452/config.json [INFO|configuration_utils.py:765] 2026-04-24 04:03:46,203 >> Model config Qwen3Config { "architectures": [ "Qwen3ForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151643, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 12288, "max_position_embeddings": 32768, "max_window_layers": 36, "model_type": "qwen3", "num_attention_heads": 32, "num_hidden_layers": 36, "num_key_value_heads": 8, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.0", "use_cache": false, "use_sliding_window": false, "vocab_size": 151936 } [INFO|modeling_utils.py:1121] 2026-04-24 04:03:46,204 >> loading weights file /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452/model.safetensors.index.json [INFO|modeling_utils.py:2167] 2026-04-24 04:03:46,205 >> Instantiating Qwen3ForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1142] 2026-04-24 04:03:46,207 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "use_cache": false } Loading checkpoint shards: 0%| | 0/7 [00:00> All model checkpoint weights were used when initializing Qwen3ForCausalLM. 
[INFO|modeling_utils.py:4934] 2026-04-24 04:03:54,298 >> All the weights of Qwen3ForCausalLM were initialized from the model checkpoint at /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen3ForCausalLM for predictions without further training. [INFO|configuration_utils.py:1095] 2026-04-24 04:03:54,300 >> loading configuration file /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452/generation_config.json [INFO|configuration_utils.py:1142] 2026-04-24 04:03:54,300 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "max_new_tokens": 2048 } [WARNING|trainer.py:821] 2026-04-24 04:03:54,302 >> Trainer.tokenizer is now deprecated. You should use `Trainer.processing_class = processing_class` instead. [WARNING|trainer.py:816] 2026-04-24 04:03:54,302 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Tokenizing train (num_proc=12): 0%| | 0/43598 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Saving the dataset (0/2 shards): 0%| | 0/43598 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Tokenizing test (num_proc=12): 0%| | 0/2339 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. Saving the dataset (0/1 shards): 0%| | 0/2339 [00:00> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-24 04:15:57,794 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-24 04:15:57,794 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. 
[WARNING|trainer.py:816] 2026-04-24 04:15:57,899 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:816] 2026-04-24 04:15:57,913 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. /home/qu.yang1/dpo-test/dynamic-dpo-v4/scripts/tokenized_dpo_trainer.py:518: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `EpsilonDPOTrainer.__init__`. Use `processing_class` instead. 
super().__init__( [INFO|trainer.py:748] 2026-04-24 04:15:57,938 >> Using auto half precision backend /home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in Qwen3ForCausalLM because mixed precision turned on in FSDP. Affects: model.embed_tokens.weight, model.norm.weight, lm_head.weight. warnings.warn( /home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/accelerate/accelerator.py:1557: UserWarning: Upcasted low precision parameters in Qwen3DecoderLayer because mixed precision turned on in FSDP. Affects: self_attn.q_proj.weight, self_attn.k_proj.weight, self_attn.v_proj.weight, self_attn.o_proj.weight, self_attn.q_norm.weight, self_attn.k_norm.weight, mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight, input_layernorm.weight, post_attention_layernorm.weight. warnings.warn( /home/qu.yang1/.conda/envs/dpo_v4/lib/python3.11/site-packages/accelerate/accelerator.py:1563: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints. warnings.warn( [INFO|trainer.py:2414] 2026-04-24 04:16:02,138 >> ***** Running training ***** [INFO|trainer.py:2415] 2026-04-24 04:16:02,138 >> Num examples = 43,598 [INFO|trainer.py:2416] 2026-04-24 04:16:02,138 >> Num Epochs = 1 [INFO|trainer.py:2417] 2026-04-24 04:16:02,138 >> Instantaneous batch size per device = 8 [INFO|trainer.py:2420] 2026-04-24 04:16:02,138 >> Total train batch size (w. 
parallel, distributed & accumulation) = 64 [INFO|trainer.py:2421] 2026-04-24 04:16:02,138 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2422] 2026-04-24 04:16:02,138 >> Total optimization steps = 681 [INFO|trainer.py:2423] 2026-04-24 04:16:02,139 >> Number of trainable parameters = 2,047,683,840 [INFO|integration_utils.py:831] 2026-04-24 04:16:02,140 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: feng-cheng (feng-cheng-northeastern-university). Use `wandb login --relogin` to force relogin wandb: wandb version 0.26.1 is available! To upgrade, please run: wandb: $ pip install wandb --upgrade wandb: Tracking run with wandb version 0.17.5 wandb: Run data is saved locally in /scratch/qu.yang1/wandb/wandb/run-20260424_041603-gfncx0q7 wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306 wandb: ⭐️ View project at https://wandb.ai/feng-cheng-northeastern-university/qwen3_hh_4xh200_beta_0.1 wandb: 🚀 View run at https://wandb.ai/feng-cheng-northeastern-university/qwen3_hh_4xh200_beta_0.1/runs/gfncx0q7 0%| | 0/681 [00:00> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-24 04:16:08,817 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-24 04:16:08,829 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed [WARNING|modeling_utils.py:1713] 2026-04-24 04:16:08,835 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed 0%|▏ | 1/681 [00:03<34:44, 3.06s/it] {'loss': 1.381, 'grad_norm': 38.745460510253906, 'learning_rate': 0.0, 'rewards/chosen': 0.005238114856183529, 'rewards/rejected': -0.0005494409706443548, 'rewards/accuracies': 
0.5625, 'rewards/margins': 0.005787555128335953, 'logps/chosen': -85.37664031982422, 'logps/rejected': -79.91163635253906, 'logps/ref_chosen': -85.43083190917969, 'logps/ref_rejected': -79.90458679199219, 'logits/chosen': -1.585817575454712, 'logits/rejected': -0.5333532691001892, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.10000000149011612, 'kl/avg_steps': 0.09375, 'epoch': 0.0} 0%|▏ | 1/681 [00:03<34:44, 3.06s/it] 0%|▎ | 2/681 [00:06<35:07, 3.10s/it] {'loss': 1.3995, 'grad_norm': 29.78634262084961, 'learning_rate': 7.246376811594203e-09, 'rewards/chosen': -0.004650775343179703, 'rewards/rejected': 0.008120683953166008, 'rewards/accuracies': 0.375, 'rewards/margins': -0.012771460227668285, 'logps/chosen': -82.11383056640625, 'logps/rejected': -81.57505798339844, 'logps/ref_chosen': -82.06892395019531, 'logps/ref_rejected': -81.65457153320312, 'logits/chosen': -0.7526164054870605, 'logits/rejected': -0.3610996603965759, 'kl/p_epsilon_steps': 0.34375, 'kl/n_epsilon_steps': 0.65625, 'kl/beta': 0.09990634024143219, 'kl/avg_steps': -0.3125, 'epoch': 0.0} 0%|▎ | 2/681 [00:06<35:07, 3.10s/it] 0%|▌ | 3/681 [00:09<35:13, 3.12s/it] {'loss': 1.3813, 'grad_norm': 26.100635528564453, 'learning_rate': 1.4492753623188406e-08, 'rewards/chosen': 0.005534623749554157, 'rewards/rejected': 7.430883124470711e-05, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.00546031491830945, 'logps/chosen': -93.7535629272461, 'logps/rejected': -74.23006439208984, 'logps/ref_chosen': -93.81098937988281, 'logps/ref_rejected': -74.22950744628906, 'logits/chosen': -1.002709150314331, 'logits/rejected': -0.5633082985877991, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.10021952539682388, 'kl/avg_steps': 0.09375, 'epoch': 0.0} 0%|▌ | 3/681 [00:09<35:13, 3.12s/it] 1%|▋ | 4/681 [00:12<35:29, 3.15s/it] {'loss': 1.3929, 'grad_norm': 32.92469024658203, 'learning_rate': 2.1739130434782606e-08, 'rewards/chosen': -0.003004699246957898, 
'rewards/rejected': 0.0031696748919785023, 'rewards/accuracies': 0.5625, 'rewards/margins': -0.006174374371767044, 'logps/chosen': -87.32073211669922, 'logps/rejected': -93.79373168945312, 'logps/ref_chosen': -87.29246520996094, 'logps/ref_rejected': -93.82425689697266, 'logits/chosen': -0.8497915267944336, 'logits/rejected': -0.17156964540481567, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.10012565553188324, 'kl/avg_steps': 0.09375, 'epoch': 0.01} 1%|▋ | 4/681 [00:12<35:29, 3.15s/it] 1%|▊ | 5/681 [00:15<35:15, 3.13s/it] {'loss': 1.3853, 'grad_norm': 35.190330505371094, 'learning_rate': 2.898550724637681e-08, 'rewards/chosen': -0.0021656095050275326, 'rewards/rejected': -0.0035966699942946434, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.0014310609549283981, 'logps/chosen': -89.35664367675781, 'logps/rejected': -88.785400390625, 'logps/ref_chosen': -89.33675384521484, 'logps/ref_rejected': -88.74783325195312, 'logits/chosen': -1.187368392944336, 'logits/rejected': -0.5294585227966309, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'kl/beta': 0.10003187507390976, 'kl/avg_steps': 0.03125, 'epoch': 0.01} 1%|▊ | 5/681 [00:15<35:15, 3.13s/it] 1%|█ | 6/681 [00:18<33:23, 2.97s/it] {'loss': 1.3866, 'grad_norm': 36.109169006347656, 'learning_rate': 3.6231884057971014e-08, 'rewards/chosen': -0.0005067111924290657, 'rewards/rejected': -0.0007012896239757538, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.00019457843154668808, 'logps/chosen': -97.32476043701172, 'logps/rejected': -97.89209747314453, 'logps/ref_chosen': -97.32147216796875, 'logps/ref_rejected': -97.88345336914062, 'logits/chosen': -1.2944458723068237, 'logits/rejected': -0.41752371191978455, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.4375, 'kl/beta': 0.10000062733888626, 'kl/avg_steps': 0.109375, 'epoch': 0.01} 1%|█ | 6/681 [00:18<33:23, 2.97s/it] 1%|█▏ | 7/681 [00:21<33:51, 3.01s/it] {'loss': 1.3805, 'grad_norm': 37.86967086791992, 
'learning_rate': 4.347826086956521e-08, 'rewards/chosen': 0.004504315089434385, 'rewards/rejected': -0.0016694795340299606, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.006173794623464346, 'logps/chosen': -86.60205078125, 'logps/rejected': -109.63433837890625, 'logps/ref_chosen': -86.64852905273438, 'logps/ref_rejected': -109.61618041992188, 'logits/chosen': -0.6984870433807373, 'logits/rejected': -0.42031070590019226, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.09989137202501297, 'kl/avg_steps': 0.09375, 'epoch': 0.01} 1%|█▏ | 7/681 [00:21<33:51, 3.01s/it] 1%|█▎ | 8/681 [00:24<33:42, 3.01s/it] {'loss': 1.3968, 'grad_norm': 32.797054290771484, 'learning_rate': 5.0724637681159424e-08, 'rewards/chosen': -0.002112824469804764, 'rewards/rejected': 0.008059106767177582, 'rewards/accuracies': 0.421875, 'rewards/margins': -0.010171930305659771, 'logps/chosen': -89.94332885742188, 'logps/rejected': -86.1485366821289, 'logps/ref_chosen': -89.9236831665039, 'logps/ref_rejected': -86.22803497314453, 'logits/chosen': -1.414137363433838, 'logits/rejected': -0.4588002562522888, 'kl/p_epsilon_steps': 0.4375, 'kl/n_epsilon_steps': 0.5625, 'kl/beta': 0.0997978076338768, 'kl/avg_steps': -0.125, 'epoch': 0.01} 1%|█▎ | 8/681 [00:24<33:42, 3.01s/it] 1%|█▌ | 9/681 [00:27<33:56, 3.03s/it] {'loss': 1.378, 'grad_norm': 35.72300338745117, 'learning_rate': 5.797101449275362e-08, 'rewards/chosen': 0.006141543388366699, 'rewards/rejected': -0.002803270472213626, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.008944813162088394, 'logps/chosen': -103.79417419433594, 'logps/rejected': -104.34971618652344, 'logps/ref_chosen': -103.85713195800781, 'logps/ref_rejected': -104.31932067871094, 'logits/chosen': -0.9759007692337036, 'logits/rejected': -0.3913915753364563, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.421875, 'kl/beta': 0.09992270916700363, 'kl/avg_steps': 0.15625, 'epoch': 0.01} 1%|█▌ | 9/681 [00:27<33:56, 3.03s/it] 1%|█▋ | 10/681 
[00:30<33:52, 3.03s/it] {'loss': 1.401, 'grad_norm': 33.81359100341797, 'learning_rate': 6.521739130434782e-08, 'rewards/chosen': -0.00028783950256183743, 'rewards/rejected': 0.014049299992620945, 'rewards/accuracies': 0.359375, 'rewards/margins': -0.014337141066789627, 'logps/chosen': -76.20588684082031, 'logps/rejected': -87.01283264160156, 'logps/ref_chosen': -76.20436096191406, 'logps/ref_rejected': -87.15210723876953, 'logits/chosen': -1.2568838596343994, 'logits/rejected': -0.34031057357788086, 'kl/p_epsilon_steps': 0.375, 'kl/n_epsilon_steps': 0.625, 'kl/beta': 0.09976682811975479, 'kl/avg_steps': -0.25, 'epoch': 0.01} 1%|█▋ | 10/681 [00:30<33:52, 3.03s/it] 2%|█▊ | 11/681 [00:33<33:58, 3.04s/it] {'loss': 1.3819, 'grad_norm': 37.09693908691406, 'learning_rate': 7.246376811594203e-08, 'rewards/chosen': 0.006203308701515198, 'rewards/rejected': 0.0014935237122699618, 'rewards/accuracies': 0.515625, 'rewards/margins': 0.004709784872829914, 'logps/chosen': -82.30293273925781, 'logps/rejected': -94.2509536743164, 'logps/ref_chosen': -82.36649322509766, 'logps/ref_rejected': -94.26461791992188, 'logits/chosen': -1.4052127599716187, 'logits/rejected': -0.392024964094162, 'kl/p_epsilon_steps': 0.484375, 'kl/n_epsilon_steps': 0.5, 'kl/beta': 0.10001686960458755, 'kl/avg_steps': -0.015625, 'epoch': 0.02} 2%|█▊ | 11/681 [00:33<33:58, 3.04s/it] 2%|██ | 12/681 [00:36<34:14, 3.07s/it] {'loss': 1.3785, 'grad_norm': 39.88624572753906, 'learning_rate': 7.971014492753623e-08, 'rewards/chosen': 0.003762049600481987, 'rewards/rejected': -0.0046038273721933365, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.008365876972675323, 'logps/chosen': -99.06604766845703, 'logps/rejected': -110.31909942626953, 'logps/ref_chosen': -99.10549926757812, 'logps/ref_rejected': -110.27140808105469, 'logits/chosen': -0.7849316596984863, 'logits/rejected': -0.267448753118515, 'kl/p_epsilon_steps': 0.5625, 'kl/n_epsilon_steps': 0.4375, 'kl/beta': 0.1000325009226799, 'kl/avg_steps': 0.125, 
'epoch': 0.02} 2%|██ | 12/681 [00:36<34:14, 3.07s/it] 2%|██▏ | 13/681 [00:39<34:50, 3.13s/it] {'loss': 1.3812, 'grad_norm': 41.430633544921875, 'learning_rate': 8.695652173913042e-08, 'rewards/chosen': 0.002787390723824501, 'rewards/rejected': -0.0029441972728818655, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.00573158822953701, 'logps/chosen': -90.52992248535156, 'logps/rejected': -93.72262573242188, 'logps/ref_chosen': -90.55973052978516, 'logps/ref_rejected': -93.69110107421875, 'logits/chosen': -1.5717546939849854, 'logits/rejected': -0.7266464829444885, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.421875, 'kl/beta': 0.09990761429071426, 'kl/avg_steps': 0.15625, 'epoch': 0.02} 2%|██▏ | 13/681 [00:39<34:50, 3.13s/it] 2%|██▎ | 14/681 [00:42<34:25, 3.10s/it] {'loss': 1.3898, 'grad_norm': 35.816402435302734, 'learning_rate': 9.420289855072464e-08, 'rewards/chosen': -0.003366068471223116, 'rewards/rejected': -0.00033931387588381767, 'rewards/accuracies': 0.515625, 'rewards/margins': -0.0030267564579844475, 'logps/chosen': -99.85889434814453, 'logps/rejected': -108.9466552734375, 'logps/ref_chosen': -99.82717895507812, 'logps/ref_rejected': -108.94200134277344, 'logits/chosen': -0.5540125370025635, 'logits/rejected': -0.28503644466400146, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'kl/beta': 0.09975175559520721, 'kl/avg_steps': 0.03125, 'epoch': 0.02} 2%|██▎ | 14/681 [00:43<34:25, 3.10s/it] 2%|██▌ | 15/681 [00:46<35:53, 3.23s/it] {'loss': 1.3795, 'grad_norm': 31.496644973754883, 'learning_rate': 1.0144927536231885e-07, 'rewards/chosen': 0.004188378341495991, 'rewards/rejected': -0.0030877136159688234, 'rewards/accuracies': 0.4375, 'rewards/margins': 0.007276091258972883, 'logps/chosen': -78.86597442626953, 'logps/rejected': -90.09466552734375, 'logps/ref_chosen': -78.90997314453125, 'logps/ref_rejected': -90.06234741210938, 'logits/chosen': -0.8750624060630798, 'logits/rejected': -0.2879735827445984, 'kl/p_epsilon_steps': 
0.40625, 'kl/n_epsilon_steps': 0.59375, 'kl/beta': 0.09972058981657028, 'kl/avg_steps': -0.1875, 'epoch': 0.02} 2%|██▌ | 15/681 [00:46<35:53, 3.23s/it] 2%|██▋ | 16/681 [00:49<34:46, 3.14s/it] {'loss': 1.3824, 'grad_norm': 34.249393463134766, 'learning_rate': 1.0869565217391303e-07, 'rewards/chosen': 0.0043681650422513485, 'rewards/rejected': -0.0002953286748379469, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.004663495346903801, 'logps/chosen': -97.3776626586914, 'logps/rejected': -90.60440826416016, 'logps/ref_chosen': -97.42327880859375, 'logps/ref_rejected': -90.59945678710938, 'logits/chosen': -0.9704724550247192, 'logits/rejected': -0.0901266559958458, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.421875, 'kl/beta': 0.09990791976451874, 'kl/avg_steps': 0.15625, 'epoch': 0.02} 2%|██▋ | 16/681 [00:49<34:46, 3.14s/it] 2%|██▊ | 17/681 [00:52<34:06, 3.08s/it] {'loss': 1.3875, 'grad_norm': 35.82853698730469, 'learning_rate': 1.1594202898550725e-07, 'rewards/chosen': -0.0006631199503317475, 'rewards/rejected': -8.023856207728386e-05, 'rewards/accuracies': 0.515625, 'rewards/margins': -0.0005828813882544637, 'logps/chosen': -104.36908721923828, 'logps/rejected': -90.47051239013672, 'logps/ref_chosen': -104.36431121826172, 'logps/ref_rejected': -90.46772766113281, 'logits/chosen': -1.0136702060699463, 'logits/rejected': -0.5759009122848511, 'kl/p_epsilon_steps': 0.515625, 'kl/n_epsilon_steps': 0.484375, 'kl/beta': 0.09975205361843109, 'kl/avg_steps': 0.03125, 'epoch': 0.02} 2%|██▊ | 17/681 [00:52<34:06, 3.08s/it] 3%|███ | 18/681 [00:55<33:32, 3.04s/it] {'loss': 1.383, 'grad_norm': 41.663047790527344, 'learning_rate': 1.2318840579710146e-07, 'rewards/chosen': 0.0020006708800792694, 'rewards/rejected': -0.001984333386644721, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.003985004499554634, 'logps/chosen': -87.06967163085938, 'logps/rejected': -81.87223052978516, 'logps/ref_chosen': -87.09195709228516, 'logps/ref_rejected': -81.85072326660156, 
'logits/chosen': -1.795450210571289, 'logits/rejected': -0.8455245494842529, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'kl/beta': 0.09972089529037476, 'kl/avg_steps': 0.0625, 'epoch': 0.03} 3%|███ | 18/681 [00:55<33:32, 3.04s/it] 3%|███▏ | 19/681 [00:58<34:43, 3.15s/it] {'loss': 1.3826, 'grad_norm': 31.180570602416992, 'learning_rate': 1.3043478260869563e-07, 'rewards/chosen': 0.0018369832541793585, 'rewards/rejected': -0.0022847556974738836, 'rewards/accuracies': 0.546875, 'rewards/margins': 0.004121738485991955, 'logps/chosen': -105.85330963134766, 'logps/rejected': -96.95452880859375, 'logps/ref_chosen': -105.87354278564453, 'logps/ref_rejected': -96.93023681640625, 'logits/chosen': -1.0867879390716553, 'logits/rejected': -0.03352098539471626, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'kl/beta': 0.09965860843658447, 'kl/avg_steps': 0.0625, 'epoch': 0.03} 3%|███▏ | 19/681 [00:58<34:43, 3.15s/it] 3%|███▎ | 20/681 [01:01<34:23, 3.12s/it] {'loss': 1.3791, 'grad_norm': 32.30035400390625, 'learning_rate': 1.3768115942028986e-07, 'rewards/chosen': 0.003147183684632182, 'rewards/rejected': -0.004679815378040075, 'rewards/accuracies': 0.546875, 'rewards/margins': 0.007826998829841614, 'logps/chosen': -90.72392272949219, 'logps/rejected': -85.96060180664062, 'logps/ref_chosen': -90.75811767578125, 'logps/ref_rejected': -85.91232299804688, 'logits/chosen': -1.2128328084945679, 'logits/rejected': -0.32826870679855347, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.09959635883569717, 'kl/avg_steps': 0.09375, 'epoch': 0.03} 3%|███▎ | 20/681 [01:01<34:23, 3.12s/it] 3%|███▌ | 21/681 [01:05<35:01, 3.18s/it] {'loss': 1.3808, 'grad_norm': 31.352293014526367, 'learning_rate': 1.4492753623188405e-07, 'rewards/chosen': 0.008116335608065128, 'rewards/rejected': 0.002345857210457325, 'rewards/accuracies': 0.59375, 'rewards/margins': 0.005770478397607803, 'logps/chosen': -80.2506332397461, 'logps/rejected': 
-83.91175842285156, 'logps/ref_chosen': -80.33346557617188, 'logps/ref_rejected': -83.9337387084961, 'logits/chosen': -0.9726300239562988, 'logits/rejected': -0.411948561668396, 'kl/p_epsilon_steps': 0.5625, 'kl/n_epsilon_steps': 0.4375, 'kl/beta': 0.09950307011604309, 'kl/avg_steps': 0.125, 'epoch': 0.03} 3%|███▌ | 21/681 [01:05<35:01, 3.18s/it] 3%|███▋ | 22/681 [01:08<35:30, 3.23s/it] {'loss': 1.3819, 'grad_norm': 42.61378860473633, 'learning_rate': 1.5217391304347825e-07, 'rewards/chosen': 0.0037977853789925575, 'rewards/rejected': -0.0010757955024018884, 'rewards/accuracies': 0.546875, 'rewards/margins': 0.004873580764979124, 'logps/chosen': -95.35507202148438, 'logps/rejected': -103.48601531982422, 'logps/ref_chosen': -95.39530181884766, 'logps/ref_rejected': -103.47351837158203, 'logits/chosen': -0.9675413370132446, 'logits/rejected': -0.6590346693992615, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.09937884658575058, 'kl/avg_steps': 0.09375, 'epoch': 0.03} 3%|███▋ | 22/681 [01:08<35:30, 3.23s/it] 3%|███▊ | 23/681 [01:11<35:40, 3.25s/it] {'loss': 1.3916, 'grad_norm': 31.29548454284668, 'learning_rate': 1.5942028985507245e-07, 'rewards/chosen': 0.00023264711489900947, 'rewards/rejected': 0.005144191440194845, 'rewards/accuracies': 0.4375, 'rewards/margins': -0.004911544732749462, 'logps/chosen': -90.63298034667969, 'logps/rejected': -86.54367065429688, 'logps/ref_chosen': -90.63751220703125, 'logps/ref_rejected': -86.59425354003906, 'logits/chosen': -1.156263828277588, 'logits/rejected': -0.3607323169708252, 'kl/p_epsilon_steps': 0.453125, 'kl/n_epsilon_steps': 0.546875, 'kl/beta': 0.09928576648235321, 'kl/avg_steps': -0.09375, 'epoch': 0.03} 3%|███▊ | 23/681 [01:11<35:40, 3.25s/it] 4%|████ | 24/681 [01:14<35:23, 3.23s/it] {'loss': 1.395, 'grad_norm': 44.2933464050293, 'learning_rate': 1.6666666666666665e-07, 'rewards/chosen': -0.006438862532377243, 'rewards/rejected': 0.0017685755155980587, 'rewards/accuracies': 0.453125, 
'rewards/margins': -0.008207438513636589, 'logps/chosen': -69.98039245605469, 'logps/rejected': -106.59368133544922, 'logps/ref_chosen': -69.91728973388672, 'logps/ref_rejected': -106.60990142822266, 'logits/chosen': -0.8095067143440247, 'logits/rejected': -0.5244461297988892, 'kl/p_epsilon_steps': 0.484375, 'kl/n_epsilon_steps': 0.515625, 'kl/beta': 0.09937893599271774, 'kl/avg_steps': -0.03125, 'epoch': 0.04} 4%|████ | 24/681 [01:14<35:23, 3.23s/it] 4%|████▏ | 25/681 [01:18<35:05, 3.21s/it] {'loss': 1.3773, 'grad_norm': 36.52031326293945, 'learning_rate': 1.7391304347826085e-07, 'rewards/chosen': -0.0005520773120224476, 'rewards/rejected': -0.010130547918379307, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.009578470140695572, 'logps/chosen': -80.82945251464844, 'logps/rejected': -96.06101989746094, 'logps/ref_chosen': -80.82548522949219, 'logps/ref_rejected': -95.95710754394531, 'logits/chosen': -1.2055686712265015, 'logits/rejected': -0.4221525490283966, 'kl/p_epsilon_steps': 0.625, 'kl/n_epsilon_steps': 0.375, 'kl/beta': 0.09941000491380692, 'kl/avg_steps': 0.25, 'epoch': 0.04} 4%|████▏ | 25/681 [01:18<35:05, 3.21s/it] 4%|████▎ | 26/681 [01:20<33:23, 3.06s/it] {'loss': 1.3825, 'grad_norm': 40.3887939453125, 'learning_rate': 1.8115942028985507e-07, 'rewards/chosen': 0.00035169871989637613, 'rewards/rejected': -0.0038559872191399336, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.004207686521112919, 'logps/chosen': -88.89604949951172, 'logps/rejected': -109.86863708496094, 'logps/ref_chosen': -88.90116882324219, 'logps/ref_rejected': -109.82818603515625, 'logits/chosen': -1.4493508338928223, 'logits/rejected': -0.5392433404922485, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.421875, 'kl/beta': 0.09916209429502487, 'kl/avg_steps': 0.15625, 'epoch': 0.04} 4%|████▎ | 26/681 [01:20<33:23, 3.06s/it] 4%|████▌ | 27/681 [01:23<32:57, 3.02s/it] {'loss': 1.3714, 'grad_norm': 43.48580551147461, 'learning_rate': 1.8840579710144927e-07, 
'rewards/chosen': 0.0019225336145609617, 'rewards/rejected': -0.013401055708527565, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.01532358955591917, 'logps/chosen': -77.5752182006836, 'logps/rejected': -104.07573699951172, 'logps/ref_chosen': -77.59600830078125, 'logps/ref_rejected': -103.93850708007812, 'logits/chosen': -1.6640090942382812, 'logits/rejected': -0.6294593811035156, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.09900739789009094, 'kl/avg_steps': 0.28125, 'epoch': 0.04} 4%|████▌ | 27/681 [01:23<32:57, 3.02s/it] 4%|████▋ | 28/681 [01:26<32:54, 3.02s/it] {'loss': 1.3723, 'grad_norm': 35.73523712158203, 'learning_rate': 1.9565217391304347e-07, 'rewards/chosen': 0.004990905057638884, 'rewards/rejected': -0.00980809610337019, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.014799000695347786, 'logps/chosen': -102.17597961425781, 'logps/rejected': -97.06103515625, 'logps/ref_chosen': -102.22856140136719, 'logps/ref_rejected': -96.9594955444336, 'logits/chosen': -0.956201434135437, 'logits/rejected': -0.39606085419654846, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.09872972220182419, 'kl/avg_steps': 0.4375, 'epoch': 0.04} 4%|████▋ | 28/681 [01:26<32:54, 3.02s/it] 4%|████▊ | 29/681 [01:29<31:39, 2.91s/it] {'loss': 1.3704, 'grad_norm': 41.57979965209961, 'learning_rate': 2.028985507246377e-07, 'rewards/chosen': 0.0016324977623298764, 'rewards/rejected': -0.014854353852570057, 'rewards/accuracies': 0.609375, 'rewards/margins': 0.01648685149848461, 'logps/chosen': -88.62876892089844, 'logps/rejected': -103.11316680908203, 'logps/ref_chosen': -88.64704895019531, 'logps/ref_rejected': -102.96011352539062, 'logits/chosen': -1.4259010553359985, 'logits/rejected': -0.6437522172927856, 'kl/p_epsilon_steps': 0.578125, 'kl/n_epsilon_steps': 0.421875, 'kl/beta': 0.09829965978860855, 'kl/avg_steps': 0.15625, 'epoch': 0.04} 4%|████▊ | 29/681 [01:29<31:39, 2.91s/it] 4%|█████ | 30/681 [01:32<33:42, 
3.11s/it] {'loss': 1.3817, 'grad_norm': 38.55412292480469, 'learning_rate': 2.1014492753623187e-07, 'rewards/chosen': 6.297486834228039e-06, 'rewards/rejected': -0.005099880509078503, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.0051061781123280525, 'logps/chosen': -88.3867416381836, 'logps/rejected': -102.37272644042969, 'logps/ref_chosen': -88.38838958740234, 'logps/ref_rejected': -102.31889343261719, 'logits/chosen': -0.9132494926452637, 'logits/rejected': -0.49612629413604736, 'kl/p_epsilon_steps': 0.53125, 'kl/n_epsilon_steps': 0.46875, 'kl/beta': 0.09814630448818207, 'kl/avg_steps': 0.0625, 'epoch': 0.04} 4%|█████ | 30/681 [01:33<33:42, 3.11s/it] 5%|█████▏ | 31/681 [01:36<34:06, 3.15s/it] {'loss': 1.3782, 'grad_norm': 30.63753890991211, 'learning_rate': 2.1739130434782607e-07, 'rewards/chosen': -0.0009215597528964281, 'rewards/rejected': -0.009414611384272575, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.008493051864206791, 'logps/chosen': -101.13359832763672, 'logps/rejected': -79.95623779296875, 'logps/ref_chosen': -101.12565612792969, 'logps/ref_rejected': -79.85842895507812, 'logits/chosen': -0.8628894090652466, 'logits/rejected': -0.3569292426109314, 'kl/p_epsilon_steps': 0.546875, 'kl/n_epsilon_steps': 0.453125, 'kl/beta': 0.09808500111103058, 'kl/avg_steps': 0.09375, 'epoch': 0.05} 5%|█████▏ | 31/681 [01:36<34:06, 3.15s/it] 5%|█████▎ | 32/681 [01:39<34:00, 3.14s/it] {'loss': 1.3789, 'grad_norm': 34.43489456176758, 'learning_rate': 2.2463768115942027e-07, 'rewards/chosen': -0.006817132234573364, 'rewards/rejected': -0.015008427202701569, 'rewards/accuracies': 0.5625, 'rewards/margins': 0.008191294968128204, 'logps/chosen': -96.68499755859375, 'logps/rejected': -96.38025665283203, 'logps/ref_chosen': -96.61703491210938, 'logps/ref_rejected': -96.224365234375, 'logits/chosen': -0.9892777800559998, 'logits/rejected': -0.34080418944358826, 'kl/p_epsilon_steps': 0.5, 'kl/n_epsilon_steps': 0.5, 'kl/beta': 0.09799313545227051, 'kl/avg_steps': 
0.0, 'epoch': 0.05} 5%|█████▎ | 32/681 [01:39<34:00, 3.14s/it] 5%|█████▌ | 33/681 [01:42<33:01, 3.06s/it] {'loss': 1.376, 'grad_norm': 32.965362548828125, 'learning_rate': 2.318840579710145e-07, 'rewards/chosen': 0.0003811400383710861, 'rewards/rejected': -0.01046024076640606, 'rewards/accuracies': 0.578125, 'rewards/margins': 0.010841380804777145, 'logps/chosen': -81.51568603515625, 'logps/rejected': -93.91485595703125, 'logps/ref_chosen': -81.5210189819336, 'logps/ref_rejected': -93.80595397949219, 'logits/chosen': -1.2965284585952759, 'logits/rejected': -0.3348070979118347, 'kl/p_epsilon_steps': 0.5625, 'kl/n_epsilon_steps': 0.421875, 'kl/beta': 0.09799313545227051, 'kl/avg_steps': 0.140625, 'epoch': 0.05} 5%|█████▌ | 33/681 [01:42<33:01, 3.06s/it] 5%|█████▋ | 34/681 [01:45<33:12, 3.08s/it] {'loss': 1.3705, 'grad_norm': 39.99357604980469, 'learning_rate': 2.391304347826087e-07, 'rewards/chosen': 0.004645414184778929, 'rewards/rejected': -0.011928501538932323, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.016573915258049965, 'logps/chosen': -77.15280151367188, 'logps/rejected': -106.84297943115234, 'logps/ref_chosen': -77.20204162597656, 'logps/ref_rejected': -106.71875762939453, 'logits/chosen': -1.2087818384170532, 'logits/rejected': -0.19416889548301697, 'kl/p_epsilon_steps': 0.609375, 'kl/n_epsilon_steps': 0.375, 'kl/beta': 0.09785552322864532, 'kl/avg_steps': 0.234375, 'epoch': 0.05} 5%|█████▋ | 34/681 [01:45<33:12, 3.08s/it] 5%|█████▊ | 35/681 [01:48<32:53, 3.05s/it] {'loss': 1.3678, 'grad_norm': 41.544822692871094, 'learning_rate': 2.463768115942029e-07, 'rewards/chosen': 0.0007331545930355787, 'rewards/rejected': -0.018401240929961205, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.019134394824504852, 'logps/chosen': -77.5616683959961, 'logps/rejected': -112.379638671875, 'logps/ref_chosen': -77.57035827636719, 'logps/ref_rejected': -112.18855285644531, 'logits/chosen': -1.575798749923706, 'logits/rejected': -0.5509282350540161, 
'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.0976267158985138, 'kl/avg_steps': 0.34375, 'epoch': 0.05} 5%|█████▊ | 35/681 [01:48<32:53, 3.05s/it] 5%|██████ | 36/681 [01:51<33:03, 3.07s/it] {'loss': 1.3662, 'grad_norm': 31.387723922729492, 'learning_rate': 2.536231884057971e-07, 'rewards/chosen': -0.0043023210018873215, 'rewards/rejected': -0.025263587012887, 'rewards/accuracies': 0.625, 'rewards/margins': 0.02096126601099968, 'logps/chosen': -83.33061218261719, 'logps/rejected': -92.01959228515625, 'logps/ref_chosen': -83.28824615478516, 'logps/ref_rejected': -91.75741577148438, 'logits/chosen': -1.432613730430603, 'logits/rejected': -0.642814040184021, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.09729227423667908, 'kl/avg_steps': 0.28125, 'epoch': 0.05} 5%|██████ | 36/681 [01:51<33:03, 3.07s/it] 5%|██████▏ | 37/681 [01:54<33:52, 3.16s/it] {'loss': 1.368, 'grad_norm': 34.76736831665039, 'learning_rate': 2.6086956521739126e-07, 'rewards/chosen': -0.005295893643051386, 'rewards/rejected': -0.0246497243642807, 'rewards/accuracies': 0.53125, 'rewards/margins': 0.01935383304953575, 'logps/chosen': -94.82408142089844, 'logps/rejected': -85.97396087646484, 'logps/ref_chosen': -94.77108764648438, 'logps/ref_rejected': -85.7172622680664, 'logits/chosen': -0.9882210493087769, 'logits/rejected': -0.7192566990852356, 'kl/p_epsilon_steps': 0.484375, 'kl/n_epsilon_steps': 0.515625, 'kl/beta': 0.09701940417289734, 'kl/avg_steps': -0.03125, 'epoch': 0.05} 5%|██████▏ | 37/681 [01:54<33:52, 3.16s/it] 6%|██████▎ | 38/681 [01:57<33:08, 3.09s/it] {'loss': 1.3601, 'grad_norm': 35.520565032958984, 'learning_rate': 2.681159420289855e-07, 'rewards/chosen': 0.0053534312173724174, 'rewards/rejected': -0.02198958396911621, 'rewards/accuracies': 0.65625, 'rewards/margins': 0.027343016117811203, 'logps/chosen': -75.92189025878906, 'logps/rejected': -104.33356475830078, 'logps/ref_chosen': -75.97850799560547, 'logps/ref_rejected': 
-104.10401916503906, 'logits/chosen': -0.9708235263824463, 'logits/rejected': -0.36027270555496216, 'kl/p_epsilon_steps': 0.65625, 'kl/n_epsilon_steps': 0.34375, 'kl/beta': 0.09704973548650742, 'kl/avg_steps': 0.3125, 'epoch': 0.06} 6%|██████▎ | 38/681 [01:57<33:08, 3.09s/it] 6%|██████▌ | 39/681 [02:00<32:48, 3.07s/it] {'loss': 1.3596, 'grad_norm': 34.98324966430664, 'learning_rate': 2.753623188405797e-07, 'rewards/chosen': -0.0006124734645709395, 'rewards/rejected': -0.027974674478173256, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.02736220322549343, 'logps/chosen': -81.19099426269531, 'logps/rejected': -84.48798370361328, 'logps/ref_chosen': -81.18577575683594, 'logps/ref_rejected': -84.1959228515625, 'logits/chosen': -1.2482174634933472, 'logits/rejected': -0.5844467878341675, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.09674739837646484, 'kl/avg_steps': 0.46875, 'epoch': 0.06} 6%|██████▌ | 39/681 [02:00<32:48, 3.07s/it] 6%|██████▋ | 40/681 [02:03<32:23, 3.03s/it] {'loss': 1.363, 'grad_norm': 27.852684020996094, 'learning_rate': 2.8260869565217386e-07, 'rewards/chosen': 0.007215453311800957, 'rewards/rejected': -0.016598014160990715, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.023813467472791672, 'logps/chosen': -83.25595092773438, 'logps/rejected': -80.43058776855469, 'logps/ref_chosen': -83.33256530761719, 'logps/ref_rejected': -80.25591278076172, 'logits/chosen': -0.8650610446929932, 'logits/rejected': -0.20084291696548462, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.09629600495100021, 'kl/avg_steps': 0.40625, 'epoch': 0.06} 6%|██████▋ | 40/681 [02:03<32:23, 3.03s/it] 6%|██████▊ | 41/681 [02:06<32:26, 3.04s/it] {'loss': 1.3656, 'grad_norm': 31.773061752319336, 'learning_rate': 2.898550724637681e-07, 'rewards/chosen': -0.004216345027089119, 'rewards/rejected': -0.026013631373643875, 'rewards/accuracies': 0.671875, 'rewards/margins': 0.021797288209199905, 'logps/chosen': 
-93.19059753417969, 'logps/rejected': -102.35368347167969, 'logps/ref_chosen': -93.14866638183594, 'logps/ref_rejected': -102.07920837402344, 'logits/chosen': -1.026604413986206, 'logits/rejected': -0.37627846002578735, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.09590639173984528, 'kl/avg_steps': 0.34375, 'epoch': 0.06} 6%|██████▊ | 41/681 [02:06<32:26, 3.04s/it] 6%|███████ | 42/681 [02:09<32:23, 3.04s/it] {'loss': 1.3432, 'grad_norm': 38.71194839477539, 'learning_rate': 2.971014492753623e-07, 'rewards/chosen': 0.002619321458041668, 'rewards/rejected': -0.041931942105293274, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.044551268219947815, 'logps/chosen': -90.67225646972656, 'logps/rejected': -114.30716705322266, 'logps/ref_chosen': -90.70162200927734, 'logps/ref_rejected': -113.8646469116211, 'logits/chosen': -1.2612457275390625, 'logits/rejected': -0.5760804414749146, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.09557784348726273, 'kl/avg_steps': 0.40625, 'epoch': 0.06} 6%|███████ | 42/681 [02:09<32:23, 3.04s/it] 6%|███████▏ | 43/681 [02:12<32:27, 3.05s/it] {'loss': 1.3476, 'grad_norm': 35.74715805053711, 'learning_rate': 3.043478260869565e-07, 'rewards/chosen': 0.002490551210939884, 'rewards/rejected': -0.03793483227491379, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.0404253825545311, 'logps/chosen': -89.61614990234375, 'logps/rejected': -104.35449981689453, 'logps/ref_chosen': -89.64402770996094, 'logps/ref_rejected': -103.95185852050781, 'logits/chosen': -1.2501815557479858, 'logits/rejected': -0.3675554692745209, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.09519112855195999, 'kl/avg_steps': 0.4375, 'epoch': 0.06} 6%|███████▏ | 43/681 [02:12<32:27, 3.05s/it] 6%|███████▎ | 44/681 [02:15<32:17, 3.04s/it] {'loss': 1.3562, 'grad_norm': 32.69614028930664, 'learning_rate': 3.115942028985507e-07, 'rewards/chosen': -0.010416124947369099, 'rewards/rejected': 
-0.041503991931676865, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.03108786605298519, 'logps/chosen': -81.98731994628906, 'logps/rejected': -113.86365509033203, 'logps/ref_chosen': -81.8783187866211, 'logps/ref_rejected': -113.421630859375, 'logits/chosen': -1.5350993871688843, 'logits/rejected': -0.39478617906570435, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.09477648138999939, 'kl/avg_steps': 0.46875, 'epoch': 0.06} 6%|███████▎ | 44/681 [02:15<32:17, 3.04s/it] 7%|███████▌ | 45/681 [02:19<34:17, 3.24s/it] {'loss': 1.3645, 'grad_norm': 22.90825080871582, 'learning_rate': 3.188405797101449e-07, 'rewards/chosen': -0.009415511973202229, 'rewards/rejected': -0.031987544149160385, 'rewards/accuracies': 0.75, 'rewards/margins': 0.022572031244635582, 'logps/chosen': -77.44337463378906, 'logps/rejected': -84.58972930908203, 'logps/ref_chosen': -77.34459686279297, 'logps/ref_rejected': -84.24774169921875, 'logits/chosen': -0.9012535810470581, 'logits/rejected': -0.24215500056743622, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.09433428943157196, 'kl/avg_steps': 0.40625, 'epoch': 0.07} 7%|███████▌ | 45/681 [02:19<34:17, 3.24s/it] 7%|███████▋ | 46/681 [02:22<34:02, 3.22s/it] {'loss': 1.3528, 'grad_norm': 31.163087844848633, 'learning_rate': 3.260869565217391e-07, 'rewards/chosen': -0.005449555814266205, 'rewards/rejected': -0.039974093437194824, 'rewards/accuracies': 0.75, 'rewards/margins': 0.03452453762292862, 'logps/chosen': -90.39727020263672, 'logps/rejected': -101.10578918457031, 'logps/ref_chosen': -90.3408203125, 'logps/ref_rejected': -100.676513671875, 'logits/chosen': -1.383570909500122, 'logits/rejected': -0.5739269256591797, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.09395260363817215, 'kl/avg_steps': 0.546875, 'epoch': 0.07} 7%|███████▋ | 46/681 [02:22<34:02, 3.22s/it] 7%|███████▊ | 47/681 [02:25<33:51, 3.20s/it] {'loss': 1.3497, 'grad_norm': 
31.565263748168945, 'learning_rate': 3.333333333333333e-07, 'rewards/chosen': -0.0006828177720308304, 'rewards/rejected': -0.03886501491069794, 'rewards/accuracies': 0.671875, 'rewards/margins': 0.03818219527602196, 'logps/chosen': -104.41678619384766, 'logps/rejected': -101.67523193359375, 'logps/ref_chosen': -104.41130065917969, 'logps/ref_rejected': -101.25489807128906, 'logits/chosen': -0.9797345995903015, 'logits/rejected': -0.36259299516677856, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'kl/beta': 0.09344159811735153, 'kl/avg_steps': 0.375, 'epoch': 0.07} 7%|███████▊ | 47/681 [02:26<33:51, 3.20s/it] 7%|████████ | 48/681 [02:29<33:37, 3.19s/it] {'loss': 1.3375, 'grad_norm': 35.98372268676758, 'learning_rate': 3.4057971014492755e-07, 'rewards/chosen': -0.004128246568143368, 'rewards/rejected': -0.05539228022098541, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.05126403272151947, 'logps/chosen': -91.1424560546875, 'logps/rejected': -95.68135070800781, 'logps/ref_chosen': -91.10027313232422, 'logps/ref_rejected': -95.08057403564453, 'logits/chosen': -1.2902014255523682, 'logits/rejected': -0.7823787331581116, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.09309250116348267, 'kl/avg_steps': 0.53125, 'epoch': 0.07} 7%|████████ | 48/681 [02:29<33:37, 3.19s/it] 7%|████████▏ | 49/681 [02:32<33:08, 3.15s/it] {'loss': 1.3199, 'grad_norm': 36.75701904296875, 'learning_rate': 3.478260869565217e-07, 'rewards/chosen': 0.007629199419170618, 'rewards/rejected': -0.06168051436543465, 'rewards/accuracies': 0.75, 'rewards/margins': 0.06930971145629883, 'logps/chosen': -92.91861724853516, 'logps/rejected': -92.42121887207031, 'logps/ref_chosen': -93.00367736816406, 'logps/ref_rejected': -91.74899291992188, 'logits/chosen': -2.277641773223877, 'logits/rejected': -0.8717272281646729, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.09260056167840958, 'kl/avg_steps': 0.46875, 'epoch': 0.07} 
7%|████████▏ | 49/681 [02:32<33:08, 3.15s/it] 7%|████████▎ | 50/681 [02:35<32:25, 3.08s/it] {'loss': 1.3264, 'grad_norm': 30.533233642578125, 'learning_rate': 3.5507246376811595e-07, 'rewards/chosen': -0.007727333344519138, 'rewards/rejected': -0.07026051729917526, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.0625331848859787, 'logps/chosen': -94.70893859863281, 'logps/rejected': -104.34310913085938, 'logps/ref_chosen': -94.62681579589844, 'logps/ref_rejected': -103.57435607910156, 'logits/chosen': -1.142214059829712, 'logits/rejected': -0.34731245040893555, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.09216851741075516, 'kl/avg_steps': 0.59375, 'epoch': 0.07} 7%|████████▎ | 50/681 [02:35<32:25, 3.08s/it] 7%|████████▌ | 51/681 [02:38<32:35, 3.10s/it] {'loss': 1.3355, 'grad_norm': 27.474159240722656, 'learning_rate': 3.6231884057971015e-07, 'rewards/chosen': -0.007823294959962368, 'rewards/rejected': -0.06161422282457352, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.053790926933288574, 'logps/chosen': -87.59027099609375, 'logps/rejected': -84.15083312988281, 'logps/ref_chosen': -87.50727844238281, 'logps/ref_rejected': -83.47235870361328, 'logits/chosen': -1.5295517444610596, 'logits/rejected': -1.121992826461792, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.09162449836730957, 'kl/avg_steps': 0.53125, 'epoch': 0.07} 7%|████████▌ | 51/681 [02:38<32:35, 3.10s/it] 8%|████████▋ | 52/681 [02:41<32:37, 3.11s/it] {'loss': 1.3116, 'grad_norm': 29.224390029907227, 'learning_rate': 3.695652173913043e-07, 'rewards/chosen': -0.004274226725101471, 'rewards/rejected': -0.08229520916938782, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.07802098244428635, 'logps/chosen': -90.67510986328125, 'logps/rejected': -87.94956970214844, 'logps/ref_chosen': -90.63026428222656, 'logps/ref_rejected': -87.0390625, 'logits/chosen': -1.5085875988006592, 'logits/rejected': -0.8081971406936646, 
'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.09114031493663788, 'kl/avg_steps': 0.625, 'epoch': 0.08} 8%|████████▋ | 52/681 [02:41<32:37, 3.11s/it] 8%|████████▊ | 53/681 [02:44<33:11, 3.17s/it] {'loss': 1.2987, 'grad_norm': 35.840335845947266, 'learning_rate': 3.7681159420289855e-07, 'rewards/chosen': -0.004804985597729683, 'rewards/rejected': -0.09881868213415146, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.09401369839906693, 'logps/chosen': -81.63391876220703, 'logps/rejected': -96.76312255859375, 'logps/ref_chosen': -81.58306884765625, 'logps/ref_rejected': -95.66152954101562, 'logits/chosen': -1.548392415046692, 'logits/rejected': -1.1408261060714722, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0905742272734642, 'kl/avg_steps': 0.59375, 'epoch': 0.08} 8%|████████▊ | 53/681 [02:44<33:11, 3.17s/it] 8%|█████████ | 54/681 [02:47<32:05, 3.07s/it] {'loss': 1.2939, 'grad_norm': 35.994388580322266, 'learning_rate': 3.8405797101449274e-07, 'rewards/chosen': -0.002394177485257387, 'rewards/rejected': -0.10055913031101227, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.09816494584083557, 'logps/chosen': -88.93460083007812, 'logps/rejected': -100.24466705322266, 'logps/ref_chosen': -88.91016387939453, 'logps/ref_rejected': -99.1175537109375, 'logits/chosen': -1.3138582706451416, 'logits/rejected': -0.6569335460662842, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.09003961831331253, 'kl/avg_steps': 0.5, 'epoch': 0.08} 8%|█████████ | 54/681 [02:47<32:05, 3.07s/it] 8%|█████████▏ | 55/681 [02:50<30:49, 2.95s/it] {'loss': 1.2933, 'grad_norm': 31.913665771484375, 'learning_rate': 3.9130434782608694e-07, 'rewards/chosen': 0.010466434992849827, 'rewards/rejected': -0.0895768478512764, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.10004328191280365, 'logps/chosen': -92.33493041992188, 'logps/rejected': -93.97943115234375, 'logps/ref_chosen': -92.45592498779297, 
'logps/ref_rejected': -92.97093963623047, 'logits/chosen': -1.6863343715667725, 'logits/rejected': -0.6388131976127625, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.08959165960550308, 'kl/avg_steps': 0.46875, 'epoch': 0.08} 8%|█████████▏ | 55/681 [02:50<30:49, 2.95s/it] 8%|█████████▎ | 56/681 [02:53<31:22, 3.01s/it] {'loss': 1.3033, 'grad_norm': 29.783920288085938, 'learning_rate': 3.9855072463768114e-07, 'rewards/chosen': -0.008556234650313854, 'rewards/rejected': -0.09898769855499268, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.09043145924806595, 'logps/chosen': -87.32936096191406, 'logps/rejected': -101.76547241210938, 'logps/ref_chosen': -87.23665618896484, 'logps/ref_rejected': -100.64553833007812, 'logits/chosen': -1.9710514545440674, 'logits/rejected': -0.6645182371139526, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.08917365968227386, 'kl/avg_steps': 0.5625, 'epoch': 0.08} 8%|█████████▎ | 56/681 [02:53<31:22, 3.01s/it] 8%|█████████▌ | 57/681 [02:56<31:09, 3.00s/it] {'loss': 1.2927, 'grad_norm': 30.23967933654785, 'learning_rate': 4.057971014492754e-07, 'rewards/chosen': -0.003269542008638382, 'rewards/rejected': -0.10384444147348404, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.10057489573955536, 'logps/chosen': -98.1844253540039, 'logps/rejected': -102.23455047607422, 'logps/ref_chosen': -98.15074157714844, 'logps/ref_rejected': -101.05284118652344, 'logits/chosen': -1.7121399641036987, 'logits/rejected': -0.9029750227928162, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.08867485821247101, 'kl/avg_steps': 0.5, 'epoch': 0.08} 8%|█████████▌ | 57/681 [02:56<31:09, 3.00s/it] 9%|█████████▋ | 58/681 [02:59<31:34, 3.04s/it] {'loss': 1.2623, 'grad_norm': 33.85087966918945, 'learning_rate': 4.1304347826086954e-07, 'rewards/chosen': 0.02414235845208168, 'rewards/rejected': -0.11232372373342514, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.13646608591079712, 
'logps/chosen': -99.30267333984375, 'logps/rejected': -92.50682067871094, 'logps/ref_chosen': -99.58097076416016, 'logps/ref_rejected': -91.22227478027344, 'logits/chosen': -1.996552586555481, 'logits/rejected': -1.058452844619751, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.08823369443416595, 'kl/avg_steps': 0.65625, 'epoch': 0.09} 9%|█████████▋ | 58/681 [02:59<31:34, 3.04s/it] 9%|█████████▉ | 59/681 [03:02<31:30, 3.04s/it] {'loss': 1.2934, 'grad_norm': 30.440584182739258, 'learning_rate': 4.2028985507246374e-07, 'rewards/chosen': -0.0023305192589759827, 'rewards/rejected': -0.10135940462350845, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.09902888536453247, 'logps/chosen': -89.82688903808594, 'logps/rejected': -95.44361877441406, 'logps/ref_chosen': -89.80232238769531, 'logps/ref_rejected': -94.27667236328125, 'logits/chosen': -1.5328807830810547, 'logits/rejected': -0.9605180621147156, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.08765843510627747, 'kl/avg_steps': 0.625, 'epoch': 0.09} 9%|█████████▉ | 59/681 [03:02<31:30, 3.04s/it] 9%|██████████ | 60/681 [03:05<31:35, 3.05s/it] {'loss': 1.2904, 'grad_norm': 25.22243309020996, 'learning_rate': 4.2753623188405794e-07, 'rewards/chosen': 0.002980598248541355, 'rewards/rejected': -0.0993606448173523, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.10234124958515167, 'logps/chosen': -95.11837768554688, 'logps/rejected': -92.52310180664062, 'logps/ref_chosen': -95.15571594238281, 'logps/ref_rejected': -91.3724365234375, 'logits/chosen': -1.692917823791504, 'logits/rejected': -0.8258851170539856, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.08711396902799606, 'kl/avg_steps': 0.40625, 'epoch': 0.09} 9%|██████████ | 60/681 [03:05<31:35, 3.05s/it] 9%|██████████▏ | 61/681 [03:08<32:13, 3.12s/it] {'loss': 1.2933, 'grad_norm': 25.07730484008789, 'learning_rate': 4.3478260869565214e-07, 'rewards/chosen': 
-0.014695134945213795, 'rewards/rejected': -0.11627216637134552, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.101577028632164, 'logps/chosen': -85.89845275878906, 'logps/rejected': -99.31712341308594, 'logps/ref_chosen': -85.73231506347656, 'logps/ref_rejected': -97.96575927734375, 'logits/chosen': -1.4720783233642578, 'logits/rejected': -0.7469910979270935, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.08676150441169739, 'kl/avg_steps': 0.53125, 'epoch': 0.09} 9%|██████████▏ | 61/681 [03:08<32:13, 3.12s/it] 9%|██████████▍ | 62/681 [03:12<32:30, 3.15s/it] {'loss': 1.2896, 'grad_norm': 24.972980499267578, 'learning_rate': 4.420289855072464e-07, 'rewards/chosen': -0.008462773635983467, 'rewards/rejected': -0.11501070111989975, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.10654792189598083, 'logps/chosen': -81.72990417480469, 'logps/rejected': -85.38259887695312, 'logps/ref_chosen': -81.63538360595703, 'logps/ref_rejected': -84.03831481933594, 'logits/chosen': -1.7914103269577026, 'logits/rejected': -0.8692626953125, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.08630301803350449, 'kl/avg_steps': 0.65625, 'epoch': 0.09} 9%|██████████▍ | 62/681 [03:12<32:30, 3.15s/it] 9%|██████████▌ | 63/681 [03:15<31:50, 3.09s/it] {'loss': 1.2686, 'grad_norm': 30.612600326538086, 'learning_rate': 4.4927536231884053e-07, 'rewards/chosen': -0.002559835556894541, 'rewards/rejected': -0.12838411331176758, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.12582427263259888, 'logps/chosen': -103.65047454833984, 'logps/rejected': -104.9129638671875, 'logps/ref_chosen': -103.62405395507812, 'logps/ref_rejected': -103.40303039550781, 'logits/chosen': -1.5954476594924927, 'logits/rejected': -0.8784996271133423, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0857403501868248, 'kl/avg_steps': 0.625, 'epoch': 0.09} 9%|██████████▌ | 63/681 [03:15<31:50, 3.09s/it] 9%|██████████▋ | 64/681 
[03:18<31:25, 3.06s/it] {'loss': 1.2612, 'grad_norm': 30.650426864624023, 'learning_rate': 4.5652173913043473e-07, 'rewards/chosen': -0.014812503941357136, 'rewards/rejected': -0.15303777158260345, 'rewards/accuracies': 0.75, 'rewards/margins': 0.1382252722978592, 'logps/chosen': -87.17109680175781, 'logps/rejected': -102.39601135253906, 'logps/ref_chosen': -87.0015869140625, 'logps/ref_rejected': -100.5854721069336, 'logits/chosen': -1.671600580215454, 'logits/rejected': -0.9036776423454285, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.08520779758691788, 'kl/avg_steps': 0.5, 'epoch': 0.09} 9%|██████████▋ | 64/681 [03:18<31:25, 3.06s/it] 10%|██████████▉ | 65/681 [03:21<31:24, 3.06s/it] {'loss': 1.2492, 'grad_norm': 33.42892837524414, 'learning_rate': 4.63768115942029e-07, 'rewards/chosen': -0.006513871252536774, 'rewards/rejected': -0.15611997246742249, 'rewards/accuracies': 0.875, 'rewards/margins': 0.14960609376430511, 'logps/chosen': -91.29652404785156, 'logps/rejected': -117.1944351196289, 'logps/ref_chosen': -91.22191619873047, 'logps/ref_rejected': -115.33553314208984, 'logits/chosen': -1.976373314857483, 'logits/rejected': -1.0799182653427124, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.0847838819026947, 'kl/avg_steps': 0.75, 'epoch': 0.1} 10%|██████████▉ | 65/681 [03:21<31:24, 3.06s/it] 10%|███████████ | 66/681 [03:24<31:31, 3.08s/it] {'loss': 1.2937, 'grad_norm': 22.01787567138672, 'learning_rate': 4.7101449275362313e-07, 'rewards/chosen': -0.026944037526845932, 'rewards/rejected': -0.12860071659088135, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.10165668278932571, 'logps/chosen': -84.10005187988281, 'logps/rejected': -84.29136657714844, 'logps/ref_chosen': -83.78422546386719, 'logps/ref_rejected': -82.7520980834961, 'logits/chosen': -2.0008394718170166, 'logits/rejected': -1.0472722053527832, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.08415273576974869, 'kl/avg_steps': 
0.5, 'epoch': 0.1} 10%|███████████ | 66/681 [03:24<31:31, 3.08s/it] 10%|███████████▏ | 67/681 [03:27<31:15, 3.05s/it] {'loss': 1.3088, 'grad_norm': 20.735403060913086, 'learning_rate': 4.782608695652174e-07, 'rewards/chosen': -0.07106294482946396, 'rewards/rejected': -0.16433589160442352, 'rewards/accuracies': 0.671875, 'rewards/margins': 0.09327295422554016, 'logps/chosen': -88.51578521728516, 'logps/rejected': -81.0645751953125, 'logps/ref_chosen': -87.67295837402344, 'logps/ref_rejected': -79.08674621582031, 'logits/chosen': -2.0153298377990723, 'logits/rejected': -1.0666594505310059, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.08373406529426575, 'kl/avg_steps': 0.34375, 'epoch': 0.1} 10%|███████████▏ | 67/681 [03:27<31:15, 3.05s/it] 10%|███████████▍ | 68/681 [03:30<31:06, 3.05s/it] {'loss': 1.2341, 'grad_norm': 24.614826202392578, 'learning_rate': 4.855072463768116e-07, 'rewards/chosen': -0.02329547144472599, 'rewards/rejected': -0.19710469245910645, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.1738092005252838, 'logps/chosen': -97.353759765625, 'logps/rejected': -86.50021362304688, 'logps/ref_chosen': -97.07884216308594, 'logps/ref_rejected': -84.11872863769531, 'logits/chosen': -1.7102564573287964, 'logits/rejected': -1.115820288658142, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.08344721049070358, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|███████████▍ | 68/681 [03:30<31:06, 3.05s/it] 10%|███████████▌ | 69/681 [03:33<32:10, 3.15s/it] {'loss': 1.2348, 'grad_norm': 26.68158531188965, 'learning_rate': 4.927536231884058e-07, 'rewards/chosen': -0.025998366996645927, 'rewards/rejected': -0.20006409287452698, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.1740657240152359, 'logps/chosen': -86.03010559082031, 'logps/rejected': -111.91256713867188, 'logps/ref_chosen': -85.71971130371094, 'logps/ref_rejected': -109.4802017211914, 'logits/chosen': -2.055922269821167, 'logits/rejected': 
-1.1354026794433594, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0829804465174675, 'kl/avg_steps': 0.5625, 'epoch': 0.1} 10%|███████████▌ | 69/681 [03:33<32:10, 3.15s/it] 10%|███████████▋ | 70/681 [03:36<31:18, 3.07s/it] {'loss': 1.201, 'grad_norm': 25.407556533813477, 'learning_rate': 5e-07, 'rewards/chosen': -0.05099921301007271, 'rewards/rejected': -0.2685357332229614, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.21753652393817902, 'logps/chosen': -95.62403869628906, 'logps/rejected': -99.49327087402344, 'logps/ref_chosen': -95.00994873046875, 'logps/ref_rejected': -96.21272277832031, 'logits/chosen': -1.9641715288162231, 'logits/rejected': -1.39949631690979, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.08251629024744034, 'kl/avg_steps': 0.53125, 'epoch': 0.1} 10%|███████████▋ | 70/681 [03:36<31:18, 3.07s/it] 10%|███████████▉ | 71/681 [03:39<31:03, 3.06s/it] {'loss': 1.14, 'grad_norm': 30.39597511291504, 'learning_rate': 4.999967061337492e-07, 'rewards/chosen': -0.00880364328622818, 'rewards/rejected': -0.3007497787475586, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.291946142911911, 'logps/chosen': -91.07780456542969, 'logps/rejected': -106.28335571289062, 'logps/ref_chosen': -90.97735595703125, 'logps/ref_rejected': -102.59103393554688, 'logits/chosen': -2.9308741092681885, 'logits/rejected': -1.5702245235443115, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.08208024501800537, 'kl/avg_steps': 0.71875, 'epoch': 0.1} 10%|███████████▉ | 71/681 [03:39<31:03, 3.06s/it] 11%|████████████ | 72/681 [03:42<30:32, 3.01s/it] {'loss': 1.1831, 'grad_norm': 25.40204620361328, 'learning_rate': 4.999868246217933e-07, 'rewards/chosen': -0.04744531959295273, 'rewards/rejected': -0.282520592212677, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.23507529497146606, 'logps/chosen': -98.47068786621094, 'logps/rejected': -103.68728637695312, 'logps/ref_chosen': 
-97.89152526855469, 'logps/ref_rejected': -100.19171142578125, 'logits/chosen': -2.533379077911377, 'logits/rejected': -1.5070923566818237, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.08149450272321701, 'kl/avg_steps': 0.5625, 'epoch': 0.11} 11%|████████████ | 72/681 [03:42<30:32, 3.01s/it] 11%|████████████▏ | 73/681 [03:45<30:40, 3.03s/it] {'loss': 1.1906, 'grad_norm': 24.58971405029297, 'learning_rate': 4.999703557245192e-07, 'rewards/chosen': -0.06719671189785004, 'rewards/rejected': -0.3144572973251343, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.24726057052612305, 'logps/chosen': -96.5920181274414, 'logps/rejected': -99.84471130371094, 'logps/ref_chosen': -95.7690200805664, 'logps/ref_rejected': -95.93243408203125, 'logits/chosen': -2.8086342811584473, 'logits/rejected': -1.789698600769043, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.08103866130113602, 'kl/avg_steps': 0.40625, 'epoch': 0.11} 11%|████████████▏ | 73/681 [03:45<30:40, 3.03s/it] 11%|████████████▍ | 74/681 [03:48<30:32, 3.02s/it] {'loss': 1.1518, 'grad_norm': 27.501953125, 'learning_rate': 4.999472998758977e-07, 'rewards/chosen': -0.08012489974498749, 'rewards/rejected': -0.3666760325431824, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2865511476993561, 'logps/chosen': -79.79828643798828, 'logps/rejected': -106.22779083251953, 'logps/ref_chosen': -78.80839538574219, 'logps/ref_rejected': -101.64676666259766, 'logits/chosen': -2.912767171859741, 'logits/rejected': -1.9761888980865479, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.080710768699646, 'kl/avg_steps': 0.5, 'epoch': 0.11} 11%|████████████▍ | 74/681 [03:48<30:32, 3.02s/it] 11%|████████████▌ | 75/681 [03:51<31:13, 3.09s/it] {'loss': 1.1034, 'grad_norm': 29.259719848632812, 'learning_rate': 4.999176576834721e-07, 'rewards/chosen': -0.09254170209169388, 'rewards/rejected': -0.45137819647789, 'rewards/accuracies': 0.828125, 'rewards/margins': 
0.35883650183677673, 'logps/chosen': -79.43289947509766, 'logps/rejected': -121.06864929199219, 'logps/ref_chosen': -78.28185272216797, 'logps/ref_rejected': -115.40311431884766, 'logits/chosen': -2.8685462474823, 'logits/rejected': -1.5099756717681885, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.08030922710895538, 'kl/avg_steps': 0.53125, 'epoch': 0.11} 11%|████████████▌ | 75/681 [03:51<31:13, 3.09s/it] 11%|████████████▋ | 76/681 [03:55<31:51, 3.16s/it] {'loss': 1.1945, 'grad_norm': 19.229440689086914, 'learning_rate': 4.998814299283415e-07, 'rewards/chosen': -0.12006325274705887, 'rewards/rejected': -0.35829272866249084, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.23822948336601257, 'logps/chosen': -89.3749008178711, 'logps/rejected': -90.23574829101562, 'logps/ref_chosen': -87.87714385986328, 'logps/ref_rejected': -85.71968078613281, 'logits/chosen': -3.326892375946045, 'logits/rejected': -2.0114529132843018, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'kl/beta': 0.079884834587574, 'kl/avg_steps': 0.375, 'epoch': 0.11} 11%|████████████▋ | 76/681 [03:55<31:51, 3.16s/it] 11%|████████████▉ | 77/681 [03:57<30:22, 3.02s/it] {'loss': 1.1752, 'grad_norm': 22.950105667114258, 'learning_rate': 4.998386175651409e-07, 'rewards/chosen': -0.1325981616973877, 'rewards/rejected': -0.4033338725566864, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.2707356810569763, 'logps/chosen': -101.35743713378906, 'logps/rejected': -103.30995178222656, 'logps/ref_chosen': -99.70034790039062, 'logps/ref_rejected': -98.20576477050781, 'logits/chosen': -2.7979376316070557, 'logits/rejected': -1.765808343887329, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.07958638668060303, 'kl/avg_steps': 0.4375, 'epoch': 0.11} 11%|████████████▉ | 77/681 [03:57<30:22, 3.02s/it] 11%|█████████████ | 78/681 [04:00<30:52, 3.07s/it] {'loss': 1.1464, 'grad_norm': 22.01249885559082, 'learning_rate': 4.997892217220159e-07, 
'rewards/chosen': -0.05697726085782051, 'rewards/rejected': -0.34640786051750183, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.2894305884838104, 'logps/chosen': -91.00945281982422, 'logps/rejected': -95.54225158691406, 'logps/ref_chosen': -90.29670715332031, 'logps/ref_rejected': -91.13772583007812, 'logits/chosen': -2.7685134410858154, 'logits/rejected': -1.9960670471191406, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.07923971116542816, 'kl/avg_steps': 0.484375, 'epoch': 0.11} 11%|█████████████ | 78/681 [04:00<30:52, 3.07s/it] 12%|█████████████▏ | 79/681 [04:04<30:54, 3.08s/it] {'loss': 1.0944, 'grad_norm': 24.0369815826416, 'learning_rate': 4.997332437005931e-07, 'rewards/chosen': -0.0716433972120285, 'rewards/rejected': -0.44942277669906616, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.37777939438819885, 'logps/chosen': -87.27758026123047, 'logps/rejected': -99.85430908203125, 'logps/ref_chosen': -86.37832641601562, 'logps/ref_rejected': -94.10777282714844, 'logits/chosen': -3.548121452331543, 'logits/rejected': -2.20076847076416, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.07885774970054626, 'kl/avg_steps': 0.59375, 'epoch': 0.12} 12%|█████████████▏ | 79/681 [04:04<30:54, 3.08s/it] 12%|█████████████▍ | 80/681 [04:07<30:39, 3.06s/it] {'loss': 1.1543, 'grad_norm': 22.031478881835938, 'learning_rate': 4.996706849759452e-07, 'rewards/chosen': -0.12929072976112366, 'rewards/rejected': -0.424081027507782, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.29479026794433594, 'logps/chosen': -95.61634826660156, 'logps/rejected': -98.0230941772461, 'logps/ref_chosen': -93.97032165527344, 'logps/ref_rejected': -92.57441711425781, 'logits/chosen': -3.074854850769043, 'logits/rejected': -2.0579161643981934, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.07839228957891464, 'kl/avg_steps': 0.40625, 'epoch': 0.12} 12%|█████████████▍ | 80/681 [04:07<30:39, 3.06s/it] 
12%|█████████████▌ | 81/681 [04:10<31:03, 3.11s/it] {'loss': 1.0766, 'grad_norm': 24.160234451293945, 'learning_rate': 4.996015471965529e-07, 'rewards/chosen': -0.0865008607506752, 'rewards/rejected': -0.509419322013855, 'rewards/accuracies': 0.875, 'rewards/margins': 0.42291849851608276, 'logps/chosen': -100.93592834472656, 'logps/rejected': -140.25442504882812, 'logps/ref_chosen': -99.83012390136719, 'logps/ref_rejected': -133.67245483398438, 'logits/chosen': -3.137648105621338, 'logits/rejected': -1.6095255613327026, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.078075110912323, 'kl/avg_steps': 0.53125, 'epoch': 0.12} 12%|█████████████▌ | 81/681 [04:10<31:03, 3.11s/it] 12%|█████████████▋ | 82/681 [04:13<30:37, 3.07s/it] {'loss': 1.1567, 'grad_norm': 21.8071346282959, 'learning_rate': 4.995258321842611e-07, 'rewards/chosen': -0.15859441459178925, 'rewards/rejected': -0.4496329426765442, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.29103851318359375, 'logps/chosen': -85.08958435058594, 'logps/rejected': -100.36309814453125, 'logps/ref_chosen': -83.04598236083984, 'logps/ref_rejected': -94.52595520019531, 'logits/chosen': -3.389235019683838, 'logits/rejected': -2.2665112018585205, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.07766252756118774, 'kl/avg_steps': 0.53125, 'epoch': 0.12} 12%|█████████████▋ | 82/681 [04:13<30:37, 3.07s/it] 12%|█████████████▉ | 83/681 [04:16<31:18, 3.14s/it] {'loss': 1.1507, 'grad_norm': 21.79849624633789, 'learning_rate': 4.994435419342304e-07, 'rewards/chosen': -0.19215711951255798, 'rewards/rejected': -0.5176993012428284, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.3255422115325928, 'logps/chosen': -94.64817810058594, 'logps/rejected': -114.48917388916016, 'logps/ref_chosen': -92.17621612548828, 'logps/ref_rejected': -107.74464416503906, 'logits/chosen': -3.9037327766418457, 'logits/rejected': -2.2527427673339844, 'kl/p_epsilon_steps': 0.640625, 
'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.07725212723016739, 'kl/avg_steps': 0.28125, 'epoch': 0.12} 12%|█████████████▉ | 83/681 [04:16<31:18, 3.14s/it] 12%|██████████████ | 84/681 [04:19<31:30, 3.17s/it] {'loss': 1.2405, 'grad_norm': 22.36842918395996, 'learning_rate': 4.993546786148857e-07, 'rewards/chosen': -0.24369969964027405, 'rewards/rejected': -0.4419947564601898, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.19829508662223816, 'logps/chosen': -104.68441772460938, 'logps/rejected': -98.19878387451172, 'logps/ref_chosen': -101.5264892578125, 'logps/ref_rejected': -92.42608642578125, 'logits/chosen': -3.4791769981384277, 'logits/rejected': -2.393409490585327, 'kl/p_epsilon_steps': 0.65625, 'kl/n_epsilon_steps': 0.34375, 'kl/beta': 0.07703546434640884, 'kl/avg_steps': 0.3125, 'epoch': 0.12} 12%|██████████████ | 84/681 [04:19<31:30, 3.17s/it] 12%|██████████████▏ | 85/681 [04:22<31:17, 3.15s/it] {'loss': 1.1934, 'grad_norm': 20.850772857666016, 'learning_rate': 4.992592445678582e-07, 'rewards/chosen': -0.21086883544921875, 'rewards/rejected': -0.4798212945461273, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.2689524292945862, 'logps/chosen': -98.8546371459961, 'logps/rejected': -91.33653259277344, 'logps/ref_chosen': -96.12738037109375, 'logps/ref_rejected': -85.05519104003906, 'logits/chosen': -3.586297035217285, 'logits/rejected': -2.9060792922973633, 'kl/p_epsilon_steps': 0.625, 'kl/n_epsilon_steps': 0.375, 'kl/beta': 0.07679548114538193, 'kl/avg_steps': 0.25, 'epoch': 0.12} 12%|██████████████▏ | 85/681 [04:22<31:17, 3.15s/it] 13%|██████████████▍ | 86/681 [04:25<30:43, 3.10s/it] {'loss': 1.1471, 'grad_norm': 18.46939468383789, 'learning_rate': 4.991572423079235e-07, 'rewards/chosen': -0.17061173915863037, 'rewards/rejected': -0.5097446441650391, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.3391328752040863, 'logps/chosen': -83.9306640625, 'logps/rejected': -100.36227416992188, 'logps/ref_chosen': -81.70426940917969, 'logps/ref_rejected': 
-93.6554946899414, 'logits/chosen': -3.4410600662231445, 'logits/rejected': -2.2581920623779297, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.0766039714217186, 'kl/avg_steps': 0.40625, 'epoch': 0.13} 13%|██████████████▍ | 86/681 [04:25<30:43, 3.10s/it] 13%|██████████████▌ | 87/681 [04:28<30:41, 3.10s/it] {'loss': 1.137, 'grad_norm': 21.282371520996094, 'learning_rate': 4.990486745229364e-07, 'rewards/chosen': -0.19558626413345337, 'rewards/rejected': -0.5641908049583435, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.36860454082489014, 'logps/chosen': -95.24873352050781, 'logps/rejected': -110.37586212158203, 'logps/ref_chosen': -92.68596649169922, 'logps/ref_rejected': -102.91818237304688, 'logits/chosen': -4.2626729011535645, 'logits/rejected': -2.455996513366699, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.07629402726888657, 'kl/avg_steps': 0.53125, 'epoch': 0.13} 13%|██████████████▌ | 87/681 [04:29<30:41, 3.10s/it] 13%|██████████████▋ | 88/681 [04:32<30:43, 3.11s/it] {'loss': 1.1828, 'grad_norm': 19.734506607055664, 'learning_rate': 4.989335440737586e-07, 'rewards/chosen': -0.239786297082901, 'rewards/rejected': -0.5274724960327148, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.28768616914749146, 'logps/chosen': -103.91253662109375, 'logps/rejected': -120.1460189819336, 'logps/ref_chosen': -100.76298522949219, 'logps/ref_rejected': -113.15037536621094, 'logits/chosen': -3.0765509605407715, 'logits/rejected': -2.2642664909362793, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.07589085400104523, 'kl/avg_steps': 0.34375, 'epoch': 0.13} 13%|██████████████▋ | 88/681 [04:32<30:43, 3.11s/it] 13%|██████████████▉ | 89/681 [04:35<30:05, 3.05s/it] {'loss': 1.1584, 'grad_norm': 20.043725967407227, 'learning_rate': 4.988118539941847e-07, 'rewards/chosen': -0.19024960696697235, 'rewards/rejected': -0.5089821815490723, 'rewards/accuracies': 0.75, 'rewards/margins': 
0.3187325596809387, 'logps/chosen': -92.20272827148438, 'logps/rejected': -95.34589385986328, 'logps/ref_chosen': -89.69108581542969, 'logps/ref_rejected': -88.56832885742188, 'logits/chosen': -3.5043540000915527, 'logits/rejected': -2.6482343673706055, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.07563087344169617, 'kl/avg_steps': 0.34375, 'epoch': 0.13} 13%|██████████████▉ | 89/681 [04:35<30:05, 3.05s/it] 13%|███████████████ | 90/681 [04:38<30:13, 3.07s/it] {'loss': 1.0847, 'grad_norm': 22.877426147460938, 'learning_rate': 4.986836074908615e-07, 'rewards/chosen': -0.1728699803352356, 'rewards/rejected': -0.5974611043930054, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.4245910942554474, 'logps/chosen': -83.66722106933594, 'logps/rejected': -125.76116180419922, 'logps/ref_chosen': -81.38255310058594, 'logps/ref_rejected': -117.77714538574219, 'logits/chosen': -4.08919620513916, 'logits/rejected': -2.59071683883667, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.07537178695201874, 'kl/avg_steps': 0.53125, 'epoch': 0.13} 13%|███████████████ | 90/681 [04:38<30:13, 3.07s/it] 13%|███████████████▏ | 91/681 [04:41<31:05, 3.16s/it] {'loss': 1.1414, 'grad_norm': 21.175289154052734, 'learning_rate': 4.985488079432037e-07, 'rewards/chosen': -0.20845842361450195, 'rewards/rejected': -0.5571086406707764, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.3486502170562744, 'logps/chosen': -99.98194885253906, 'logps/rejected': -100.44671630859375, 'logps/ref_chosen': -97.22188568115234, 'logps/ref_rejected': -92.97674560546875, 'logits/chosen': -3.6395578384399414, 'logits/rejected': -2.8058741092681885, 'kl/p_epsilon_steps': 0.609375, 'kl/n_epsilon_steps': 0.390625, 'kl/beta': 0.07497348636388779, 'kl/avg_steps': 0.21875, 'epoch': 0.13} 13%|███████████████▏ | 91/681 [04:41<31:05, 3.16s/it] 14%|███████████████▍ | 92/681 [04:44<30:24, 3.10s/it] {'loss': 1.1718, 'grad_norm': 19.441743850708008, 'learning_rate': 
4.984074589033043e-07, 'rewards/chosen': -0.22265848517417908, 'rewards/rejected': -0.5193834900856018, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.29672500491142273, 'logps/chosen': -87.49899291992188, 'logps/rejected': -91.48877716064453, 'logps/ref_chosen': -84.5302734375, 'logps/ref_rejected': -84.5013198852539, 'logits/chosen': -4.114226341247559, 'logits/rejected': -3.050344944000244, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.07480984181165695, 'kl/avg_steps': 0.28125, 'epoch': 0.14} 14%|███████████████▍ | 92/681 [04:44<30:24, 3.10s/it] 14%|███████████████▌ | 93/681 [04:47<28:42, 2.93s/it] {'loss': 1.192, 'grad_norm': 20.867734909057617, 'learning_rate': 4.982595640958425e-07, 'rewards/chosen': -0.24706237018108368, 'rewards/rejected': -0.535523533821106, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.2884610891342163, 'logps/chosen': -93.55166625976562, 'logps/rejected': -91.32198333740234, 'logps/ref_chosen': -90.25043487548828, 'logps/ref_rejected': -84.09422302246094, 'logits/chosen': -3.9874632358551025, 'logits/rejected': -2.371685266494751, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.07460002601146698, 'kl/avg_steps': 0.28125, 'epoch': 0.14} 14%|███████████████▌ | 93/681 [04:47<28:42, 2.93s/it] 14%|███████████████▋ | 94/681 [04:50<29:36, 3.03s/it] {'loss': 1.0808, 'grad_norm': 22.021520614624023, 'learning_rate': 4.98105127417984e-07, 'rewards/chosen': -0.25266966223716736, 'rewards/rejected': -0.6970977783203125, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.44442811608314514, 'logps/chosen': -95.84749603271484, 'logps/rejected': -114.69227600097656, 'logps/ref_chosen': -92.4542236328125, 'logps/ref_rejected': -105.24728393554688, 'logits/chosen': -4.104582786560059, 'logits/rejected': -2.78466796875, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.07439080625772476, 'kl/avg_steps': 0.53125, 'epoch': 0.14} 14%|███████████████▋ | 94/681 
[04:50<29:36, 3.03s/it] 14%|███████████████▉ | 95/681 [04:53<29:10, 2.99s/it] {'loss': 1.1379, 'grad_norm': 18.556907653808594, 'learning_rate': 4.979441529392784e-07, 'rewards/chosen': -0.201766699552536, 'rewards/rejected': -0.5478700399398804, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.346103310585022, 'logps/chosen': -81.59260559082031, 'logps/rejected': -91.0462875366211, 'logps/ref_chosen': -78.87370300292969, 'logps/ref_rejected': -83.59121704101562, 'logits/chosen': -4.07137393951416, 'logits/rejected': -2.9404728412628174, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.07399769127368927, 'kl/avg_steps': 0.34375, 'epoch': 0.14} 14%|███████████████▉ | 95/681 [04:53<29:10, 2.99s/it] 14%|████████████████ | 96/681 [04:56<29:07, 2.99s/it] {'loss': 1.0757, 'grad_norm': 19.24999237060547, 'learning_rate': 4.977766449015534e-07, 'rewards/chosen': -0.18087545037269592, 'rewards/rejected': -0.6175007224082947, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.43662530183792114, 'logps/chosen': -109.04351806640625, 'logps/rejected': -110.20415496826172, 'logps/ref_chosen': -106.5921630859375, 'logps/ref_rejected': -101.76802062988281, 'logits/chosen': -3.7676031589508057, 'logits/rejected': -2.4514598846435547, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.07374419271945953, 'kl/avg_steps': 0.46875, 'epoch': 0.14} 14%|████████████████ | 96/681 [04:56<29:07, 2.99s/it] 14%|████████████████▏ | 97/681 [04:59<29:18, 3.01s/it] {'loss': 1.1006, 'grad_norm': 21.76212501525879, 'learning_rate': 4.976026077188012e-07, 'rewards/chosen': -0.26550614833831787, 'rewards/rejected': -0.6464561223983765, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.380950003862381, 'logps/chosen': -92.28569030761719, 'logps/rejected': -93.66389465332031, 'logps/ref_chosen': -88.67988586425781, 'logps/ref_rejected': -84.81229400634766, 'logits/chosen': -4.675760746002197, 'logits/rejected': -3.565962791442871, 
'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.07340013235807419, 'kl/avg_steps': 0.40625, 'epoch': 0.14} 14%|████████████████▏ | 97/681 [04:59<29:18, 3.01s/it] 14%|████████████████▍ | 98/681 [05:02<29:33, 3.04s/it] {'loss': 1.0488, 'grad_norm': 19.99883460998535, 'learning_rate': 4.974220459770639e-07, 'rewards/chosen': -0.2048511505126953, 'rewards/rejected': -0.6484156250953674, 'rewards/accuracies': 0.875, 'rewards/margins': 0.4435645043849945, 'logps/chosen': -95.04853057861328, 'logps/rejected': -110.45492553710938, 'logps/ref_chosen': -92.24249267578125, 'logps/ref_rejected': -101.51948547363281, 'logits/chosen': -3.876922607421875, 'logits/rejected': -2.7262768745422363, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0731031522154808, 'kl/avg_steps': 0.53125, 'epoch': 0.14} 14%|████████████████▍ | 98/681 [05:02<29:33, 3.04s/it] 15%|████████████████▌ | 99/681 [05:05<28:56, 2.98s/it] {'loss': 1.0479, 'grad_norm': 20.251489639282227, 'learning_rate': 4.972349644343108e-07, 'rewards/chosen': -0.15243776142597198, 'rewards/rejected': -0.6443277597427368, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.49188998341560364, 'logps/chosen': -74.27732849121094, 'logps/rejected': -100.81095123291016, 'logps/ref_chosen': -72.18464660644531, 'logps/ref_rejected': -91.88131713867188, 'logits/chosen': -4.384429454803467, 'logits/rejected': -3.071768045425415, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0727168396115303, 'kl/avg_steps': 0.53125, 'epoch': 0.15} 15%|████████████████▌ | 99/681 [05:05<28:56, 2.98s/it] 15%|████████████████▌ | 100/681 [05:08<29:12, 3.02s/it] {'loss': 1.1171, 'grad_norm': 17.54705047607422, 'learning_rate': 4.970413680203148e-07, 'rewards/chosen': -0.18433162569999695, 'rewards/rejected': -0.5253919363021851, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.3410602807998657, 'logps/chosen': -92.06529235839844, 'logps/rejected': -88.53147888183594, 
'logps/ref_chosen': -89.51382446289062, 'logps/ref_rejected': -81.21713256835938, 'logits/chosen': -4.1084113121032715, 'logits/rejected': -2.772047758102417, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.07233257591724396, 'kl/avg_steps': 0.40625, 'epoch': 0.15} 15%|████████████████▌ | 100/681 [05:08<29:12, 3.02s/it]
[INFO|trainer.py:4307] 2026-04-24 04:21:15,383 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-24 04:21:15,383 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-24 04:21:15,383 >> Batch size = 8
0%| | 0/73 [00:00> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-24 04:27:07,658 >> Num examples = 2339
[INFO|trainer.py:4312] 2026-04-24 04:27:07,658 >> Batch size = 8
0%| | 0/73 [00:00> Saving model checkpoint to /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200
[INFO|configuration_utils.py:419] 2026-04-24 04:28:10,716 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200/config.json
[INFO|configuration_utils.py:911] 2026-04-24 04:28:10,720 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-24 04:28:50,086 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-24 04:28:50,089 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-24 04:28:50,092 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200/special_tokens_map.json
30%|████████████████████████████████▍ | 201/681 [15:34<11:15:55, 84.49s/it] {'loss': 0.9275, 'grad_norm': 16.643579483032227, 'learning_rate': 4.455721242469372e-07, 'rewards/chosen': -0.8622275590896606, 'rewards/rejected': -1.6865991353988647, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.8243715763092041, 'logps/chosen': -127.57327270507812, 'logps/rejected': -158.6796875, 'logps/ref_chosen': -107.77249145507812, 'logps/ref_rejected': -119.79248046875, 'logits/chosen': -6.567927360534668, 'logits/rejected': -5.76133394241333, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.04368406534194946, 'kl/avg_steps': 0.5, 'epoch': 0.3} 30%|████████████████████████████████▍ | 201/681 [15:34<11:15:55, 84.49s/it] 30%|████████████████████████████████▉ | 202/681 [15:37<7:59:32, 60.07s/it] {'loss': 0.947, 'grad_norm': 15.5465669631958, 'learning_rate': 4.4477014363141755e-07, 'rewards/chosen': -0.8463935256004333, 'rewards/rejected': -1.568117618560791, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.7217241525650024, 'logps/chosen': -95.47408294677734, 'logps/rejected': -130.75558471679688, 'logps/ref_chosen': -75.97245025634766, 'logps/ref_rejected': -94.4599838256836, 'logits/chosen': -6.6011576652526855, 'logits/rejected': -5.9799299240112305, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.043466731905937195, 'kl/avg_steps': 0.5, 'epoch': 0.3} 30%|████████████████████████████████▉ | 202/681 [15:38<7:59:32, 60.07s/it] 30%|█████████████████████████████████ | 203/681 
[15:41<5:42:40, 43.01s/it] {'loss': 0.8472, 'grad_norm': 14.089229583740234, 'learning_rate': 4.439630306414758e-07, 'rewards/chosen': -0.7273061275482178, 'rewards/rejected': -1.5748505592346191, 'rewards/accuracies': 0.875, 'rewards/margins': 0.8475444316864014, 'logps/chosen': -111.82502746582031, 'logps/rejected': -129.54229736328125, 'logps/ref_chosen': -94.96715545654297, 'logps/ref_rejected': -92.8876724243164, 'logits/chosen': -6.752559661865234, 'logits/rejected': -5.879279136657715, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.04325047880411148, 'kl/avg_steps': 0.5625, 'epoch': 0.3} 30%|█████████████████████████████████ | 203/681 [15:41<5:42:40, 43.01s/it] 30%|█████████████████████████████████▎ | 204/681 [15:44<4:07:12, 31.10s/it] {'loss': 0.9609, 'grad_norm': 17.907960891723633, 'learning_rate': 4.431508065452897e-07, 'rewards/chosen': -0.9145333170890808, 'rewards/rejected': -1.6298069953918457, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.7152736186981201, 'logps/chosen': -137.6444091796875, 'logps/rejected': -131.52774047851562, 'logps/ref_chosen': -116.35719299316406, 'logps/ref_rejected': -93.39759063720703, 'logits/chosen': -6.687747955322266, 'logits/rejected': -5.982107162475586, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.043008554726839066, 'kl/avg_steps': 0.40625, 'epoch': 0.3} 30%|█████████████████████████████████▎ | 204/681 [15:44<4:07:12, 31.10s/it] 30%|█████████████████████████████████▍ | 205/681 [15:47<3:00:14, 22.72s/it] {'loss': 0.7808, 'grad_norm': 12.563446998596191, 'learning_rate': 4.4233349274571974e-07, 'rewards/chosen': -0.8293547630310059, 'rewards/rejected': -1.7540370225906372, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9246822595596313, 'logps/chosen': -108.34004211425781, 'logps/rejected': -133.15499877929688, 'logps/ref_chosen': -88.85934448242188, 'logps/ref_rejected': -91.8544921875, 'logits/chosen': -6.747779846191406, 'logits/rejected': 
-6.664907932281494, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.04283454269170761, 'kl/avg_steps': 0.75, 'epoch': 0.3} 30%|█████████████████████████████████▍ | 205/681 [15:47<3:00:14, 22.72s/it] 30%|█████████████████████████████████▌ | 206/681 [15:50<2:13:03, 16.81s/it] {'loss': 0.8253, 'grad_norm': 17.306621551513672, 'learning_rate': 4.415111107797445e-07, 'rewards/chosen': -0.8388286828994751, 'rewards/rejected': -1.6743868589401245, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.8355581164360046, 'logps/chosen': -96.40692138671875, 'logps/rejected': -142.6748809814453, 'logps/ref_chosen': -76.54634857177734, 'logps/ref_rejected': -102.95314025878906, 'logits/chosen': -6.731910705566406, 'logits/rejected': -5.665759086608887, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.042515672743320465, 'kl/avg_steps': 0.78125, 'epoch': 0.3} 30%|█████████████████████████████████▌ | 206/681 [15:50<2:13:03, 16.81s/it] 30%|█████████████████████████████████▋ | 207/681 [15:53<1:39:59, 12.66s/it] {'loss': 0.9093, 'grad_norm': 19.9998722076416, 'learning_rate': 4.4068368231789365e-07, 'rewards/chosen': -0.9145298600196838, 'rewards/rejected': -1.8097199201583862, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8951901197433472, 'logps/chosen': -107.92840576171875, 'logps/rejected': -133.81866455078125, 'logps/ref_chosen': -86.23164367675781, 'logps/ref_rejected': -90.65512084960938, 'logits/chosen': -7.154547214508057, 'logits/rejected': -6.694057464599609, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.04218609631061554, 'kl/avg_steps': 0.5, 'epoch': 0.3} 30%|█████████████████████████████████▋ | 207/681 [15:53<1:39:59, 12.66s/it] 31%|█████████████████████████████████▉ | 208/681 [15:56<1:17:01, 9.77s/it] {'loss': 0.7998, 'grad_norm': 15.674067497253418, 'learning_rate': 4.398512291636768e-07, 'rewards/chosen': -0.8975973725318909, 'rewards/rejected': -1.838052749633789, 'rewards/accuracies': 
0.921875, 'rewards/margins': 0.9404553771018982, 'logps/chosen': -115.6429672241211, 'logps/rejected': -145.07806396484375, 'logps/ref_chosen': -94.1595458984375, 'logps/ref_rejected': -100.96233367919922, 'logits/chosen': -7.054866313934326, 'logits/rejected': -6.2353129386901855, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.041976213455200195, 'kl/avg_steps': 0.65625, 'epoch': 0.31} 31%|█████████████████████████████████▉ | 208/681 [15:56<1:17:01, 9.77s/it] 31%|██████████████████████████████████ | 209/681 [15:59<1:00:23, 7.68s/it] {'loss': 0.8985, 'grad_norm': 15.056414604187012, 'learning_rate': 4.3901377325300857e-07, 'rewards/chosen': -0.7734057903289795, 'rewards/rejected': -1.5872459411621094, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8138402104377747, 'logps/chosen': -102.76752471923828, 'logps/rejected': -125.9391860961914, 'logps/ref_chosen': -84.17056274414062, 'logps/ref_rejected': -87.61955261230469, 'logits/chosen': -7.251114845275879, 'logits/rejected': -6.478391647338867, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.04170254245400429, 'kl/avg_steps': 0.5, 'epoch': 0.31} 31%|██████████████████████████████████ | 209/681 [15:59<1:00:23, 7.68s/it] 31%|██████████████████████████████████▊ | 210/681 [16:02<48:55, 6.23s/it] {'loss': 0.8996, 'grad_norm': 13.82249641418457, 'learning_rate': 4.381713366536311e-07, 'rewards/chosen': -0.8534192442893982, 'rewards/rejected': -1.6337693929672241, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7803501486778259, 'logps/chosen': -101.83394622802734, 'logps/rejected': -123.829833984375, 'logps/ref_chosen': -81.17117309570312, 'logps/ref_rejected': -84.17478942871094, 'logits/chosen': -7.367747783660889, 'logits/rejected': -6.7037177085876465, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.04149506613612175, 'kl/avg_steps': 0.625, 'epoch': 0.31} 31%|██████████████████████████████████▊ | 210/681 [16:02<48:55, 6.23s/it] 
31%|███████████████████████████████████ | 211/681 [16:05<40:39, 5.19s/it] {'loss': 0.9141, 'grad_norm': 16.27776527404785, 'learning_rate': 4.373239415645323e-07, 'rewards/chosen': -0.9316354990005493, 'rewards/rejected': -1.7147372961044312, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7831017971038818, 'logps/chosen': -131.35140991210938, 'logps/rejected': -135.39334106445312, 'logps/ref_chosen': -108.71271514892578, 'logps/ref_rejected': -93.55564880371094, 'logits/chosen': -6.783376216888428, 'logits/rejected': -6.2379889488220215, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.04123733192682266, 'kl/avg_steps': 0.46875, 'epoch': 0.31} 31%|███████████████████████████████████ | 211/681 [16:05<40:39, 5.19s/it] 31%|███████████████████████████████████▏ | 212/681 [16:08<35:53, 4.59s/it] {'loss': 0.7363, 'grad_norm': 13.126346588134766, 'learning_rate': 4.3647161031536086e-07, 'rewards/chosen': -0.6175893545150757, 'rewards/rejected': -1.706433653831482, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0888441801071167, 'logps/chosen': -113.453125, 'logps/rejected': -151.75265502929688, 'logps/ref_chosen': -98.36194610595703, 'logps/ref_rejected': -109.88999938964844, 'logits/chosen': -6.987194061279297, 'logits/rejected': -6.6716461181640625, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0410449355840683, 'kl/avg_steps': 0.53125, 'epoch': 0.31} 31%|███████████████████████████████████▏ | 212/681 [16:08<35:53, 4.59s/it] 31%|███████████████████████████████████▎ | 213/681 [16:11<32:38, 4.19s/it] {'loss': 0.8605, 'grad_norm': 19.004343032836914, 'learning_rate': 4.3561436536583774e-07, 'rewards/chosen': -0.7606232166290283, 'rewards/rejected': -1.6314427852630615, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8708195686340332, 'logps/chosen': -126.75518798828125, 'logps/rejected': -140.40818786621094, 'logps/ref_chosen': -108.05531311035156, 'logps/ref_rejected': -100.14414978027344, 
'logits/chosen': -6.994483947753906, 'logits/rejected': -6.408557891845703, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.04082803428173065, 'kl/avg_steps': 0.5625, 'epoch': 0.31} 31%|███████████████████████████████████▎ | 213/681 [16:11<32:38, 4.19s/it] 31%|███████████████████████████████████▌ | 214/681 [16:14<29:19, 3.77s/it] {'loss': 0.9841, 'grad_norm': 15.15538501739502, 'learning_rate': 4.3475222930516473e-07, 'rewards/chosen': -0.8206988573074341, 'rewards/rejected': -1.570618987083435, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.749920129776001, 'logps/chosen': -98.05276489257812, 'logps/rejected': -123.21199798583984, 'logps/ref_chosen': -77.80473327636719, 'logps/ref_rejected': -84.27578735351562, 'logits/chosen': -6.964527130126953, 'logits/rejected': -6.679478645324707, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.04059966281056404, 'kl/avg_steps': 0.4375, 'epoch': 0.31} 31%|███████████████████████████████████▌ | 214/681 [16:14<29:19, 3.77s/it] 32%|███████████████████████████████████▋ | 215/681 [16:17<27:47, 3.58s/it] {'loss': 0.7734, 'grad_norm': 13.474177360534668, 'learning_rate': 4.3388522485142885e-07, 'rewards/chosen': -0.6723982095718384, 'rewards/rejected': -1.663226842880249, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9908286333084106, 'logps/chosen': -101.85633087158203, 'logps/rejected': -138.35052490234375, 'logps/ref_chosen': -85.1138916015625, 'logps/ref_rejected': -96.86151885986328, 'logits/chosen': -7.422246932983398, 'logits/rejected': -6.752878189086914, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.04042281210422516, 'kl/avg_steps': 0.6875, 'epoch': 0.32} 32%|███████████████████████████████████▋ | 215/681 [16:17<27:47, 3.58s/it] 32%|███████████████████████████████████▊ | 216/681 [16:20<26:52, 3.47s/it] {'loss': 0.8502, 'grad_norm': 13.187141418457031, 'learning_rate': 4.330133748510036e-07, 'rewards/chosen': -0.7613588571548462, 
'rewards/rejected': -1.5942199230194092, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.832861065864563, 'logps/chosen': -99.6475830078125, 'logps/rejected': -121.41978454589844, 'logps/ref_chosen': -80.5923080444336, 'logps/ref_rejected': -81.41983795166016, 'logits/chosen': -6.856760501861572, 'logits/rejected': -6.676458835601807, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.04014680162072182, 'kl/avg_steps': 0.53125, 'epoch': 0.32} 32%|███████████████████████████████████▊ | 216/681 [16:20<26:52, 3.47s/it] 32%|████████████████████████████████████ | 217/681 [16:23<25:46, 3.33s/it] {'loss': 0.9483, 'grad_norm': 17.514558792114258, 'learning_rate': 4.3213670227794757e-07, 'rewards/chosen': -0.8944262266159058, 'rewards/rejected': -1.6273518800735474, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.7329256534576416, 'logps/chosen': -115.89637756347656, 'logps/rejected': -144.48422241210938, 'logps/ref_chosen': -93.47257995605469, 'logps/ref_rejected': -103.488525390625, 'logits/chosen': -7.025016784667969, 'logits/rejected': -6.148534774780273, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.03993465006351471, 'kl/avg_steps': 0.46875, 'epoch': 0.32} 32%|████████████████████████████████████ | 217/681 [16:23<25:46, 3.33s/it] 32%|████████████████████████████████████▏ | 218/681 [16:26<25:07, 3.26s/it] {'loss': 0.8984, 'grad_norm': 15.729193687438965, 'learning_rate': 4.3125523023339815e-07, 'rewards/chosen': -0.8316381573677063, 'rewards/rejected': -1.6821269989013672, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8504889607429504, 'logps/chosen': -110.00785827636719, 'logps/rejected': -136.88497924804688, 'logps/ref_chosen': -89.05883026123047, 'logps/ref_rejected': -94.30680847167969, 'logits/chosen': -7.055582046508789, 'logits/rejected': -6.02012300491333, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.03974832966923714, 'kl/avg_steps': 0.453125, 'epoch': 
0.32} 32%|████████████████████████████████████▏ | 218/681 [16:26<25:07, 3.26s/it] 32%|████████████████████████████████████▎ | 219/681 [16:30<25:00, 3.25s/it] {'loss': 0.9475, 'grad_norm': 19.58609390258789, 'learning_rate': 4.303689819449636e-07, 'rewards/chosen': -0.7558479309082031, 'rewards/rejected': -1.5479310750961304, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7920831441879272, 'logps/chosen': -120.11994171142578, 'logps/rejected': -131.8204803466797, 'logps/ref_chosen': -101.00733947753906, 'logps/ref_rejected': -92.46794128417969, 'logits/chosen': -7.102532386779785, 'logits/rejected': -6.606064796447754, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.039569031447172165, 'kl/avg_steps': 0.46875, 'epoch': 0.32} 32%|████████████████████████████████████▎ | 219/681 [16:30<25:00, 3.25s/it] 32%|████████████████████████████████████▌ | 220/681 [16:33<24:52, 3.24s/it] {'loss': 0.9298, 'grad_norm': 16.06678581237793, 'learning_rate': 4.2947798076611047e-07, 'rewards/chosen': -0.7732564210891724, 'rewards/rejected': -1.4757641553878784, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.702507734298706, 'logps/chosen': -115.22142028808594, 'logps/rejected': -132.008544921875, 'logps/ref_chosen': -95.53721618652344, 'logps/ref_rejected': -94.30703735351562, 'logits/chosen': -6.876911163330078, 'logits/rejected': -6.016563415527344, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.03938441723585129, 'kl/avg_steps': 0.46875, 'epoch': 0.32} 32%|████████████████████████████████████▌ | 220/681 [16:33<24:52, 3.24s/it] 32%|████████████████████████████████████▋ | 221/681 [16:36<24:47, 3.23s/it] {'loss': 0.7544, 'grad_norm': 14.65071964263916, 'learning_rate': 4.285822501755485e-07, 'rewards/chosen': -0.7377252578735352, 'rewards/rejected': -1.7821805477142334, 'rewards/accuracies': 0.875, 'rewards/margins': 1.0444551706314087, 'logps/chosen': -101.76954650878906, 'logps/rejected': -156.64260864257812, 
'logps/ref_chosen': -82.84486389160156, 'logps/ref_rejected': -110.81179809570312, 'logits/chosen': -6.791396141052246, 'logits/rejected': -5.949629783630371, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.039200663566589355, 'kl/avg_steps': 0.6875, 'epoch': 0.32} 32%|████████████████████████████████████▋ | 221/681 [16:36<24:47, 3.23s/it] 33%|████████████████████████████████████▊ | 222/681 [16:39<24:16, 3.17s/it] {'loss': 0.9338, 'grad_norm': 20.610990524291992, 'learning_rate': 4.276818137766118e-07, 'rewards/chosen': -0.9227837920188904, 'rewards/rejected': -1.7836687564849854, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.8608850240707397, 'logps/chosen': -118.88861083984375, 'logps/rejected': -152.92138671875, 'logps/ref_chosen': -95.14198303222656, 'logps/ref_rejected': -106.80441284179688, 'logits/chosen': -7.152358055114746, 'logits/rejected': -6.381607532501221, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.03893300145864487, 'kl/avg_steps': 0.53125, 'epoch': 0.33} 33%|████████████████████████████████████▊ | 222/681 [16:39<24:16, 3.17s/it] 33%|█████████████████████████████████████ | 223/681 [16:42<22:50, 2.99s/it] {'loss': 0.98, 'grad_norm': 15.403647422790527, 'learning_rate': 4.2677669529663686e-07, 'rewards/chosen': -0.8735491037368774, 'rewards/rejected': -1.556617021560669, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6830679178237915, 'logps/chosen': -108.16079711914062, 'logps/rejected': -126.88802337646484, 'logps/ref_chosen': -85.57511138916016, 'logps/ref_rejected': -86.45238494873047, 'logits/chosen': -7.370273590087891, 'logits/rejected': -6.7063889503479, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.03872726112604141, 'kl/avg_steps': 0.5, 'epoch': 0.33} 33%|█████████████████████████████████████ | 223/681 [16:42<22:50, 2.99s/it] 33%|█████████████████████████████████████▏ | 224/681 [16:44<21:49, 2.87s/it] {'loss': 0.867, 'grad_norm': 
15.034847259521484, 'learning_rate': 4.2586691858633747e-07, 'rewards/chosen': -0.820472002029419, 'rewards/rejected': -1.6667978763580322, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8463258147239685, 'logps/chosen': -104.06519317626953, 'logps/rejected': -126.1408462524414, 'logps/ref_chosen': -82.72380065917969, 'logps/ref_rejected': -82.59538269042969, 'logits/chosen': -7.058864593505859, 'logits/rejected': -6.6526618003845215, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.03853458911180496, 'kl/avg_steps': 0.5, 'epoch': 0.33} 33%|█████████████████████████████████████▏ | 224/681 [16:44<21:49, 2.87s/it] 33%|█████████████████████████████████████▎ | 225/681 [16:47<21:28, 2.83s/it] {'loss': 0.8512, 'grad_norm': 15.859506607055664, 'learning_rate': 4.249525076191759e-07, 'rewards/chosen': -0.8980911374092102, 'rewards/rejected': -1.83120858669281, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.9331176280975342, 'logps/chosen': -119.15274810791016, 'logps/rejected': -153.17977905273438, 'logps/ref_chosen': -95.67768096923828, 'logps/ref_rejected': -105.09687805175781, 'logits/chosen': -7.314513206481934, 'logits/rejected': -6.8365983963012695, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.03834287449717522, 'kl/avg_steps': 0.53125, 'epoch': 0.33} 33%|█████████████████████████████████████▎ | 225/681 [16:47<21:28, 2.83s/it] 33%|█████████████████████████████████████▌ | 226/681 [16:50<22:06, 2.91s/it] {'loss': 0.8457, 'grad_norm': 13.422561645507812, 'learning_rate': 4.2403348649073167e-07, 'rewards/chosen': -0.7408524751663208, 'rewards/rejected': -1.5925018787384033, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8516495227813721, 'logps/chosen': -112.96669006347656, 'logps/rejected': -128.74534606933594, 'logps/ref_chosen': -93.46092987060547, 'logps/ref_rejected': -86.7017593383789, 'logits/chosen': -7.025399684906006, 'logits/rejected': -6.684216499328613, 'kl/p_epsilon_steps': 0.8125, 
'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.03814025595784187, 'kl/avg_steps': 0.625, 'epoch': 0.33} 33%|█████████████████████████████████████▌ | 226/681 [16:50<22:06, 2.91s/it] 33%|█████████████████████████████████████▋ | 227/681 [16:53<22:11, 2.93s/it] {'loss': 0.906, 'grad_norm': 15.427087783813477, 'learning_rate': 4.2310987941806615e-07, 'rewards/chosen': -1.0186035633087158, 'rewards/rejected': -1.8651710748672485, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.8465676307678223, 'logps/chosen': -119.78524017333984, 'logps/rejected': -154.31568908691406, 'logps/ref_chosen': -92.81427001953125, 'logps/ref_rejected': -104.73692321777344, 'logits/chosen': -7.346820831298828, 'logits/rejected': -6.7165141105651855, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03790335729718208, 'kl/avg_steps': 0.5625, 'epoch': 0.33} 33%|█████████████████████████████████████▋ | 227/681 [16:53<22:11, 2.93s/it] 33%|█████████████████████████████████████▊ | 228/681 [16:56<22:31, 2.98s/it] {'loss': 0.962, 'grad_norm': 15.720854759216309, 'learning_rate': 4.2218171073908463e-07, 'rewards/chosen': -0.9233343601226807, 'rewards/rejected': -1.5975819826126099, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6742476224899292, 'logps/chosen': -118.57168579101562, 'logps/rejected': -138.6416015625, 'logps/ref_chosen': -94.03712463378906, 'logps/ref_rejected': -96.02151489257812, 'logits/chosen': -7.082067489624023, 'logits/rejected': -6.396081924438477, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.03769134357571602, 'kl/avg_steps': 0.40625, 'epoch': 0.33} 33%|█████████████████████████████████████▊ | 228/681 [16:56<22:31, 2.98s/it] 34%|█████████████████████████████████████▉ | 229/681 [16:59<22:57, 3.05s/it] {'loss': 0.7638, 'grad_norm': 13.65562915802002, 'learning_rate': 4.212490049118951e-07, 'rewards/chosen': -0.7962503433227539, 'rewards/rejected': -1.879028558731079, 'rewards/accuracies': 0.859375, 'rewards/margins': 
1.0827782154083252, 'logps/chosen': -116.85358428955078, 'logps/rejected': -139.57473754882812, 'logps/ref_chosen': -95.57766723632812, 'logps/ref_rejected': -89.17379760742188, 'logits/chosen': -7.236542224884033, 'logits/rejected': -6.673727989196777, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03753884509205818, 'kl/avg_steps': 0.5625, 'epoch': 0.34} 34%|█████████████████████████████████████▉ | 229/681 [16:59<22:57, 3.05s/it] 34%|██████████████████████████████████████▏ | 230/681 [17:02<22:19, 2.97s/it] {'loss': 0.7439, 'grad_norm': 15.202421188354492, 'learning_rate': 4.203117865141635e-07, 'rewards/chosen': -0.8491979241371155, 'rewards/rejected': -1.8918962478637695, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.0426983833312988, 'logps/chosen': -86.61280822753906, 'logps/rejected': -143.0418243408203, 'logps/ref_chosen': -63.713626861572266, 'logps/ref_rejected': -91.9087142944336, 'logits/chosen': -7.399197578430176, 'logits/rejected': -6.820221900939941, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.037328869104385376, 'kl/avg_steps': 0.78125, 'epoch': 0.34} 34%|██████████████████████████████████████▏ | 230/681 [17:02<22:19, 2.97s/it] 34%|██████████████████████████████████████▎ | 231/681 [17:05<22:32, 3.01s/it] {'loss': 0.8274, 'grad_norm': 13.081562995910645, 'learning_rate': 4.1937008024246625e-07, 'rewards/chosen': -0.957029402256012, 'rewards/rejected': -1.8156838417053223, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8586543798446655, 'logps/chosen': -121.447998046875, 'logps/rejected': -130.36236572265625, 'logps/ref_chosen': -95.45567321777344, 'logps/ref_rejected': -80.95568084716797, 'logits/chosen': -6.668990135192871, 'logits/rejected': -6.165530681610107, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.03703949600458145, 'kl/avg_steps': 0.671875, 'epoch': 0.34} 34%|██████████████████████████████████████▎ | 231/681 [17:05<22:32, 3.01s/it] 
34%|██████████████████████████████████████▍ | 232/681 [17:08<22:40, 3.03s/it] {'loss': 0.9199, 'grad_norm': 15.00186538696289, 'learning_rate': 4.1842391091163933e-07, 'rewards/chosen': -0.9383091926574707, 'rewards/rejected': -1.7088375091552734, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7705283164978027, 'logps/chosen': -122.46676635742188, 'logps/rejected': -136.52182006835938, 'logps/ref_chosen': -96.89726257324219, 'logps/ref_rejected': -89.76461791992188, 'logits/chosen': -6.989046096801758, 'logits/rejected': -6.184296607971191, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.03679230064153671, 'kl/avg_steps': 0.5, 'epoch': 0.34} 34%|██████████████████████████████████████▍ | 232/681 [17:08<22:40, 3.03s/it] 34%|██████████████████████████████████████▋ | 233/681 [17:11<23:07, 3.10s/it] {'loss': 0.7772, 'grad_norm': 14.02349853515625, 'learning_rate': 4.174733034541245e-07, 'rewards/chosen': -1.0092403888702393, 'rewards/rejected': -2.045924663543701, 'rewards/accuracies': 0.828125, 'rewards/margins': 1.036684513092041, 'logps/chosen': -116.73382568359375, 'logps/rejected': -169.069580078125, 'logps/ref_chosen': -89.05032348632812, 'logps/ref_rejected': -112.75917053222656, 'logits/chosen': -7.088686943054199, 'logits/rejected': -6.224997520446777, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.03660925105214119, 'kl/avg_steps': 0.625, 'epoch': 0.34} 34%|██████████████████████████████████████▋ | 233/681 [17:12<23:07, 3.10s/it] 34%|██████████████████████████████████████▊ | 234/681 [17:15<23:39, 3.18s/it] {'loss': 0.7907, 'grad_norm': 13.894185066223145, 'learning_rate': 4.165182829193126e-07, 'rewards/chosen': -0.9510073661804199, 'rewards/rejected': -2.022789239883423, 'rewards/accuracies': 0.875, 'rewards/margins': 1.071781873703003, 'logps/chosen': -100.55630493164062, 'logps/rejected': -162.41140747070312, 'logps/ref_chosen': -74.318359375, 'logps/ref_rejected': -106.38758850097656, 'logits/chosen': 
-7.294958114624023, 'logits/rejected': -6.297677040100098, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.03638186678290367, 'kl/avg_steps': 0.625, 'epoch': 0.34} 34%|██████████████████████████████████████▊ | 234/681 [17:15<23:39, 3.18s/it] 35%|██████████████████████████████████████▉ | 235/681 [17:18<23:15, 3.13s/it] {'loss': 0.8641, 'grad_norm': 13.863997459411621, 'learning_rate': 4.1555887447288255e-07, 'rewards/chosen': -1.1314321756362915, 'rewards/rejected': -1.9421896934509277, 'rewards/accuracies': 0.875, 'rewards/margins': 0.810757577419281, 'logps/chosen': -129.61538696289062, 'logps/rejected': -151.29678344726562, 'logps/ref_chosen': -98.217041015625, 'logps/ref_rejected': -97.24677276611328, 'logits/chosen': -6.75037956237793, 'logits/rejected': -6.180291175842285, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.03615589067339897, 'kl/avg_steps': 0.53125, 'epoch': 0.35} 35%|██████████████████████████████████████▉ | 235/681 [17:18<23:15, 3.13s/it] 35%|███████████████████████████████████████▏ | 236/681 [17:21<23:37, 3.19s/it] {'loss': 0.8264, 'grad_norm': 14.032482147216797, 'learning_rate': 4.1459510339613946e-07, 'rewards/chosen': -0.9487680196762085, 'rewards/rejected': -1.9531795978546143, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.0044116973876953, 'logps/chosen': -105.3929672241211, 'logps/rejected': -163.83856201171875, 'logps/ref_chosen': -78.83773040771484, 'logps/ref_rejected': -109.06343078613281, 'logits/chosen': -6.762874126434326, 'logits/rejected': -6.526078224182129, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.03596482798457146, 'kl/avg_steps': 0.71875, 'epoch': 0.35} 35%|███████████████████████████████████████▏ | 236/681 [17:21<23:37, 3.19s/it] 35%|███████████████████████████████████████▎ | 237/681 [17:24<23:39, 3.20s/it] {'loss': 0.9932, 'grad_norm': 15.7308931350708, 'learning_rate': 4.136269950853473e-07, 'rewards/chosen': 
-1.0902290344238281, 'rewards/rejected': -1.89679753780365, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.8065685033798218, 'logps/chosen': -115.78558349609375, 'logps/rejected': -153.35638427734375, 'logps/ref_chosen': -85.21128845214844, 'logps/ref_rejected': -99.90999603271484, 'logits/chosen': -7.501216888427734, 'logits/rejected': -6.719086647033691, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.03570817783474922, 'kl/avg_steps': 0.46875, 'epoch': 0.35} 35%|███████████████████████████████████████▎ | 237/681 [17:24<23:39, 3.20s/it] 35%|███████████████████████████████████████▍ | 238/681 [17:28<23:39, 3.20s/it] {'loss': 0.8772, 'grad_norm': 13.518217086791992, 'learning_rate': 4.126545750510605e-07, 'rewards/chosen': -1.190969705581665, 'rewards/rejected': -2.0659611225128174, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.8749915361404419, 'logps/chosen': -112.39120483398438, 'logps/rejected': -153.98170471191406, 'logps/ref_chosen': -78.73123168945312, 'logps/ref_rejected': -95.41840362548828, 'logits/chosen': -6.873219013214111, 'logits/rejected': -5.906381607055664, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.03554157540202141, 'kl/avg_steps': 0.625, 'epoch': 0.35} 35%|███████████████████████████████████████▍ | 238/681 [17:28<23:39, 3.20s/it] 35%|███████████████████████████████████████▋ | 239/681 [17:30<22:38, 3.07s/it] {'loss': 0.9094, 'grad_norm': 14.817635536193848, 'learning_rate': 4.116778689174514e-07, 'rewards/chosen': -1.117043375968933, 'rewards/rejected': -1.9282580614089966, 'rewards/accuracies': 0.75, 'rewards/margins': 0.8112146854400635, 'logps/chosen': -124.30506134033203, 'logps/rejected': -155.46278381347656, 'logps/ref_chosen': -92.60093688964844, 'logps/ref_rejected': -100.51769256591797, 'logits/chosen': -7.087869644165039, 'logits/rejected': -6.191803932189941, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.035320818424224854, 'kl/avg_steps': 
0.5, 'epoch': 0.35} 35%|███████████████████████████████████████▋ | 239/681 [17:30<22:38, 3.07s/it] 35%|███████████████████████████████████████▊ | 240/681 [17:33<22:33, 3.07s/it] {'loss': 0.9385, 'grad_norm': 17.138147354125977, 'learning_rate': 4.106969024216348e-07, 'rewards/chosen': -1.1331913471221924, 'rewards/rejected': -1.8612143993377686, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.7280229926109314, 'logps/chosen': -118.43206787109375, 'logps/rejected': -133.68617248535156, 'logps/ref_chosen': -86.15977478027344, 'logps/ref_rejected': -80.45567321777344, 'logits/chosen': -7.254019260406494, 'logits/rejected': -7.064877986907959, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'kl/beta': 0.03514509275555611, 'kl/avg_steps': 0.375, 'epoch': 0.35} 35%|███████████████████████████████████████▊ | 240/681 [17:33<22:33, 3.07s/it] 35%|███████████████████████████████████████▉ | 241/681 [17:36<22:04, 3.01s/it] {'loss': 0.6851, 'grad_norm': 13.10658073425293, 'learning_rate': 4.097117014129903e-07, 'rewards/chosen': -0.9694840312004089, 'rewards/rejected': -2.225541353225708, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.2560572624206543, 'logps/chosen': -128.86328125, 'logps/rejected': -158.0999298095703, 'logps/ref_chosen': -101.04594421386719, 'logps/ref_rejected': -94.04934692382812, 'logits/chosen': -7.0530242919921875, 'logits/rejected': -6.54688835144043, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.03501379117369652, 'kl/avg_steps': 0.6875, 'epoch': 0.35} 35%|███████████████████████████████████████▉ | 241/681 [17:36<22:04, 3.01s/it] 36%|████████████████████████████████████████▏ | 242/681 [17:39<22:04, 3.02s/it] {'loss': 0.8351, 'grad_norm': 14.712688446044922, 'learning_rate': 4.087222918524807e-07, 'rewards/chosen': -1.184476613998413, 'rewards/rejected': -2.078831434249878, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8943547606468201, 'logps/chosen': -129.89794921875, 'logps/rejected': 
-150.88523864746094, 'logps/ref_chosen': -95.67266082763672, 'logps/ref_rejected': -90.65454864501953, 'logits/chosen': -7.671010494232178, 'logits/rejected': -7.0201215744018555, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.034774716943502426, 'kl/avg_steps': 0.6875, 'epoch': 0.36} 36%|████████████████████████████████████████▏ | 242/681 [17:39<22:04, 3.02s/it] 36%|████████████████████████████████████████▎ | 243/681 [17:42<22:11, 3.04s/it] {'loss': 0.8947, 'grad_norm': 14.254800796508789, 'learning_rate': 4.07728699811968e-07, 'rewards/chosen': -1.1293984651565552, 'rewards/rejected': -2.0041677951812744, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.8747692704200745, 'logps/chosen': -130.82611083984375, 'logps/rejected': -141.62124633789062, 'logps/ref_chosen': -98.03140258789062, 'logps/ref_rejected': -83.18806457519531, 'logits/chosen': -7.436939239501953, 'logits/rejected': -7.066817283630371, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03453727439045906, 'kl/avg_steps': 0.5625, 'epoch': 0.36} 36%|████████████████████████████████████████▎ | 243/681 [17:42<22:11, 3.04s/it] 36%|████████████████████████████████████████▍ | 244/681 [17:46<22:17, 3.06s/it] {'loss': 0.8616, 'grad_norm': 14.085189819335938, 'learning_rate': 4.067309514735267e-07, 'rewards/chosen': -1.187152624130249, 'rewards/rejected': -2.082256317138672, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8951038122177124, 'logps/chosen': -123.65585327148438, 'logps/rejected': -163.68592834472656, 'logps/ref_chosen': -88.89391326904297, 'logps/ref_rejected': -102.57278442382812, 'logits/chosen': -7.297283172607422, 'logits/rejected': -6.48216438293457, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.03434408828616142, 'kl/avg_steps': 0.65625, 'epoch': 0.36} 36%|████████████████████████████████████████▍ | 244/681 [17:46<22:17, 3.06s/it] 36%|████████████████████████████████████████▋ | 245/681 [17:49<22:43, 
3.13s/it] {'loss': 0.777, 'grad_norm': 13.635170936584473, 'learning_rate': 4.057290731287531e-07, 'rewards/chosen': -1.0023081302642822, 'rewards/rejected': -1.9891510009765625, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.9868428707122803, 'logps/chosen': -133.73660278320312, 'logps/rejected': -151.42526245117188, 'logps/ref_chosen': -104.19400024414062, 'logps/ref_rejected': -92.65645599365234, 'logits/chosen': -7.1143999099731445, 'logits/rejected': -6.517737865447998, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.03412017226219177, 'kl/avg_steps': 0.78125, 'epoch': 0.36} 36%|████████████████████████████████████████▋ | 245/681 [17:49<22:43, 3.13s/it] 36%|████████████████████████████████████████▊ | 246/681 [17:52<22:35, 3.12s/it] {'loss': 0.9225, 'grad_norm': 15.270983695983887, 'learning_rate': 4.047230911780736e-07, 'rewards/chosen': -1.277728796005249, 'rewards/rejected': -2.0303685665130615, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7526397705078125, 'logps/chosen': -141.06942749023438, 'logps/rejected': -151.3291015625, 'logps/ref_chosen': -103.21904754638672, 'logps/ref_rejected': -90.9922103881836, 'logits/chosen': -6.8056464195251465, 'logits/rejected': -6.480011940002441, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.03385567665100098, 'kl/avg_steps': 0.53125, 'epoch': 0.36} 36%|████████████████████████████████████████▊ | 246/681 [17:52<22:35, 3.12s/it] 36%|████████████████████████████████████████▉ | 247/681 [17:55<22:10, 3.07s/it] {'loss': 0.7459, 'grad_norm': 12.317371368408203, 'learning_rate': 4.0371303213004814e-07, 'rewards/chosen': -1.1581556797027588, 'rewards/rejected': -2.265951156616211, 'rewards/accuracies': 0.875, 'rewards/margins': 1.1077954769134521, 'logps/chosen': -121.5998306274414, 'logps/rejected': -179.18167114257812, 'logps/ref_chosen': -86.99436950683594, 'logps/ref_rejected': -111.33802795410156, 'logits/chosen': -7.767994403839111, 'logits/rejected': 
-7.044826507568359, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.03367676958441734, 'kl/avg_steps': 0.78125, 'epoch': 0.36} 36%|████████████████████████████████████████▉ | 247/681 [17:55<22:10, 3.07s/it] 36%|█████████████████████████████████████████▏ | 248/681 [17:58<22:03, 3.06s/it] {'loss': 0.9066, 'grad_norm': 15.562211990356445, 'learning_rate': 4.0269892260067197e-07, 'rewards/chosen': -1.2984592914581299, 'rewards/rejected': -2.0758614540100098, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7774021625518799, 'logps/chosen': -113.81005859375, 'logps/rejected': -160.82879638671875, 'logps/ref_chosen': -74.7855224609375, 'logps/ref_rejected': -98.27689361572266, 'logits/chosen': -7.3634748458862305, 'logits/rejected': -6.64450740814209, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03341570869088173, 'kl/avg_steps': 0.5625, 'epoch': 0.36} 36%|█████████████████████████████████████████▏ | 248/681 [17:58<22:03, 3.06s/it] 37%|█████████████████████████████████████████▎ | 249/681 [18:01<21:29, 2.99s/it] {'loss': 0.9861, 'grad_norm': 17.970985412597656, 'learning_rate': 4.0168078931267426e-07, 'rewards/chosen': -1.3996614217758179, 'rewards/rejected': -2.111713409423828, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.7120518684387207, 'logps/chosen': -137.87208557128906, 'logps/rejected': -151.0142059326172, 'logps/ref_chosen': -95.70379638671875, 'logps/ref_rejected': -87.14646911621094, 'logits/chosen': -7.139558792114258, 'logits/rejected': -6.51041316986084, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.0332287959754467, 'kl/avg_steps': 0.4375, 'epoch': 0.37} 37%|█████████████████████████████████████████▎ | 249/681 [18:01<21:29, 2.99s/it] 37%|█████████████████████████████████████████▍ | 250/681 [18:04<21:35, 3.01s/it] {'loss': 0.9481, 'grad_norm': 18.755630493164062, 'learning_rate': 4.006586590948141e-07, 'rewards/chosen': -1.287900686264038, 'rewards/rejected': 
-2.0340938568115234, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.7461932897567749, 'logps/chosen': -153.10398864746094, 'logps/rejected': -142.95217895507812, 'logps/ref_chosen': -114.05220794677734, 'logps/ref_rejected': -81.08768463134766, 'logits/chosen': -7.073131561279297, 'logits/rejected': -6.734502792358398, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03308405354619026, 'kl/avg_steps': 0.5625, 'epoch': 0.37} 37%|█████████████████████████████████████████▍ | 250/681 [18:04<21:35, 3.01s/it] 37%|█████████████████████████████████████████▋ | 251/681 [18:07<21:20, 2.98s/it] {'loss': 0.9872, 'grad_norm': 15.883668899536133, 'learning_rate': 3.9963255888117325e-07, 'rewards/chosen': -1.2341444492340088, 'rewards/rejected': -1.9472013711929321, 'rewards/accuracies': 0.75, 'rewards/margins': 0.7130569219589233, 'logps/chosen': -135.24278259277344, 'logps/rejected': -142.99111938476562, 'logps/ref_chosen': -97.71128845214844, 'logps/ref_rejected': -83.52742004394531, 'logits/chosen': -7.458626747131348, 'logits/rejected': -6.964792251586914, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.03289899602532387, 'kl/avg_steps': 0.28125, 'epoch': 0.37} 37%|█████████████████████████████████████████▋ | 251/681 [18:07<21:20, 2.98s/it] 37%|█████████████████████████████████████████▊ | 252/681 [18:10<21:50, 3.05s/it] {'loss': 0.93, 'grad_norm': 16.21406364440918, 'learning_rate': 3.9860251571044666e-07, 'rewards/chosen': -1.2076178789138794, 'rewards/rejected': -1.948510766029358, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7408928871154785, 'logps/chosen': -145.88766479492188, 'logps/rejected': -151.3180389404297, 'logps/ref_chosen': -108.9861068725586, 'logps/ref_rejected': -91.56424713134766, 'logits/chosen': -7.022328853607178, 'logits/rejected': -6.755680084228516, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.03280672803521156, 'kl/avg_steps': 0.5, 'epoch': 0.37} 
37%|█████████████████████████████████████████▊ | 252/681 [18:10<21:50, 3.05s/it] 37%|█████████████████████████████████████████▉ | 253/681 [18:13<22:00, 3.09s/it] {'loss': 0.9468, 'grad_norm': 15.019343376159668, 'learning_rate': 3.9756855672522986e-07, 'rewards/chosen': -1.3202366828918457, 'rewards/rejected': -2.0234005451202393, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7031639218330383, 'logps/chosen': -140.8556365966797, 'logps/rejected': -168.1231689453125, 'logps/ref_chosen': -100.21630859375, 'logps/ref_rejected': -105.67670440673828, 'logits/chosen': -6.749828815460205, 'logits/rejected': -6.487143516540527, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.03264351189136505, 'kl/avg_steps': 0.625, 'epoch': 0.37} 37%|█████████████████████████████████████████▉ | 253/681 [18:13<22:00, 3.09s/it] 37%|██████████████████████████████████████████▏ | 254/681 [18:16<22:10, 3.12s/it] {'loss': 0.971, 'grad_norm': 14.38194751739502, 'learning_rate': 3.965307091713037e-07, 'rewards/chosen': -1.2845338582992554, 'rewards/rejected': -1.9614899158477783, 'rewards/accuracies': 0.75, 'rewards/margins': 0.676956057548523, 'logps/chosen': -138.41323852539062, 'logps/rejected': -154.52410888671875, 'logps/ref_chosen': -98.73518371582031, 'logps/ref_rejected': -93.73825073242188, 'logits/chosen': -7.187813758850098, 'logits/rejected': -6.099554061889648, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.3125, 'kl/beta': 0.032440755516290665, 'kl/avg_steps': 0.359375, 'epoch': 0.37} 37%|██████████████████████████████████████████▏ | 254/681 [18:16<22:10, 3.12s/it] 37%|██████████████████████████████████████████▎ | 255/681 [18:19<21:41, 3.06s/it] {'loss': 0.8306, 'grad_norm': 14.393223762512207, 'learning_rate': 3.954890003969163e-07, 'rewards/chosen': -1.3301056623458862, 'rewards/rejected': -2.233675003051758, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9035694599151611, 'logps/chosen': -131.79598999023438, 'logps/rejected': 
-166.75144958496094, 'logps/ref_chosen': -90.382568359375, 'logps/ref_rejected': -97.07625579833984, 'logits/chosen': -7.347094535827637, 'logits/rejected': -6.871660232543945, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.03232458978891373, 'kl/avg_steps': 0.71875, 'epoch': 0.37} 37%|██████████████████████████████████████████▎ | 255/681 [18:19<21:41, 3.06s/it] 38%|██████████████████████████████████████████▍ | 256/681 [18:22<21:20, 3.01s/it] {'loss': 0.8817, 'grad_norm': 14.8760986328125, 'learning_rate': 3.944434578520628e-07, 'rewards/chosen': -1.1998153924942017, 'rewards/rejected': -2.102677822113037, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.9028624296188354, 'logps/chosen': -126.28778076171875, 'logps/rejected': -164.5001220703125, 'logps/ref_chosen': -88.7528076171875, 'logps/ref_rejected': -98.49382781982422, 'logits/chosen': -7.483323097229004, 'logits/rejected': -6.505797386169434, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.03209391236305237, 'kl/avg_steps': 0.65625, 'epoch': 0.38} 38%|██████████████████████████████████████████▍ | 256/681 [18:22<21:20, 3.01s/it] 38%|██████████████████████████████████████████▋ | 257/681 [18:25<21:30, 3.04s/it] {'loss': 0.8809, 'grad_norm': 15.880992889404297, 'learning_rate': 3.933941090877615e-07, 'rewards/chosen': -1.2686723470687866, 'rewards/rejected': -2.1270086765289307, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8583362698554993, 'logps/chosen': -122.74734497070312, 'logps/rejected': -153.03878784179688, 'logps/ref_chosen': -82.80352783203125, 'logps/ref_rejected': -85.8677978515625, 'logits/chosen': -7.26761531829834, 'logits/rejected': -6.992372512817383, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03188467025756836, 'kl/avg_steps': 0.5625, 'epoch': 0.38} 38%|██████████████████████████████████████████▋ | 257/681 [18:25<21:30, 3.04s/it] 38%|██████████████████████████████████████████▊ | 258/681 
[18:28<21:10, 3.00s/it] {'loss': 0.8631, 'grad_norm': 14.805852890014648, 'learning_rate': 3.923409817553284e-07, 'rewards/chosen': -1.2512186765670776, 'rewards/rejected': -2.0762269496917725, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8250081539154053, 'logps/chosen': -129.7963409423828, 'logps/rejected': -169.39410400390625, 'logps/ref_chosen': -90.187744140625, 'logps/ref_rejected': -103.47068786621094, 'logits/chosen': -7.284233093261719, 'logits/rejected': -6.5939788818359375, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.031706321984529495, 'kl/avg_steps': 0.5625, 'epoch': 0.38} 38%|██████████████████████████████████████████▊ | 258/681 [18:28<21:10, 3.00s/it] 38%|██████████████████████████████████████████▉ | 259/681 [18:31<21:05, 3.00s/it] {'loss': 0.9968, 'grad_norm': 15.601424217224121, 'learning_rate': 3.9128410360564793e-07, 'rewards/chosen': -1.3569505214691162, 'rewards/rejected': -2.0308425426483154, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6738921403884888, 'logps/chosen': -133.95770263671875, 'logps/rejected': -159.41954040527344, 'logps/ref_chosen': -90.77254486083984, 'logps/ref_rejected': -94.58816528320312, 'logits/chosen': -6.8948163986206055, 'logits/rejected': -6.679216384887695, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.03152897208929062, 'kl/avg_steps': 0.5625, 'epoch': 0.38} 38%|██████████████████████████████████████████▉ | 259/681 [18:31<21:05, 3.00s/it] 38%|███████████████████████████████████████████▏ | 260/681 [18:35<21:53, 3.12s/it] {'loss': 1.0562, 'grad_norm': 16.810916900634766, 'learning_rate': 3.9022350248844246e-07, 'rewards/chosen': -1.3429489135742188, 'rewards/rejected': -1.9885942935943604, 'rewards/accuracies': 0.75, 'rewards/margins': 0.6456452012062073, 'logps/chosen': -118.55705261230469, 'logps/rejected': -164.71981811523438, 'logps/ref_chosen': -75.59269714355469, 'logps/ref_rejected': -100.84554290771484, 'logits/chosen': 
-7.622888088226318, 'logits/rejected': -6.941500186920166, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.03135261312127113, 'kl/avg_steps': 0.5, 'epoch': 0.38} 38%|███████████████████████████████████████████▏ | 260/681 [18:35<21:53, 3.12s/it] 38%|███████████████████████████████████████████▎ | 261/681 [18:37<20:45, 2.96s/it] {'loss': 0.8049, 'grad_norm': 13.099295616149902, 'learning_rate': 3.891592063515376e-07, 'rewards/chosen': -1.0897305011749268, 'rewards/rejected': -2.0056087970733643, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9158782958984375, 'logps/chosen': -129.3551025390625, 'logps/rejected': -158.41415405273438, 'logps/ref_chosen': -94.25491333007812, 'logps/ref_rejected': -93.65699768066406, 'logits/chosen': -7.312009811401367, 'logits/rejected': -7.067169666290283, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.031196629628539085, 'kl/avg_steps': 0.65625, 'epoch': 0.38} 38%|███████████████████████████████████████████▎ | 261/681 [18:37<20:45, 2.96s/it] 38%|███████████████████████████████████████████▍ | 262/681 [18:40<20:24, 2.92s/it] {'loss': 0.8587, 'grad_norm': 13.834671020507812, 'learning_rate': 3.880912432401264e-07, 'rewards/chosen': -1.1732381582260132, 'rewards/rejected': -2.021843194961548, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8486050367355347, 'logps/chosen': -123.33172607421875, 'logps/rejected': -156.57522583007812, 'logps/ref_chosen': -85.26730346679688, 'logps/ref_rejected': -90.82609558105469, 'logits/chosen': -7.694709777832031, 'logits/rejected': -6.795375823974609, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.03099323809146881, 'kl/avg_steps': 0.6875, 'epoch': 0.38} 38%|███████████████████████████████████████████▍ | 262/681 [18:40<20:24, 2.92s/it] 39%|███████████████████████████████████████████▋ | 263/681 [18:43<20:22, 2.93s/it] {'loss': 0.7435, 'grad_norm': 13.023946762084961, 'learning_rate': 3.870196412960302e-07, 
'rewards/chosen': -0.9609699845314026, 'rewards/rejected': -2.075908660888672, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.114938497543335, 'logps/chosen': -127.14653015136719, 'logps/rejected': -169.84298706054688, 'logps/ref_chosen': -95.75790405273438, 'logps/ref_rejected': -101.83377075195312, 'logits/chosen': -6.901250839233398, 'logits/rejected': -6.722340106964111, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.030781613662838936, 'kl/avg_steps': 0.75, 'epoch': 0.39} 39%|███████████████████████████████████████████▋ | 263/681 [18:43<20:22, 2.93s/it] 39%|███████████████████████████████████████████▊ | 264/681 [18:46<20:53, 3.01s/it] {'loss': 0.8463, 'grad_norm': 13.047411918640137, 'learning_rate': 3.8594442875695665e-07, 'rewards/chosen': -1.0397236347198486, 'rewards/rejected': -1.9546151161193848, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9148914217948914, 'logps/chosen': -124.76637268066406, 'logps/rejected': -164.76083374023438, 'logps/ref_chosen': -90.6226577758789, 'logps/ref_rejected': -100.32554626464844, 'logits/chosen': -7.175429344177246, 'logits/rejected': -6.558530330657959, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.030552471056580544, 'kl/avg_steps': 0.5625, 'epoch': 0.39} 39%|███████████████████████████████████████████▊ | 264/681 [18:46<20:53, 3.01s/it] 39%|███████████████████████████████████████████▉ | 265/681 [18:49<20:48, 3.00s/it] {'loss': 0.8792, 'grad_norm': 12.617591857910156, 'learning_rate': 3.848656339557562e-07, 'rewards/chosen': -1.1058483123779297, 'rewards/rejected': -1.9895697832107544, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.8837213516235352, 'logps/chosen': -128.9173583984375, 'logps/rejected': -160.58474731445312, 'logps/ref_chosen': -92.37232971191406, 'logps/ref_rejected': -94.62757110595703, 'logits/chosen': -6.967649936676025, 'logits/rejected': -6.795099258422852, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 
0.030381573364138603, 'kl/avg_steps': 0.5625, 'epoch': 0.39} 39%|███████████████████████████████████████████▉ | 265/681 [18:49<20:48, 3.00s/it] 39%|████████████████████████████████████████████▏ | 266/681 [18:52<20:54, 3.02s/it] {'loss': 0.9138, 'grad_norm': 15.880788803100586, 'learning_rate': 3.8378328531967507e-07, 'rewards/chosen': -1.2558131217956543, 'rewards/rejected': -2.005937099456787, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7501237392425537, 'logps/chosen': -143.80422973632812, 'logps/rejected': -141.06625366210938, 'logps/ref_chosen': -102.20002746582031, 'logps/ref_rejected': -74.36642456054688, 'logits/chosen': -7.422516345977783, 'logits/rejected': -7.111500263214111, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.030211633071303368, 'kl/avg_steps': 0.34375, 'epoch': 0.39} 39%|████████████████████████████████████████████▏ | 266/681 [18:52<20:54, 3.02s/it] 39%|████████████████████████████████████████████▎ | 267/681 [18:55<21:07, 3.06s/it] {'loss': 0.7978, 'grad_norm': 13.670655250549316, 'learning_rate': 3.8269741136960646e-07, 'rewards/chosen': -1.04282808303833, 'rewards/rejected': -1.9617595672607422, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9189317226409912, 'logps/chosen': -139.10177612304688, 'logps/rejected': -161.63311767578125, 'logps/ref_chosen': -104.28599548339844, 'logps/ref_rejected': -95.98719024658203, 'logits/chosen': -7.119365692138672, 'logits/rejected': -6.709759712219238, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.0301081370562315, 'kl/avg_steps': 0.65625, 'epoch': 0.39} 39%|████████████████████████████████████████████▎ | 267/681 [18:55<21:07, 3.06s/it] 39%|████████████████████████████████████████████▍ | 268/681 [18:58<21:13, 3.08s/it] {'loss': 0.7894, 'grad_norm': 13.244879722595215, 'learning_rate': 3.8160804071933894e-07, 'rewards/chosen': -1.030745029449463, 'rewards/rejected': -1.9844391345977783, 'rewards/accuracies': 0.875, 
'rewards/margins': 0.9536941051483154, 'logps/chosen': -121.36919403076172, 'logps/rejected': -176.08946228027344, 'logps/ref_chosen': -86.69622039794922, 'logps/ref_rejected': -109.19183349609375, 'logits/chosen': -7.436177730560303, 'logits/rejected': -6.697609901428223, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.029911840334534645, 'kl/avg_steps': 0.75, 'epoch': 0.39} 39%|████████████████████████████████████████████▍ | 268/681 [18:58<21:13, 3.08s/it] 40%|████████████████████████████████████████████▋ | 269/681 [19:01<20:51, 3.04s/it] {'loss': 0.8031, 'grad_norm': 13.288008689880371, 'learning_rate': 3.8051520207480204e-07, 'rewards/chosen': -1.0468546152114868, 'rewards/rejected': -2.0211329460144043, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9742782115936279, 'logps/chosen': -140.38177490234375, 'logps/rejected': -181.06021118164062, 'logps/ref_chosen': -104.97181701660156, 'logps/ref_rejected': -112.4764633178711, 'logits/chosen': -7.45705509185791, 'logits/rejected': -6.695377826690674, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.02968917228281498, 'kl/avg_steps': 0.5625, 'epoch': 0.4} 40%|████████████████████████████████████████████▋ | 269/681 [19:01<20:51, 3.04s/it] 40%|████████████████████████████████████████████▊ | 270/681 [19:04<20:59, 3.07s/it] {'loss': 0.9174, 'grad_norm': 13.801987648010254, 'learning_rate': 3.794189242333106e-07, 'rewards/chosen': -1.1459414958953857, 'rewards/rejected': -2.0066967010498047, 'rewards/accuracies': 0.75, 'rewards/margins': 0.8607551455497742, 'logps/chosen': -139.99703979492188, 'logps/rejected': -186.19866943359375, 'logps/ref_chosen': -101.07383728027344, 'logps/ref_rejected': -117.75289916992188, 'logits/chosen': -7.553506851196289, 'logits/rejected': -6.664368152618408, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.029523104429244995, 'kl/avg_steps': 0.5625, 'epoch': 0.4} 
40%|████████████████████████████████████████████▊ | 270/681 [19:05<20:59, 3.07s/it] 40%|████████████████████████████████████████████▉ | 271/681 [19:07<20:27, 2.99s/it] {'loss': 0.8123, 'grad_norm': 12.916444778442383, 'learning_rate': 3.7831923608280514e-07, 'rewards/chosen': -0.9474867582321167, 'rewards/rejected': -1.9652299880981445, 'rewards/accuracies': 0.890625, 'rewards/margins': 1.0177432298660278, 'logps/chosen': -129.179443359375, 'logps/rejected': -166.06207275390625, 'logps/ref_chosen': -96.72459411621094, 'logps/ref_rejected': -98.5244140625, 'logits/chosen': -6.938437461853027, 'logits/rejected': -6.5058183670043945, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.029357966035604477, 'kl/avg_steps': 0.75, 'epoch': 0.4} 40%|████████████████████████████████████████████▉ | 271/681 [19:07<20:27, 2.99s/it] 40%|█████████████████████████████████████████████▏ | 272/681 [19:10<20:39, 3.03s/it] {'loss': 0.8282, 'grad_norm': 14.018041610717773, 'learning_rate': 3.772161666010912e-07, 'rewards/chosen': -0.7937983870506287, 'rewards/rejected': -1.7605557441711426, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.9667574167251587, 'logps/chosen': -108.3265380859375, 'logps/rejected': -169.4478759765625, 'logps/ref_chosen': -80.97721862792969, 'logps/ref_rejected': -108.55535888671875, 'logits/chosen': -7.514509677886963, 'logits/rejected': -7.041294574737549, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.02913942001760006, 'kl/avg_steps': 0.625, 'epoch': 0.4} 40%|█████████████████████████████████████████████▏ | 272/681 [19:10<20:39, 3.03s/it] 40%|█████████████████████████████████████████████▎ | 273/681 [19:13<20:27, 3.01s/it] {'loss': 0.8615, 'grad_norm': 13.580171585083008, 'learning_rate': 3.761097448550755e-07, 'rewards/chosen': -1.0409189462661743, 'rewards/rejected': -2.039475440979004, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9985566139221191, 'logps/chosen': -128.26251220703125, 
'logps/rejected': -168.29635620117188, 'logps/ref_chosen': -92.22460174560547, 'logps/ref_rejected': -97.3630599975586, 'logits/chosen': -7.261774063110352, 'logits/rejected': -6.644618988037109, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.028958430513739586, 'kl/avg_steps': 0.59375, 'epoch': 0.4} 40%|█████████████████████████████████████████████▎ | 273/681 [19:13<20:27, 3.01s/it] 40%|█████████████████████████████████████████████▍ | 274/681 [19:16<20:20, 3.00s/it] {'loss': 0.8819, 'grad_norm': 15.386717796325684, 'learning_rate': 3.75e-07, 'rewards/chosen': -1.0782254934310913, 'rewards/rejected': -1.8726218938827515, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7943964004516602, 'logps/chosen': -119.94572448730469, 'logps/rejected': -149.46990966796875, 'logps/ref_chosen': -82.2608871459961, 'logps/ref_rejected': -83.87699127197266, 'logits/chosen': -7.7138991355896, 'logits/rejected': -6.974665641784668, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.028787503018975258, 'kl/avg_steps': 0.75, 'epoch': 0.4} 40%|█████████████████████████████████████████████▍ | 274/681 [19:16<20:20, 3.00s/it] 40%|█████████████████████████████████████████████▋ | 275/681 [19:20<21:08, 3.12s/it] {'loss': 0.9047, 'grad_norm': 15.320517539978027, 'learning_rate': 3.738869612786737e-07, 'rewards/chosen': -0.8999394178390503, 'rewards/rejected': -1.7487740516662598, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.8488346338272095, 'logps/chosen': -111.30049896240234, 'logps/rejected': -160.444580078125, 'logps/ref_chosen': -79.68695831298828, 'logps/ref_rejected': -98.7509765625, 'logits/chosen': -7.628604888916016, 'logits/rejected': -6.829320907592773, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.028573205694556236, 'kl/avg_steps': 0.71875, 'epoch': 0.4} 40%|█████████████████████████████████████████████▋ | 275/681 [19:20<21:08, 3.12s/it] 41%|█████████████████████████████████████████████▊ | 
276/681 [19:23<20:25, 3.03s/it] {'loss': 0.9138, 'grad_norm': 15.482325553894043, 'learning_rate': 3.7277065802070204e-07, 'rewards/chosen': -1.1226071119308472, 'rewards/rejected': -1.9377222061157227, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8151148557662964, 'logps/chosen': -126.2287826538086, 'logps/rejected': -146.60427856445312, 'logps/ref_chosen': -86.53970336914062, 'logps/ref_rejected': -77.85394287109375, 'logits/chosen': -7.228045463562012, 'logits/rejected': -6.846656799316406, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.028369300067424774, 'kl/avg_steps': 0.53125, 'epoch': 0.41} 41%|█████████████████████████████████████████████▊ | 276/681 [19:23<20:25, 3.03s/it] 41%|█████████████████████████████████████████████▉ | 277/681 [19:25<19:39, 2.92s/it] {'loss': 0.898, 'grad_norm': 15.835052490234375, 'learning_rate': 3.71651119641714e-07, 'rewards/chosen': -1.0951545238494873, 'rewards/rejected': -1.8525817394256592, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.7574270963668823, 'logps/chosen': -123.2606430053711, 'logps/rejected': -164.99749755859375, 'logps/ref_chosen': -84.24411010742188, 'logps/ref_rejected': -98.83421325683594, 'logits/chosen': -7.30094051361084, 'logits/rejected': -6.272882461547852, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.02821938507258892, 'kl/avg_steps': 0.65625, 'epoch': 0.41} 41%|█████████████████████████████████████████████▉ | 277/681 [19:25<19:39, 2.92s/it] 41%|██████████████████████████████████████████████▏ | 278/681 [19:28<19:56, 2.97s/it] {'loss': 0.9088, 'grad_norm': 13.882027626037598, 'learning_rate': 3.705283756425872e-07, 'rewards/chosen': -1.0521965026855469, 'rewards/rejected': -1.8657093048095703, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.8135128021240234, 'logps/chosen': -120.13731384277344, 'logps/rejected': -164.93141174316406, 'logps/ref_chosen': -82.431884765625, 'logps/ref_rejected': -97.85691833496094, 
'logits/chosen': -7.213944435119629, 'logits/rejected': -6.768559455871582, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.028035402297973633, 'kl/avg_steps': 0.65625, 'epoch': 0.41} 41%|██████████████████████████████████████████████▏ | 278/681 [19:28<19:56, 2.97s/it] 41%|██████████████████████████████████████████████▎ | 279/681 [19:31<19:54, 2.97s/it] {'loss': 0.8719, 'grad_norm': 13.638445854187012, 'learning_rate': 3.6940245560867e-07, 'rewards/chosen': -1.119484305381775, 'rewards/rejected': -1.9745573997497559, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.8550732135772705, 'logps/chosen': -125.54033660888672, 'logps/rejected': -165.55284118652344, 'logps/ref_chosen': -85.16799926757812, 'logps/ref_rejected': -94.12664794921875, 'logits/chosen': -7.466989517211914, 'logits/rejected': -7.357805252075195, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.02785261906683445, 'kl/avg_steps': 0.59375, 'epoch': 0.41} 41%|██████████████████████████████████████████████▎ | 279/681 [19:31<19:54, 2.97s/it] 41%|██████████████████████████████████████████████▍ | 280/681 [19:35<20:29, 3.06s/it] {'loss': 0.7947, 'grad_norm': 13.675751686096191, 'learning_rate': 3.6827338920900253e-07, 'rewards/chosen': -1.0486705303192139, 'rewards/rejected': -2.0240159034729004, 'rewards/accuracies': 0.875, 'rewards/margins': 0.9753453731536865, 'logps/chosen': -123.91621398925781, 'logps/rejected': -177.7848663330078, 'logps/ref_chosen': -85.85641479492188, 'logps/ref_rejected': -104.11859130859375, 'logits/chosen': -7.190787315368652, 'logits/rejected': -6.597271919250488, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.027688222005963326, 'kl/avg_steps': 0.625, 'epoch': 0.41} 41%|██████████████████████████████████████████████▍ | 280/681 [19:35<20:29, 3.06s/it] 41%|██████████████████████████████████████████████▋ | 281/681 [19:38<20:49, 3.12s/it] {'loss': 0.8278, 'grad_norm': 11.921332359313965, 
'learning_rate': 3.6714120619553435e-07, 'rewards/chosen': -1.0626280307769775, 'rewards/rejected': -2.0622613430023193, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.9996333718299866, 'logps/chosen': -125.37873840332031, 'logps/rejected': -164.1964874267578, 'logps/ref_chosen': -86.55081939697266, 'logps/ref_rejected': -88.62866973876953, 'logits/chosen': -7.7915472984313965, 'logits/rejected': -6.586763381958008, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.02751624397933483, 'kl/avg_steps': 0.6875, 'epoch': 0.41} 41%|██████████████████████████████████████████████▋ | 281/681 [19:38<20:49, 3.12s/it] 41%|██████████████████████████████████████████████▊ | 282/681 [19:41<20:45, 3.12s/it] {'loss': 0.6931, 'grad_norm': 11.661370277404785, 'learning_rate': 3.660059364023408e-07, 'rewards/chosen': -1.037402868270874, 'rewards/rejected': -2.1895809173583984, 'rewards/accuracies': 0.953125, 'rewards/margins': 1.152178168296814, 'logps/chosen': -143.371826171875, 'logps/rejected': -183.70306396484375, 'logps/ref_chosen': -105.10511016845703, 'logps/ref_rejected': -102.85336303710938, 'logits/chosen': -7.430984973907471, 'logits/rejected': -6.962902545928955, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'kl/beta': 0.027328362688422203, 'kl/avg_steps': 0.8125, 'epoch': 0.41} 41%|██████████████████████████████████████████████▊ | 282/681 [19:41<20:45, 3.12s/it] 42%|██████████████████████████████████████████████▉ | 283/681 [19:44<20:32, 3.10s/it] {'loss': 0.9734, 'grad_norm': 17.020105361938477, 'learning_rate': 3.6486760974483685e-07, 'rewards/chosen': -1.1950119733810425, 'rewards/rejected': -1.9943175315856934, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7993054389953613, 'logps/chosen': -139.34780883789062, 'logps/rejected': -177.7033233642578, 'logps/ref_chosen': -95.05259704589844, 'logps/ref_rejected': -103.54454803466797, 'logits/chosen': -7.325111389160156, 'logits/rejected': -6.981871604919434, 
'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.02710810862481594, 'kl/avg_steps': 0.65625, 'epoch': 0.42} 42%|██████████████████████████████████████████████▉ | 283/681 [19:44<20:32, 3.10s/it] 42%|███████████████████████████████████████████████ | 284/681 [19:47<20:48, 3.14s/it] {'loss': 0.6944, 'grad_norm': 11.532405853271484, 'learning_rate': 3.6372625621898863e-07, 'rewards/chosen': -0.9900509119033813, 'rewards/rejected': -2.1134815216064453, 'rewards/accuracies': 0.921875, 'rewards/margins': 1.1234304904937744, 'logps/chosen': -124.68060302734375, 'logps/rejected': -177.91339111328125, 'logps/ref_chosen': -87.6664810180664, 'logps/ref_rejected': -98.75103759765625, 'logits/chosen': -7.086446762084961, 'logits/rejected': -6.687787055969238, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.02693137153983116, 'kl/avg_steps': 0.78125, 'epoch': 0.42} 42%|███████████████████████████████████████████████ | 284/681 [19:47<20:48, 3.14s/it] 42%|███████████████████████████████████████████████▎ | 285/681 [19:50<20:30, 3.11s/it] {'loss': 0.8825, 'grad_norm': 18.86406898498535, 'learning_rate': 3.625819059005228e-07, 'rewards/chosen': -1.1481690406799316, 'rewards/rejected': -1.9326767921447754, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7845076322555542, 'logps/chosen': -137.629150390625, 'logps/rejected': -176.97714233398438, 'logps/ref_chosen': -94.43303680419922, 'logps/ref_rejected': -104.07194519042969, 'logits/chosen': -7.379509449005127, 'logits/rejected': -7.0911407470703125, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.026722600683569908, 'kl/avg_steps': 0.6875, 'epoch': 0.42} 42%|███████████████████████████████████████████████▎ | 285/681 [19:50<20:30, 3.11s/it] 42%|███████████████████████████████████████████████▍ | 286/681 [19:53<20:17, 3.08s/it] {'loss': 0.8014, 'grad_norm': 14.356172561645508, 'learning_rate': 3.614345889441346e-07, 'rewards/chosen': 
-1.2212910652160645, 'rewards/rejected': -2.1978063583374023, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9765151143074036, 'logps/chosen': -149.8736572265625, 'logps/rejected': -179.59671020507812, 'logps/ref_chosen': -103.72039794921875, 'logps/ref_rejected': -96.25775909423828, 'logits/chosen': -7.472691535949707, 'logits/rejected': -7.013616561889648, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.026540137827396393, 'kl/avg_steps': 0.5625, 'epoch': 0.42} 42%|███████████████████████████████████████████████▍ | 286/681 [19:53<20:17, 3.08s/it] 42%|███████████████████████████████████████████████▌ | 287/681 [19:56<19:26, 2.96s/it] {'loss': 1.0332, 'grad_norm': 15.217528343200684, 'learning_rate': 3.6028433558269275e-07, 'rewards/chosen': -1.4231257438659668, 'rewards/rejected': -2.0668678283691406, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.6437419652938843, 'logps/chosen': -147.86685180664062, 'logps/rejected': -162.0262451171875, 'logps/ref_chosen': -93.88988494873047, 'logps/ref_rejected': -83.33365631103516, 'logits/chosen': -7.295309066772461, 'logits/rejected': -6.699875831604004, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.026391685009002686, 'kl/avg_steps': 0.34375, 'epoch': 0.42} 42%|███████████████████████████████████████████████▌ | 287/681 [19:56<19:26, 2.96s/it] 42%|███████████████████████████████████████████████▊ | 288/681 [19:59<19:43, 3.01s/it] {'loss': 0.816, 'grad_norm': 14.122811317443848, 'learning_rate': 3.5913117612644327e-07, 'rewards/chosen': -1.3554046154022217, 'rewards/rejected': -2.3520970344543457, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9966922998428345, 'logps/chosen': -139.82644653320312, 'logps/rejected': -183.26959228515625, 'logps/ref_chosen': -88.15602111816406, 'logps/ref_rejected': -93.28195190429688, 'logits/chosen': -7.697492599487305, 'logits/rejected': -7.256779670715332, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 
'kl/beta': 0.026301274076104164, 'kl/avg_steps': 0.53125, 'epoch': 0.42} 42%|███████████████████████████████████████████████▊ | 288/681 [19:59<19:43, 3.01s/it] 42%|███████████████████████████████████████████████▉ | 289/681 [20:02<19:45, 3.03s/it] {'loss': 0.8297, 'grad_norm': 13.091045379638672, 'learning_rate': 3.5797514096221024e-07, 'rewards/chosen': -1.3352906703948975, 'rewards/rejected': -2.3961217403411865, 'rewards/accuracies': 0.796875, 'rewards/margins': 1.0608309507369995, 'logps/chosen': -126.64739227294922, 'logps/rejected': -185.3885498046875, 'logps/ref_chosen': -75.39292907714844, 'logps/ref_rejected': -93.15428161621094, 'logits/chosen': -7.527945518493652, 'logits/rejected': -7.149833679199219, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.02616228722035885, 'kl/avg_steps': 0.59375, 'epoch': 0.42} 42%|███████████████████████████████████████████████▉ | 289/681 [20:02<19:45, 3.03s/it] 43%|████████████████████████████████████████████████ | 290/681 [20:05<19:32, 3.00s/it] {'loss': 0.8033, 'grad_norm': 14.055940628051758, 'learning_rate': 3.568162605525952e-07, 'rewards/chosen': -1.4665530920028687, 'rewards/rejected': -2.53206205368042, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.0655090808868408, 'logps/chosen': -144.79820251464844, 'logps/rejected': -221.4034423828125, 'logps/ref_chosen': -88.0419692993164, 'logps/ref_rejected': -123.21215057373047, 'logits/chosen': -7.388311386108398, 'logits/rejected': -7.0829925537109375, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'kl/beta': 0.026007864624261856, 'kl/avg_steps': 0.8125, 'epoch': 0.43} 43%|████████████████████████████████████████████████ | 290/681 [20:05<19:32, 3.00s/it] 43%|████████████████████████████████████████████████▎ | 291/681 [20:08<19:50, 3.05s/it] {'loss': 0.9127, 'grad_norm': 14.552366256713867, 'learning_rate': 3.5565456543517485e-07, 'rewards/chosen': -1.408342719078064, 'rewards/rejected': -2.232980251312256, 
'rewards/accuracies': 0.859375, 'rewards/margins': 0.8246374130249023, 'logps/chosen': -148.91311645507812, 'logps/rejected': -183.22817993164062, 'logps/ref_chosen': -94.09524536132812, 'logps/ref_rejected': -96.05006408691406, 'logits/chosen': -7.654292106628418, 'logits/rejected': -6.923820972442627, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0257982537150383, 'kl/avg_steps': 0.625, 'epoch': 0.43} 43%|████████████████████████████████████████████████▎ | 291/681 [20:08<19:50, 3.05s/it] 43%|████████████████████████████████████████████████▍ | 292/681 [20:11<19:20, 2.98s/it] {'loss': 0.7522, 'grad_norm': 12.810431480407715, 'learning_rate': 3.5449008622169583e-07, 'rewards/chosen': -1.3947566747665405, 'rewards/rejected': -2.3737521171569824, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.9789954423904419, 'logps/chosen': -142.94708251953125, 'logps/rejected': -189.70559692382812, 'logps/ref_chosen': -88.25041198730469, 'logps/ref_rejected': -96.41764068603516, 'logits/chosen': -7.4952850341796875, 'logits/rejected': -7.096066951751709, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.025638015940785408, 'kl/avg_steps': 0.6875, 'epoch': 0.43} 43%|████████████████████████████████████████████████▍ | 292/681 [20:11<19:20, 2.98s/it] 43%|████████████████████████████████████████████████▌ | 293/681 [20:14<19:27, 3.01s/it] {'loss': 0.9942, 'grad_norm': 13.038102149963379, 'learning_rate': 3.5332285359726846e-07, 'rewards/chosen': -1.4899965524673462, 'rewards/rejected': -2.205336570739746, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.7153399586677551, 'logps/chosen': -146.07388305664062, 'logps/rejected': -172.91517639160156, 'logps/ref_chosen': -87.37654876708984, 'logps/ref_rejected': -85.75579833984375, 'logits/chosen': -7.2822585105896, 'logits/rejected': -6.593290328979492, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.025462958961725235, 'kl/avg_steps': 0.53125, 'epoch': 
0.43} 43%|████████████████████████████████████████████████▌ | 293/681 [20:14<19:27, 3.01s/it] 43%|████████████████████████████████████████████████▊ | 294/681 [20:17<19:20, 3.00s/it] {'loss': 1.0529, 'grad_norm': 14.321362495422363, 'learning_rate': 3.5215289831955786e-07, 'rewards/chosen': -1.5407367944717407, 'rewards/rejected': -2.1650938987731934, 'rewards/accuracies': 0.75, 'rewards/margins': 0.6243571043014526, 'logps/chosen': -134.58004760742188, 'logps/rejected': -174.16912841796875, 'logps/ref_chosen': -73.5079574584961, 'logps/ref_rejected': -88.08877563476562, 'logits/chosen': -6.959033966064453, 'logits/rejected': -6.0898590087890625, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.02532840147614479, 'kl/avg_steps': 0.53125, 'epoch': 0.43} 43%|████████████████████████████████████████████████▊ | 294/681 [20:17<19:20, 3.00s/it] 43%|████████████████████████████████████████████████▉ | 295/681 [20:20<19:01, 2.96s/it] {'loss': 0.8606, 'grad_norm': 13.053278923034668, 'learning_rate': 3.509802512179737e-07, 'rewards/chosen': -1.4831101894378662, 'rewards/rejected': -2.3872218132019043, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9041118025779724, 'logps/chosen': -136.8726043701172, 'logps/rejected': -189.65672302246094, 'logps/ref_chosen': -77.76548767089844, 'logps/ref_rejected': -94.24726867675781, 'logits/chosen': -7.271920680999756, 'logits/rejected': -7.100852012634277, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.025194555521011353, 'kl/avg_steps': 0.59375, 'epoch': 0.43} 43%|████████████████████████████████████████████████▉ | 295/681 [20:20<19:01, 2.96s/it] 43%|█████████████████████████████████████████████████ | 296/681 [20:23<19:19, 3.01s/it] {'loss': 0.8194, 'grad_norm': 13.810120582580566, 'learning_rate': 3.498049431928577e-07, 'rewards/chosen': -1.586834192276001, 'rewards/rejected': -2.5582542419433594, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9714199304580688, 
'logps/chosen': -161.45822143554688, 'logps/rejected': -203.63685607910156, 'logps/ref_chosen': -97.85641479492188, 'logps/ref_rejected': -100.81631469726562, 'logits/chosen': -7.517086505889893, 'logits/rejected': -7.0501275062561035, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.02504584565758705, 'kl/avg_steps': 0.5625, 'epoch': 0.43} 43%|█████████████████████████████████████████████████ | 296/681 [20:23<19:19, 3.01s/it] 44%|█████████████████████████████████████████████████▎ | 297/681 [20:26<19:25, 3.03s/it] {'loss': 0.8867, 'grad_norm': 14.663503646850586, 'learning_rate': 3.486270052146694e-07, 'rewards/chosen': -1.6361867189407349, 'rewards/rejected': -2.453274726867676, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8170878887176514, 'logps/chosen': -154.54638671875, 'logps/rejected': -200.7189178466797, 'logps/ref_chosen': -88.56583404541016, 'logps/ref_rejected': -101.55656433105469, 'logits/chosen': -7.523492813110352, 'logits/rejected': -6.788856506347656, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.024905750527977943, 'kl/avg_steps': 0.5625, 'epoch': 0.44} 44%|█████████████████████████████████████████████████▎ | 297/681 [20:26<19:25, 3.03s/it] 44%|█████████████████████████████████████████████████▍ | 298/681 [20:30<19:58, 3.13s/it] {'loss': 0.8289, 'grad_norm': 16.382911682128906, 'learning_rate': 3.474464683231698e-07, 'rewards/chosen': -1.460301160812378, 'rewards/rejected': -2.431180000305176, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.9708787202835083, 'logps/chosen': -154.0912322998047, 'logps/rejected': -221.15391540527344, 'logps/ref_chosen': -94.88043975830078, 'logps/ref_rejected': -122.31101989746094, 'logits/chosen': -7.475191593170166, 'logits/rejected': -6.408700942993164, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.02476643957197666, 'kl/avg_steps': 0.59375, 'epoch': 0.44} 44%|█████████████████████████████████████████████████▍ | 
298/681 [20:30<19:58, 3.13s/it] 44%|█████████████████████████████████████████████████▌ | 299/681 [20:33<19:41, 3.09s/it] {'loss': 0.785, 'grad_norm': 12.90149974822998, 'learning_rate': 3.462633636266041e-07, 'rewards/chosen': -1.5161972045898438, 'rewards/rejected': -2.5289955139160156, 'rewards/accuracies': 0.859375, 'rewards/margins': 1.0127980709075928, 'logps/chosen': -142.339111328125, 'logps/rejected': -193.05116271972656, 'logps/ref_chosen': -80.40835571289062, 'logps/ref_rejected': -89.53716278076172, 'logits/chosen': -7.417859077453613, 'logits/rejected': -6.741024971008301, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.024620257318019867, 'kl/avg_steps': 0.71875, 'epoch': 0.44} 44%|█████████████████████████████████████████████████▌ | 299/681 [20:33<19:41, 3.09s/it] 44%|█████████████████████████████████████████████████▊ | 300/681 [20:36<19:44, 3.11s/it] {'loss': 0.912, 'grad_norm': 14.46535587310791, 'learning_rate': 3.4507772230088147e-07, 'rewards/chosen': -1.6017869710922241, 'rewards/rejected': -2.454439640045166, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8526525497436523, 'logps/chosen': -153.97088623046875, 'logps/rejected': -202.05059814453125, 'logps/ref_chosen': -88.15890502929688, 'logps/ref_rejected': -100.93919372558594, 'logits/chosen': -7.5312700271606445, 'logits/rejected': -7.016364097595215, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.024444561451673508, 'kl/avg_steps': 0.625, 'epoch': 0.44} 44%|█████████████████████████████████████████████████▊ | 300/681 [20:36<19:44, 3.11s/it][INFO|trainer.py:4307] 2026-04-24 04:36:43,351 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-24 04:36:43,352 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-24 04:36:43,352 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-24 04:42:37,685 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-24 04:42:37,685 >> Batch size = 
8 0%| | 0/73 [00:00> Saving model checkpoint to /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400 [INFO|configuration_utils.py:419] 2026-04-24 04:43:40,471 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400/config.json [INFO|configuration_utils.py:911] 2026-04-24 04:43:40,476 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-24 04:44:19,679 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-24 04:44:19,683 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-24 04:44:19,689 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400/special_tokens_map.json 59%|█████████████████████████████████████████████████████████████████▎ | 401/681 [31:15<6:48:08, 87.46s/it] {'loss': 1.0429, 'grad_norm': 15.558944702148438, 'learning_rate': 2.1800473436235136e-07, 'rewards/chosen': -1.4610803127288818, 'rewards/rejected': -2.123840808868408, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.6627606749534607, 'logps/chosen': -197.28976440429688, 'logps/rejected': -255.7905731201172, 'logps/ref_chosen': -84.28916931152344, 'logps/ref_rejected': -90.943115234375, 'logits/chosen': -7.620119094848633, 'logits/rejected': 
-7.406888961791992, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'kl/beta': 0.012954174540936947, 'kl/avg_steps': 0.375, 'epoch': 0.59} 59%|█████████████████████████████████████████████████████████████████▎ | 401/681 [31:15<6:48:08, 87.46s/it] 59%|█████████████████████████████████████████████████████████████████▌ | 402/681 [31:17<4:48:32, 62.05s/it] {'loss': 0.6486, 'grad_norm': 12.638237953186035, 'learning_rate': 2.1673238449588665e-07, 'rewards/chosen': -1.3018509149551392, 'rewards/rejected': -2.5145905017852783, 'rewards/accuracies': 0.90625, 'rewards/margins': 1.2127395868301392, 'logps/chosen': -185.138916015625, 'logps/rejected': -284.30706787109375, 'logps/ref_chosen': -83.59312438964844, 'logps/ref_rejected': -87.81027221679688, 'logits/chosen': -7.716352939605713, 'logits/rejected': -7.249260902404785, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.012905777432024479, 'kl/avg_steps': 0.78125, 'epoch': 0.59} 59%|█████████████████████████████████████████████████████████████████▌ | 402/681 [31:18<4:48:32, 62.05s/it] 59%|█████████████████████████████████████████████████████████████████▋ | 403/681 [31:21<3:25:34, 44.37s/it] {'loss': 0.8488, 'grad_norm': 11.707761764526367, 'learning_rate': 2.154609112620295e-07, 'rewards/chosen': -1.3442716598510742, 'rewards/rejected': -2.2409329414367676, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.8966612219810486, 'logps/chosen': -179.35401916503906, 'logps/rejected': -260.35247802734375, 'logps/ref_chosen': -73.75308227539062, 'logps/ref_rejected': -83.92012786865234, 'logits/chosen': -7.694974899291992, 'logits/rejected': -6.938326835632324, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.012805732898414135, 'kl/avg_steps': 0.71875, 'epoch': 0.59} 59%|█████████████████████████████████████████████████████████████████▋ | 403/681 [31:21<3:25:34, 44.37s/it] 59%|█████████████████████████████████████████████████████████████████▊ | 404/681 
[31:24<2:27:45, 32.01s/it] {'loss': 0.9664, 'grad_norm': 17.042625427246094, 'learning_rate': 2.1419034816528218e-07, 'rewards/chosen': -1.567918300628662, 'rewards/rejected': -2.3647704124450684, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7968522906303406, 'logps/chosen': -203.4947509765625, 'logps/rejected': -271.59332275390625, 'logps/ref_chosen': -79.67617797851562, 'logps/ref_rejected': -84.280517578125, 'logits/chosen': -7.690559387207031, 'logits/rejected': -7.393105983734131, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.01271434873342514, 'kl/avg_steps': 0.59375, 'epoch': 0.59} 59%|█████████████████████████████████████████████████████████████████▊ | 404/681 [31:24<2:27:45, 32.01s/it] 59%|██████████████████████████████████████████████████████████████████ | 405/681 [31:27<1:47:06, 23.28s/it] {'loss': 0.8803, 'grad_norm': 11.477922439575195, 'learning_rate': 2.129207286861638e-07, 'rewards/chosen': -1.6104973554611206, 'rewards/rejected': -2.3875961303710938, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7770987749099731, 'logps/chosen': -224.55743408203125, 'logps/rejected': -283.11083984375, 'logps/ref_chosen': -96.46195220947266, 'logps/ref_rejected': -92.87071228027344, 'logits/chosen': -7.6790571212768555, 'logits/rejected': -7.126956462860107, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.012639302760362625, 'kl/avg_steps': 0.65625, 'epoch': 0.59} 59%|██████████████████████████████████████████████████████████████████ | 405/681 [31:27<1:47:06, 23.28s/it] 60%|██████████████████████████████████████████████████████████████████▏ | 406/681 [31:30<1:18:36, 17.15s/it] {'loss': 0.9891, 'grad_norm': 16.791955947875977, 'learning_rate': 2.1165208628032861e-07, 'rewards/chosen': -1.5131518840789795, 'rewards/rejected': -2.2241034507751465, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.7109516859054565, 'logps/chosen': -199.13026428222656, 'logps/rejected': -276.6565246582031, 
'logps/ref_chosen': -78.13396453857422, 'logps/ref_rejected': -98.28359985351562, 'logits/chosen': -7.518543720245361, 'logits/rejected': -7.182583808898926, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.012556898407638073, 'kl/avg_steps': 0.59375, 'epoch': 0.6} 60%|██████████████████████████████████████████████████████████████████▏ | 406/681 [31:30<1:18:36, 17.15s/it] 60%|███████████████████████████████████████████████████████████████████▌ | 407/681 [31:33<59:11, 12.96s/it] {'loss': 0.9939, 'grad_norm': 14.285913467407227, 'learning_rate': 2.1038445437768375e-07, 'rewards/chosen': -1.5919959545135498, 'rewards/rejected': -2.248244285583496, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.6562482118606567, 'logps/chosen': -212.05181884765625, 'logps/rejected': -264.0316162109375, 'logps/ref_chosen': -84.01283264160156, 'logps/ref_rejected': -82.78103637695312, 'logits/chosen': -7.507756233215332, 'logits/rejected': -6.921146392822266, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.012482781894505024, 'kl/avg_steps': 0.5625, 'epoch': 0.6} 60%|███████████████████████████████████████████████████████████████████▌ | 407/681 [31:33<59:11, 12.96s/it] 60%|███████████████████████████████████████████████████████████████████▋ | 408/681 [31:36<45:32, 10.01s/it] {'loss': 1.011, 'grad_norm': 13.731730461120605, 'learning_rate': 2.0911786638150872e-07, 'rewards/chosen': -1.480905294418335, 'rewards/rejected': -2.146219253540039, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.6653140783309937, 'logps/chosen': -224.00033569335938, 'logps/rejected': -270.20428466796875, 'logps/ref_chosen': -104.46175384521484, 'logps/ref_rejected': -96.37218475341797, 'logits/chosen': -7.530098915100098, 'logits/rejected': -7.0652008056640625, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.012412958778440952, 'kl/avg_steps': 0.421875, 'epoch': 0.6} 
60%|███████████████████████████████████████████████████████████████████▋ | 408/681 [31:36<45:32, 10.01s/it] 60%|███████████████████████████████████████████████████████████████████▊ | 409/681 [31:39<36:12, 7.99s/it] {'loss': 1.0303, 'grad_norm': 12.978145599365234, 'learning_rate': 2.0785235566757517e-07, 'rewards/chosen': -1.6262993812561035, 'rewards/rejected': -2.277886152267456, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6515868306159973, 'logps/chosen': -229.72506713867188, 'logps/rejected': -275.54852294921875, 'logps/ref_chosen': -97.66830444335938, 'logps/ref_rejected': -90.04584503173828, 'logits/chosen': -7.490418434143066, 'logits/rejected': -7.061441898345947, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.012360811233520508, 'kl/avg_steps': 0.5625, 'epoch': 0.6} 60%|███████████████████████████████████████████████████████████████████▊ | 409/681 [31:39<36:12, 7.99s/it] 60%|████████████████████████████████████████████████████████████████████ | 410/681 [31:42<29:37, 6.56s/it] {'loss': 0.9507, 'grad_norm': 12.724774360656738, 'learning_rate': 2.065879555832674e-07, 'rewards/chosen': -1.5810291767120361, 'rewards/rejected': -2.278393268585205, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.697364091873169, 'logps/chosen': -205.697265625, 'logps/rejected': -275.3063049316406, 'logps/ref_chosen': -76.46923828125, 'logps/ref_rejected': -88.64064025878906, 'logits/chosen': -7.461033821105957, 'logits/rejected': -7.218528747558594, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.01229167077690363, 'kl/avg_steps': 0.578125, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████ | 410/681 [31:42<29:37, 6.56s/it] 60%|████████████████████████████████████████████████████████████████████▏ | 411/681 [31:45<24:16, 5.39s/it] {'loss': 0.9201, 'grad_norm': 11.978557586669922, 'learning_rate': 2.0532469944670343e-07, 'rewards/chosen': -1.6353148221969604, 'rewards/rejected': 
-2.3577518463134766, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7224369049072266, 'logps/chosen': -221.50474548339844, 'logps/rejected': -281.2148742675781, 'logps/ref_chosen': -87.16630554199219, 'logps/ref_rejected': -87.09603118896484, 'logits/chosen': -7.684117317199707, 'logits/rejected': -7.11949348449707, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.012221017852425575, 'kl/avg_steps': 0.53125, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▏ | 411/681 [31:45<24:16, 5.39s/it] 60%|████████████████████████████████████████████████████████████████████▎ | 412/681 [31:48<21:05, 4.70s/it] {'loss': 1.0053, 'grad_norm': 13.545778274536133, 'learning_rate': 2.0406262054585738e-07, 'rewards/chosen': -1.5305845737457275, 'rewards/rejected': -2.215937614440918, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.6853532791137695, 'logps/chosen': -205.3089599609375, 'logps/rejected': -289.611572265625, 'logps/ref_chosen': -78.94734191894531, 'logps/ref_rejected': -106.10554504394531, 'logits/chosen': -7.71500301361084, 'logits/rejected': -6.960906982421875, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.012156437151134014, 'kl/avg_steps': 0.46875, 'epoch': 0.6} 60%|████████████████████████████████████████████████████████████████████▎ | 412/681 [31:48<21:05, 4.70s/it] 61%|████████████████████████████████████████████████████████████████████▌ | 413/681 [31:51<18:46, 4.20s/it] {'loss': 0.9401, 'grad_norm': 12.726061820983887, 'learning_rate': 2.0280175213768205e-07, 'rewards/chosen': -1.650810956954956, 'rewards/rejected': -2.4081544876098633, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7573432922363281, 'logps/chosen': -232.97862243652344, 'logps/rejected': -308.6137390136719, 'logps/ref_chosen': -95.69471740722656, 'logps/ref_rejected': -107.96085357666016, 'logits/chosen': -7.195650100708008, 'logits/rejected': -7.105502128601074, 'kl/p_epsilon_steps': 
0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.012099719606339931, 'kl/avg_steps': 0.71875, 'epoch': 0.61} 61%|████████████████████████████████████████████████████████████████████▌ | 413/681 [31:51<18:46, 4.20s/it] 61%|████████████████████████████████████████████████████████████████████▋ | 414/681 [31:54<17:14, 3.87s/it] {'loss': 0.9978, 'grad_norm': 16.81880760192871, 'learning_rate': 2.0154212744723247e-07, 'rewards/chosen': -1.5135799646377563, 'rewards/rejected': -2.220407009124756, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.7068269848823547, 'logps/chosen': -214.6531524658203, 'logps/rejected': -278.8741455078125, 'logps/ref_chosen': -88.27667236328125, 'logps/ref_rejected': -92.87004089355469, 'logits/chosen': -7.222269058227539, 'logits/rejected': -6.936489105224609, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.012013372965157032, 'kl/avg_steps': 0.53125, 'epoch': 0.61} 61%|████████████████████████████████████████████████████████████████████▋ | 414/681 [31:54<17:14, 3.87s/it] 61%|████████████████████████████████████████████████████████████████████▊ | 415/681 [31:58<16:39, 3.76s/it] {'loss': 1.0108, 'grad_norm': 13.74764633178711, 'learning_rate': 2.002837796667909e-07, 'rewards/chosen': -1.5901520252227783, 'rewards/rejected': -2.227059841156006, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6369077563285828, 'logps/chosen': -242.48207092285156, 'logps/rejected': -295.0639953613281, 'logps/ref_chosen': -108.91590118408203, 'logps/ref_rejected': -107.47135925292969, 'logits/chosen': -7.991650104522705, 'logits/rejected': -7.442322731018066, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.011949889361858368, 'kl/avg_steps': 0.5625, 'epoch': 0.61} 61%|████████████████████████████████████████████████████████████████████▊ | 415/681 [31:58<16:39, 3.76s/it] 61%|█████████████████████████████████████████████████████████████████████ | 416/681 [32:01<15:41, 3.55s/it] {'loss': 0.875, 
'grad_norm': 11.695809364318848, 'learning_rate': 1.990267419549914e-07, 'rewards/chosen': -1.556352138519287, 'rewards/rejected': -2.4217967987060547, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.8654446601867676, 'logps/chosen': -224.93905639648438, 'logps/rejected': -302.91412353515625, 'logps/ref_chosen': -93.39888000488281, 'logps/ref_rejected': -97.6729736328125, 'logits/chosen': -7.476058006286621, 'logits/rejected': -7.21858024597168, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.01188304740935564, 'kl/avg_steps': 0.5625, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████ | 416/681 [32:01<15:41, 3.55s/it] 61%|█████████████████████████████████████████████████████████████████████▏ | 417/681 [32:04<14:43, 3.34s/it] {'loss': 0.8896, 'grad_norm': 12.070055961608887, 'learning_rate': 1.9777104743594686e-07, 'rewards/chosen': -1.4651763439178467, 'rewards/rejected': -2.2487990856170654, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7836226224899292, 'logps/chosen': -208.16903686523438, 'logps/rejected': -266.1756591796875, 'logps/ref_chosen': -83.53533172607422, 'logps/ref_rejected': -74.44184112548828, 'logits/chosen': -7.844758033752441, 'logits/rejected': -6.70277214050293, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.011816578917205334, 'kl/avg_steps': 0.6875, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▏ | 417/681 [32:04<14:43, 3.34s/it] 61%|█████████████████████████████████████████████████████████████████████▎ | 418/681 [32:07<14:28, 3.30s/it] {'loss': 0.7851, 'grad_norm': 13.669163703918457, 'learning_rate': 1.965167291983757e-07, 'rewards/chosen': -1.3652478456497192, 'rewards/rejected': -2.3257837295532227, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.9605357646942139, 'logps/chosen': -225.50465393066406, 'logps/rejected': -311.86407470703125, 'logps/ref_chosen': -108.22152709960938, 
'logps/ref_rejected': -111.8646469116211, 'logits/chosen': -7.78868293762207, 'logits/rejected': -7.488645553588867, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'kl/beta': 0.011735894717276096, 'kl/avg_steps': 0.84375, 'epoch': 0.61} 61%|█████████████████████████████████████████████████████████████████████▎ | 418/681 [32:07<14:28, 3.30s/it] 62%|█████████████████████████████████████████████████████████████████████▌ | 419/681 [32:10<14:24, 3.30s/it] {'loss': 0.8759, 'grad_norm': 11.460182189941406, 'learning_rate': 1.9526382029472988e-07, 'rewards/chosen': -1.4567546844482422, 'rewards/rejected': -2.2854669094085693, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8287121057510376, 'logps/chosen': -223.0901641845703, 'logps/rejected': -296.13623046875, 'logps/ref_chosen': -97.18328094482422, 'logps/ref_rejected': -98.18531799316406, 'logits/chosen': -7.678118705749512, 'logits/rejected': -7.43233585357666, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.011637701652944088, 'kl/avg_steps': 0.71875, 'epoch': 0.62} 62%|█████████████████████████████████████████████████████████████████████▌ | 419/681 [32:10<14:24, 3.30s/it] 62%|█████████████████████████████████████████████████████████████████████▋ | 420/681 [32:13<13:59, 3.22s/it] {'loss': 0.9212, 'grad_norm': 14.639280319213867, 'learning_rate': 1.9401235374032425e-07, 'rewards/chosen': -1.4852039813995361, 'rewards/rejected': -2.24064040184021, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7554365396499634, 'logps/chosen': -243.3070831298828, 'logps/rejected': -270.84930419921875, 'logps/ref_chosen': -114.30847930908203, 'logps/ref_rejected': -75.68356323242188, 'logits/chosen': -7.911190509796143, 'logits/rejected': -6.9983086585998535, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.011554652824997902, 'kl/avg_steps': 0.5625, 'epoch': 0.62} 62%|█████████████████████████████████████████████████████████████████████▋ | 420/681 
[32:13<13:59, 3.22s/it] 62%|█████████████████████████████████████████████████████████████████████▊ | 421/681 [32:16<13:54, 3.21s/it] {'loss': 0.8958, 'grad_norm': 11.352572441101074, 'learning_rate': 1.9276236251246653e-07, 'rewards/chosen': -1.4818620681762695, 'rewards/rejected': -2.2694497108459473, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7875877618789673, 'logps/chosen': -215.33929443359375, 'logps/rejected': -295.12469482421875, 'logps/ref_chosen': -85.87985229492188, 'logps/ref_rejected': -96.33648681640625, 'logits/chosen': -7.578913688659668, 'logits/rejected': -6.917729377746582, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.011490020900964737, 'kl/avg_steps': 0.5625, 'epoch': 0.62} 62%|█████████████████████████████████████████████████████████████████████▊ | 421/681 [32:16<13:54, 3.21s/it] 62%|██████████████████████████████████████████████████████████████████████ | 422/681 [32:20<13:48, 3.20s/it] {'loss': 0.8898, 'grad_norm': 11.291938781738281, 'learning_rate': 1.9151387954958792e-07, 'rewards/chosen': -1.409227728843689, 'rewards/rejected': -2.1822290420532227, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7730013132095337, 'logps/chosen': -224.50631713867188, 'logps/rejected': -286.848388671875, 'logps/ref_chosen': -100.48060607910156, 'logps/ref_rejected': -94.40821838378906, 'logits/chosen': -7.927703857421875, 'logits/rejected': -7.495624542236328, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.011425751261413097, 'kl/avg_steps': 0.6875, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████ | 422/681 [32:20<13:48, 3.20s/it] 62%|██████████████████████████████████████████████████████████████████████▏ | 423/681 [32:22<13:19, 3.10s/it] {'loss': 0.9664, 'grad_norm': 12.197092056274414, 'learning_rate': 1.902669377503756e-07, 'rewards/chosen': -1.364768624305725, 'rewards/rejected': -2.0834240913391113, 'rewards/accuracies': 0.71875, 
'rewards/margins': 0.718655526638031, 'logps/chosen': -198.95823669433594, 'logps/rejected': -276.6893310546875, 'logps/ref_chosen': -78.44993591308594, 'logps/ref_rejected': -92.04652404785156, 'logits/chosen': -7.66268253326416, 'logits/rejected': -7.711108207702637, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.011347736231982708, 'kl/avg_steps': 0.4375, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▏ | 423/681 [32:22<13:19, 3.10s/it] 62%|██████████████████████████████████████████████████████████████████████▎ | 424/681 [32:26<13:23, 3.13s/it] {'loss': 0.9078, 'grad_norm': 11.235798835754395, 'learning_rate': 1.890215699729057e-07, 'rewards/chosen': -1.4883897304534912, 'rewards/rejected': -2.245741367340088, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.7573517560958862, 'logps/chosen': -220.01161193847656, 'logps/rejected': -272.5384521484375, 'logps/ref_chosen': -87.6423568725586, 'logps/ref_rejected': -72.36566162109375, 'logits/chosen': -7.801157474517822, 'logits/rejected': -6.635551452636719, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.011298305355012417, 'kl/avg_steps': 0.59375, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▎ | 424/681 [32:26<13:23, 3.13s/it] 62%|██████████████████████████████████████████████████████████████████████▌ | 425/681 [32:29<13:03, 3.06s/it] {'loss': 1.1393, 'grad_norm': 13.440584182739258, 'learning_rate': 1.8777780903377732e-07, 'rewards/chosen': -1.5395094156265259, 'rewards/rejected': -2.0465922355651855, 'rewards/accuracies': 0.625, 'rewards/margins': 0.5070829391479492, 'logps/chosen': -215.4922332763672, 'logps/rejected': -285.57012939453125, 'logps/ref_chosen': -78.51979064941406, 'logps/ref_rejected': -102.74864196777344, 'logits/chosen': -7.429314613342285, 'logits/rejected': -7.256626605987549, 'kl/p_epsilon_steps': 0.59375, 'kl/n_epsilon_steps': 0.40625, 'kl/beta': 
0.011231618002057076, 'kl/avg_steps': 0.1875, 'epoch': 0.62} 62%|██████████████████████████████████████████████████████████████████████▌ | 425/681 [32:29<13:03, 3.06s/it] 63%|██████████████████████████████████████████████████████████████████████▋ | 426/681 [32:32<13:20, 3.14s/it] {'loss': 0.8943, 'grad_norm': 13.104548454284668, 'learning_rate': 1.8653568770724803e-07, 'rewards/chosen': -1.3247995376586914, 'rewards/rejected': -2.0728135108947754, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.7480140924453735, 'logps/chosen': -227.1387939453125, 'logps/rejected': -274.4301452636719, 'logps/ref_chosen': -108.50582885742188, 'logps/ref_rejected': -88.300048828125, 'logits/chosen': -7.60335636138916, 'logits/rejected': -7.2415266036987305, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.011210598051548004, 'kl/avg_steps': 0.5625, 'epoch': 0.63} 63%|██████████████████████████████████████████████████████████████████████▋ | 426/681 [32:32<13:20, 3.14s/it] 63%|██████████████████████████████████████████████████████████████████████▊ | 427/681 [32:35<13:38, 3.22s/it] {'loss': 0.9039, 'grad_norm': 10.724087715148926, 'learning_rate': 1.8529523872436977e-07, 'rewards/chosen': -1.119832158088684, 'rewards/rejected': -1.820410966873169, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7005788087844849, 'logps/chosen': -200.22837829589844, 'logps/rejected': -250.37628173828125, 'logps/ref_chosen': -99.12046813964844, 'logps/ref_rejected': -85.724609375, 'logits/chosen': -7.953921794891357, 'logits/rejected': -7.36940860748291, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.01114789117127657, 'kl/avg_steps': 0.75, 'epoch': 0.63} 63%|██████████████████████████████████████████████████████████████████████▊ | 427/681 [32:35<13:38, 3.22s/it] 63%|███████████████████████████████████████████████████████████████████████ | 428/681 [32:38<13:37, 3.23s/it] {'loss': 1.0721, 'grad_norm': 12.104276657104492, 'learning_rate': 
1.8405649477212697e-07, 'rewards/chosen': -1.520464301109314, 'rewards/rejected': -2.0730321407318115, 'rewards/accuracies': 0.75, 'rewards/margins': 0.5525679588317871, 'logps/chosen': -243.97320556640625, 'logps/rejected': -297.7687683105469, 'logps/ref_chosen': -105.96925354003906, 'logps/ref_rejected': -109.1021728515625, 'logits/chosen': -7.528921604156494, 'logits/rejected': -7.260353088378906, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.011064904741942883, 'kl/avg_steps': 0.5625, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████ | 428/681 [32:39<13:37, 3.23s/it] 63%|███████████████████████████████████████████████████████████████████████▏ | 429/681 [32:42<13:31, 3.22s/it] {'loss': 0.9249, 'grad_norm': 13.0979642868042, 'learning_rate': 1.828194884925749e-07, 'rewards/chosen': -1.363885760307312, 'rewards/rejected': -2.127699851989746, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7638142108917236, 'logps/chosen': -237.86270141601562, 'logps/rejected': -292.80511474609375, 'logps/ref_chosen': -113.54486846923828, 'logps/ref_rejected': -98.24201965332031, 'logits/chosen': -7.794681072235107, 'logits/rejected': -7.093453407287598, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.011003012768924236, 'kl/avg_steps': 0.53125, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▏ | 429/681 [32:42<13:31, 3.22s/it] 63%|███████████████████████████████████████████████████████████████████████▎ | 430/681 [32:45<13:34, 3.24s/it] {'loss': 1.0248, 'grad_norm': 12.785799980163574, 'learning_rate': 1.8158425248197928e-07, 'rewards/chosen': -1.361114501953125, 'rewards/rejected': -1.9355595111846924, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.5744451284408569, 'logps/chosen': -216.0718994140625, 'logps/rejected': -288.0528564453125, 'logps/ref_chosen': -91.31936645507812, 'logps/ref_rejected': -110.1096420288086, 
'logits/chosen': -7.379518032073975, 'logits/rejected': -7.428651809692383, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.010944867506623268, 'kl/avg_steps': 0.4375, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▎ | 430/681 [32:45<13:34, 3.24s/it] 63%|███████████████████████████████████████████████████████████████████████▌ | 431/681 [32:48<13:23, 3.22s/it] {'loss': 0.9558, 'grad_norm': 11.655879974365234, 'learning_rate': 1.8035081928995788e-07, 'rewards/chosen': -1.3294377326965332, 'rewards/rejected': -2.0129597187042236, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.6835219860076904, 'logps/chosen': -215.51998901367188, 'logps/rejected': -283.9945983886719, 'logps/ref_chosen': -93.18122100830078, 'logps/ref_rejected': -98.13226318359375, 'logits/chosen': -7.7428998947143555, 'logits/rejected': -7.222956657409668, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.010897193104028702, 'kl/avg_steps': 0.5, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▌ | 431/681 [32:48<13:23, 3.22s/it] 63%|███████████████████████████████████████████████████████████████████████▋ | 432/681 [32:52<13:32, 3.26s/it] {'loss': 0.8684, 'grad_norm': 10.997950553894043, 'learning_rate': 1.791192214186223e-07, 'rewards/chosen': -1.2002313137054443, 'rewards/rejected': -2.012035846710205, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.8118045330047607, 'logps/chosen': -215.7032928466797, 'logps/rejected': -292.071533203125, 'logps/ref_chosen': -104.43478393554688, 'logps/ref_rejected': -105.08955383300781, 'logits/chosen': -7.891312122344971, 'logits/rejected': -7.409452438354492, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.010842978022992611, 'kl/avg_steps': 0.625, 'epoch': 0.63} 63%|███████████████████████████████████████████████████████████████████████▋ | 432/681 [32:52<13:32, 3.26s/it] 
64%|███████████████████████████████████████████████████████████████████████▊ | 433/681 [32:55<13:17, 3.22s/it] {'loss': 1.001, 'grad_norm': 11.87765884399414, 'learning_rate': 1.7788949132172193e-07, 'rewards/chosen': -1.4064010381698608, 'rewards/rejected': -2.044281482696533, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.6378804445266724, 'logps/chosen': -220.77711486816406, 'logps/rejected': -292.6389465332031, 'logps/ref_chosen': -89.84322357177734, 'logps/ref_rejected': -101.73345947265625, 'logits/chosen': -7.532526969909668, 'logits/rejected': -7.0036725997924805, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.010775630362331867, 'kl/avg_steps': 0.46875, 'epoch': 0.64} 64%|███████████████████████████████████████████████████████████████████████▊ | 433/681 [32:55<13:17, 3.22s/it] 64%|████████████████████████████████████████████████████████████████████████ | 434/681 [32:58<13:11, 3.20s/it] {'loss': 0.984, 'grad_norm': 11.362030982971191, 'learning_rate': 1.7666166140378853e-07, 'rewards/chosen': -1.3009238243103027, 'rewards/rejected': -1.9657368659973145, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.6648129820823669, 'logps/chosen': -219.51129150390625, 'logps/rejected': -268.685302734375, 'logps/ref_chosen': -97.6925277709961, 'logps/ref_rejected': -84.09130096435547, 'logits/chosen': -7.701730728149414, 'logits/rejected': -7.350008010864258, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.0107253547757864, 'kl/avg_steps': 0.5, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████ | 434/681 [32:58<13:11, 3.20s/it] 64%|████████████████████████████████████████████████████████████████████████▏ | 435/681 [33:01<12:40, 3.09s/it] {'loss': 0.9314, 'grad_norm': 11.840472221374512, 'learning_rate': 1.7543576401928218e-07, 'rewards/chosen': -1.2436943054199219, 'rewards/rejected': -1.9595654010772705, 'rewards/accuracies': 0.828125, 'rewards/margins': 
0.7158711552619934, 'logps/chosen': -203.26406860351562, 'logps/rejected': -278.71820068359375, 'logps/ref_chosen': -86.17192077636719, 'logps/ref_rejected': -93.751708984375, 'logits/chosen': -7.7169880867004395, 'logits/rejected': -7.244067192077637, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.010671994648873806, 'kl/avg_steps': 0.65625, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▏ | 435/681 [33:01<12:40, 3.09s/it] 64%|████████████████████████████████████████████████████████████████████████▎ | 436/681 [33:04<12:43, 3.12s/it] {'loss': 0.8215, 'grad_norm': 11.141986846923828, 'learning_rate': 1.742118314717391e-07, 'rewards/chosen': -1.0024278163909912, 'rewards/rejected': -1.8351895809173584, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.8327617645263672, 'logps/chosen': -200.92079162597656, 'logps/rejected': -263.154296875, 'logps/ref_chosen': -105.78710174560547, 'logps/ref_rejected': -88.62471008300781, 'logits/chosen': -8.073395729064941, 'logits/rejected': -6.879586219787598, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.010602416470646858, 'kl/avg_steps': 0.78125, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▎ | 436/681 [33:04<12:43, 3.12s/it] 64%|████████████████████████████████████████████████████████████████████████▌ | 437/681 [33:07<12:52, 3.17s/it] {'loss': 0.9969, 'grad_norm': 12.68950366973877, 'learning_rate': 1.7298989601292036e-07, 'rewards/chosen': -1.2023859024047852, 'rewards/rejected': -1.770288109779358, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5679023265838623, 'logps/chosen': -210.84364318847656, 'logps/rejected': -258.4656677246094, 'logps/ref_chosen': -96.06204223632812, 'logps/ref_rejected': -89.01220703125, 'logits/chosen': -7.771686553955078, 'logits/rejected': -7.039131164550781, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 
0.010520227253437042, 'kl/avg_steps': 0.53125, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▌ | 437/681 [33:07<12:52, 3.17s/it] 64%|████████████████████████████████████████████████████████████████████████▋ | 438/681 [33:10<12:37, 3.12s/it] {'loss': 0.949, 'grad_norm': 11.616890907287598, 'learning_rate': 1.7176998984196144e-07, 'rewards/chosen': -1.149721622467041, 'rewards/rejected': -1.8044536113739014, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.6547319889068604, 'logps/chosen': -212.05523681640625, 'logps/rejected': -262.99102783203125, 'logps/ref_chosen': -101.85537719726562, 'logps/ref_rejected': -89.4476547241211, 'logits/chosen': -7.707596302032471, 'logits/rejected': -7.402861595153809, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.01046463381499052, 'kl/avg_steps': 0.53125, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▋ | 438/681 [33:10<12:37, 3.12s/it] 64%|████████████████████████████████████████████████████████████████████████▊ | 439/681 [33:13<12:19, 3.06s/it] {'loss': 0.9918, 'grad_norm': 12.508461952209473, 'learning_rate': 1.7055214510452458e-07, 'rewards/chosen': -1.2794837951660156, 'rewards/rejected': -1.8945040702819824, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6150201559066772, 'logps/chosen': -205.33200073242188, 'logps/rejected': -273.9890441894531, 'logps/ref_chosen': -81.75563049316406, 'logps/ref_rejected': -90.58635711669922, 'logits/chosen': -7.787407875061035, 'logits/rejected': -7.194408416748047, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.010409334674477577, 'kl/avg_steps': 0.625, 'epoch': 0.64} 64%|████████████████████████████████████████████████████████████████████████▊ | 439/681 [33:13<12:19, 3.06s/it] 65%|█████████████████████████████████████████████████████████████████████████ | 440/681 [33:16<12:05, 3.01s/it] {'loss': 0.923, 'grad_norm': 10.161545753479004, 
'learning_rate': 1.6933639389195134e-07, 'rewards/chosen': -1.1412960290908813, 'rewards/rejected': -1.8606973886489868, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.7194013595581055, 'logps/chosen': -216.48208618164062, 'logps/rejected': -284.5960388183594, 'logps/ref_chosen': -105.64108276367188, 'logps/ref_rejected': -103.40100860595703, 'logits/chosen': -7.840778827667236, 'logits/rejected': -7.435771465301514, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.010344680398702621, 'kl/avg_steps': 0.625, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████ | 440/681 [33:16<12:05, 3.01s/it] 65%|█████████████████████████████████████████████████████████████████████████▏ | 441/681 [33:19<12:23, 3.10s/it] {'loss': 0.8764, 'grad_norm': 10.330644607543945, 'learning_rate': 1.681227682404166e-07, 'rewards/chosen': -1.3216805458068848, 'rewards/rejected': -2.063693046569824, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.7420125007629395, 'logps/chosen': -221.08547973632812, 'logps/rejected': -306.0940856933594, 'logps/ref_chosen': -91.529541015625, 'logps/ref_rejected': -103.619384765625, 'logits/chosen': -7.877140522003174, 'logits/rejected': -6.723335266113281, 'kl/p_epsilon_steps': 0.921875, 'kl/n_epsilon_steps': 0.078125, 'kl/beta': 0.010280427522957325, 'kl/avg_steps': 0.84375, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▏ | 441/681 [33:19<12:23, 3.10s/it] 65%|█████████████████████████████████████████████████████████████████████████▎ | 442/681 [33:22<12:12, 3.07s/it] {'loss': 0.8427, 'grad_norm': 11.554139137268066, 'learning_rate': 1.669113001300851e-07, 'rewards/chosen': -1.3306344747543335, 'rewards/rejected': -2.1431615352630615, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.8125271797180176, 'logps/chosen': -216.06195068359375, 'logps/rejected': -295.6746826171875, 'logps/ref_chosen': -84.77755737304688, 'logps/ref_rejected': 
-83.82415008544922, 'logits/chosen': -7.492166996002197, 'logits/rejected': -6.9776811599731445, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.010194412432610989, 'kl/avg_steps': 0.6875, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▎ | 442/681 [33:22<12:12, 3.07s/it] 65%|█████████████████████████████████████████████████████████████████████████▌ | 443/681 [33:26<12:27, 3.14s/it] {'loss': 0.9816, 'grad_norm': 11.208428382873535, 'learning_rate': 1.6570202148426815e-07, 'rewards/chosen': -1.2582169771194458, 'rewards/rejected': -1.8541338443756104, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.5959169864654541, 'logps/chosen': -227.578125, 'logps/rejected': -277.5597229003906, 'logps/ref_chosen': -102.64927673339844, 'logps/ref_rejected': -93.03807067871094, 'logits/chosen': -7.71527099609375, 'logits/rejected': -7.359766960144043, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.010124804452061653, 'kl/avg_steps': 0.625, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▌ | 443/681 [33:26<12:27, 3.14s/it] 65%|█████████████████████████████████████████████████████████████████████████▋ | 444/681 [33:29<12:14, 3.10s/it] {'loss': 1.0177, 'grad_norm': 12.04550552368164, 'learning_rate': 1.6449496416858282e-07, 'rewards/chosen': -1.34379243850708, 'rewards/rejected': -1.9272189140319824, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.5834265947341919, 'logps/chosen': -221.87020874023438, 'logps/rejected': -295.9997253417969, 'logps/ref_chosen': -87.91971588134766, 'logps/ref_rejected': -103.32345581054688, 'logits/chosen': -7.471601963043213, 'logits/rejected': -7.507413387298584, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.010061916895210743, 'kl/avg_steps': 0.4375, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▋ | 444/681 [33:29<12:14, 3.10s/it] 
65%|█████████████████████████████████████████████████████████████████████████▊ | 445/681 [33:32<12:10, 3.10s/it] {'loss': 0.925, 'grad_norm': 10.32401180267334, 'learning_rate': 1.6329015999011182e-07, 'rewards/chosen': -1.1529053449630737, 'rewards/rejected': -1.8247992992401123, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6718940138816833, 'logps/chosen': -216.97213745117188, 'logps/rejected': -282.439208984375, 'logps/ref_chosen': -101.40087127685547, 'logps/ref_rejected': -99.03790283203125, 'logits/chosen': -7.7422380447387695, 'logits/rejected': -7.081811904907227, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.010018087923526764, 'kl/avg_steps': 0.59375, 'epoch': 0.65} 65%|█████████████████████████████████████████████████████████████████████████▊ | 445/681 [33:32<12:10, 3.10s/it] 65%|██████████████████████████████████████████████████████████████████████████ | 446/681 [33:35<12:21, 3.16s/it] {'loss': 0.9615, 'grad_norm': 12.130196571350098, 'learning_rate': 1.6208764069656578e-07, 'rewards/chosen': -1.1098341941833496, 'rewards/rejected': -1.7301799058914185, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6203456521034241, 'logps/chosen': -199.480224609375, 'logps/rejected': -281.7625732421875, 'logps/ref_chosen': -87.42234802246094, 'logps/ref_rejected': -106.70075988769531, 'logits/chosen': -7.809151649475098, 'logits/rejected': -7.551025390625, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.009958956390619278, 'kl/avg_steps': 0.5625, 'epoch': 0.65} 65%|██████████████████████████████████████████████████████████████████████████ | 446/681 [33:35<12:21, 3.16s/it] 66%|██████████████████████████████████████████████████████████████████████████▏ | 447/681 [33:38<11:54, 3.06s/it] {'loss': 0.9079, 'grad_norm': 11.679533958435059, 'learning_rate': 1.608874379754465e-07, 'rewards/chosen': -1.1265549659729004, 'rewards/rejected': -1.8664308786392212, 'rewards/accuracies': 0.828125, 
'rewards/margins': 0.7398759126663208, 'logps/chosen': -197.861083984375, 'logps/rejected': -294.75396728515625, 'logps/ref_chosen': -83.6152572631836, 'logps/ref_rejected': -104.91239929199219, 'logits/chosen': -7.548259735107422, 'logits/rejected': -7.2763800621032715, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.009903251193463802, 'kl/avg_steps': 0.65625, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▏ | 447/681 [33:38<11:54, 3.06s/it] 66%|██████████████████████████████████████████████████████████████████████████▎ | 448/681 [33:41<11:53, 3.06s/it] {'loss': 0.9455, 'grad_norm': 11.890493392944336, 'learning_rate': 1.5968958345321177e-07, 'rewards/chosen': -1.1929302215576172, 'rewards/rejected': -1.929168939590454, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.7362387180328369, 'logps/chosen': -214.25633239746094, 'logps/rejected': -305.142578125, 'logps/ref_chosen': -92.5757827758789, 'logps/ref_rejected': -107.68977355957031, 'logits/chosen': -7.886592864990234, 'logits/rejected': -7.5247802734375, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.00983868446201086, 'kl/avg_steps': 0.5625, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▎ | 448/681 [33:41<11:53, 3.06s/it] 66%|██████████████████████████████████████████████████████████████████████████▌ | 449/681 [33:44<12:06, 3.13s/it] {'loss': 0.9078, 'grad_norm': 10.460795402526855, 'learning_rate': 1.584941086944423e-07, 'rewards/chosen': -1.1828500032424927, 'rewards/rejected': -1.9290869235992432, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.7462369799613953, 'logps/chosen': -223.71267700195312, 'logps/rejected': -293.65252685546875, 'logps/ref_chosen': -102.39893341064453, 'logps/ref_rejected': -95.14886474609375, 'logits/chosen': -7.6187238693237305, 'logits/rejected': -7.113336086273193, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 
0.234375, 'kl/beta': 0.009783651679754257, 'kl/avg_steps': 0.53125, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▌ | 449/681 [33:44<12:06, 3.13s/it] 66%|██████████████████████████████████████████████████████████████████████████▋ | 450/681 [33:47<12:07, 3.15s/it] {'loss': 0.9059, 'grad_norm': 11.509246826171875, 'learning_rate': 1.573010452010098e-07, 'rewards/chosen': -1.0168712139129639, 'rewards/rejected': -1.7526460886001587, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.7357749938964844, 'logps/chosen': -191.9837646484375, 'logps/rejected': -290.04888916015625, 'logps/ref_chosen': -86.99285888671875, 'logps/ref_rejected': -108.53203582763672, 'logits/chosen': -8.000127792358398, 'logits/rejected': -7.75508451461792, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.009731950238347054, 'kl/avg_steps': 0.71875, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▋ | 450/681 [33:47<12:07, 3.15s/it] 66%|██████████████████████████████████████████████████████████████████████████▊ | 451/681 [33:50<11:32, 3.01s/it] {'loss': 1.0328, 'grad_norm': 16.418062210083008, 'learning_rate': 1.5611042441124687e-07, 'rewards/chosen': -1.157651424407959, 'rewards/rejected': -1.7588902711868286, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.6012389659881592, 'logps/chosen': -206.88548278808594, 'logps/rejected': -262.9779357910156, 'logps/ref_chosen': -86.81128692626953, 'logps/ref_rejected': -79.8555908203125, 'logits/chosen': -7.632309913635254, 'logits/rejected': -7.299648761749268, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.009662501513957977, 'kl/avg_steps': 0.46875, 'epoch': 0.66} 66%|██████████████████████████████████████████████████████████████████████████▊ | 451/681 [33:50<11:32, 3.01s/it] 66%|███████████████████████████████████████████████████████████████████████████ | 452/681 [33:53<11:30, 3.01s/it] {'loss': 
0.9212, 'grad_norm': 10.160775184631348, 'learning_rate': 1.549222776991186e-07, 'rewards/chosen': -1.0104700326919556, 'rewards/rejected': -1.6558433771133423, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.6453733444213867, 'logps/chosen': -185.05380249023438, 'logps/rejected': -277.22369384765625, 'logps/ref_chosen': -79.379638671875, 'logps/ref_rejected': -103.71539306640625, 'logits/chosen': -7.714634895324707, 'logits/rejected': -7.272608757019043, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.00961741991341114, 'kl/avg_steps': 0.6875, 'epoch': 0.66} 66%|███████████████████████████████████████████████████████████████████████████ | 452/681 [33:53<11:30, 3.01s/it] 67%|███████████████████████████████████████████████████████████████████████████▏ | 453/681 [33:56<11:22, 2.99s/it] {'loss': 0.8581, 'grad_norm': 10.804041862487793, 'learning_rate': 1.5373663637339584e-07, 'rewards/chosen': -1.191428303718567, 'rewards/rejected': -1.964081883430481, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7726534605026245, 'logps/chosen': -213.16781616210938, 'logps/rejected': -297.2564697265625, 'logps/ref_chosen': -87.6951904296875, 'logps/ref_rejected': -90.0582275390625, 'logits/chosen': -7.6690826416015625, 'logits/rejected': -7.041682243347168, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.009551751427352428, 'kl/avg_steps': 0.6875, 'epoch': 0.67} 67%|███████████████████████████████████████████████████████████████████████████▏ | 453/681 [33:56<11:22, 2.99s/it] 67%|███████████████████████████████████████████████████████████████████████████▎ | 454/681 [33:59<11:18, 2.99s/it] {'loss': 1.0254, 'grad_norm': 11.586421966552734, 'learning_rate': 1.5255353167683017e-07, 'rewards/chosen': -1.273888349533081, 'rewards/rejected': -1.8924428224563599, 'rewards/accuracies': 0.75, 'rewards/margins': 0.6185543537139893, 'logps/chosen': -224.0977325439453, 'logps/rejected': -293.582275390625, 'logps/ref_chosen': 
-89.56623840332031, 'logps/ref_rejected': -92.92105102539062, 'logits/chosen': -7.6572160720825195, 'logits/rejected': -7.268322944641113, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.009486531838774681, 'kl/avg_steps': 0.46875, 'epoch': 0.67} 67%|███████████████████████████████████████████████████████████████████████████▎ | 454/681 [33:59<11:18, 2.99s/it] 67%|███████████████████████████████████████████████████████████████████████████▍ | 455/681 [34:02<11:19, 3.01s/it] {'loss': 0.9778, 'grad_norm': 10.893000602722168, 'learning_rate': 1.5137299478533064e-07, 'rewards/chosen': -1.15059232711792, 'rewards/rejected': -1.7739757299423218, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.6233835220336914, 'logps/chosen': -199.83328247070312, 'logps/rejected': -308.0276184082031, 'logps/ref_chosen': -77.6299819946289, 'logps/ref_rejected': -118.97795104980469, 'logits/chosen': -7.797979354858398, 'logits/rejected': -7.566769599914551, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.00944227073341608, 'kl/avg_steps': 0.5, 'epoch': 0.67} 67%|███████████████████████████████████████████████████████████████████████████▍ | 455/681 [34:02<11:19, 3.01s/it] 67%|███████████████████████████████████████████████████████████████████████████▋ | 456/681 [34:05<11:15, 3.00s/it] {'loss': 0.9811, 'grad_norm': 12.390460014343262, 'learning_rate': 1.5019505680714232e-07, 'rewards/chosen': -1.1745624542236328, 'rewards/rejected': -1.7645561695098877, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5899935960769653, 'logps/chosen': -215.07440185546875, 'logps/rejected': -298.5582580566406, 'logps/ref_chosen': -89.61686706542969, 'logps/ref_rejected': -109.5597152709961, 'logits/chosen': -7.607115745544434, 'logits/rejected': -7.611271858215332, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.009395293891429901, 'kl/avg_steps': 0.5625, 'epoch': 0.67} 
67%|███████████████████████████████████████████████████████████████████████████▋ | 456/681 [34:05<11:15, 3.00s/it] 67%|███████████████████████████████████████████████████████████████████████████▊ | 457/681 [34:08<11:32, 3.09s/it] {'loss': 0.842, 'grad_norm': 11.532391548156738, 'learning_rate': 1.4901974878202627e-07, 'rewards/chosen': -0.9881317615509033, 'rewards/rejected': -1.7533208131790161, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.765188992023468, 'logps/chosen': -193.81298828125, 'logps/rejected': -280.0379638671875, 'logps/ref_chosen': -87.32168579101562, 'logps/ref_rejected': -90.76660919189453, 'logits/chosen': -7.829406261444092, 'logits/rejected': -7.343839168548584, 'kl/p_epsilon_steps': 0.90625, 'kl/n_epsilon_steps': 0.09375, 'kl/beta': 0.009342741221189499, 'kl/avg_steps': 0.8125, 'epoch': 0.67} 67%|███████████████████████████████████████████████████████████████████████████▊ | 457/681 [34:08<11:32, 3.09s/it] 67%|███████████████████████████████████████████████████████████████████████████▉ | 458/681 [34:11<11:27, 3.08s/it] {'loss': 1.024, 'grad_norm': 11.981277465820312, 'learning_rate': 1.4784710168044212e-07, 'rewards/chosen': -1.1065789461135864, 'rewards/rejected': -1.687969446182251, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.581390380859375, 'logps/chosen': -213.52423095703125, 'logps/rejected': -286.9295654296875, 'logps/ref_chosen': -93.52044677734375, 'logps/ref_rejected': -103.36898803710938, 'logits/chosen': -7.395938873291016, 'logits/rejected': -7.573344707489014, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.009267443791031837, 'kl/avg_steps': 0.59375, 'epoch': 0.67} 67%|███████████████████████████████████████████████████████████████████████████▉ | 458/681 [34:11<11:27, 3.08s/it] 67%|████████████████████████████████████████████████████████████████████████████▏ | 459/681 [34:14<11:25, 3.09s/it] {'loss': 0.9252, 'grad_norm': 9.090399742126465, 'learning_rate': 1.466771464027316e-07, 
'rewards/chosen': -1.05517578125, 'rewards/rejected': -1.7575633525848389, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.7023875713348389, 'logps/chosen': -190.7766571044922, 'logps/rejected': -284.36572265625, 'logps/ref_chosen': -75.68820190429688, 'logps/ref_rejected': -92.17048645019531, 'logits/chosen': -7.5142059326171875, 'logits/rejected': -6.964491844177246, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.009212742559611797, 'kl/avg_steps': 0.65625, 'epoch': 0.67} 67%|████████████████████████████████████████████████████████████████████████████▏ | 459/681 [34:14<11:25, 3.09s/it] 68%|████████████████████████████████████████████████████████████████████████████▎ | 460/681 [34:18<11:29, 3.12s/it] {'loss': 0.9702, 'grad_norm': 12.827125549316406, 'learning_rate': 1.4550991377830423e-07, 'rewards/chosen': -1.0923173427581787, 'rewards/rejected': -1.7237976789474487, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6314802765846252, 'logps/chosen': -201.06686401367188, 'logps/rejected': -300.1006164550781, 'logps/ref_chosen': -81.11788940429688, 'logps/ref_rejected': -110.31238555908203, 'logits/chosen': -7.7618255615234375, 'logits/rejected': -7.418609619140625, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.009152678772807121, 'kl/avg_steps': 0.65625, 'epoch': 0.68} 68%|████████████████████████████████████████████████████████████████████████████▎ | 460/681 [34:18<11:29, 3.12s/it] 68%|████████████████████████████████████████████████████████████████████████████▍ | 461/681 [34:21<11:38, 3.18s/it] {'loss': 1.0504, 'grad_norm': 11.000265121459961, 'learning_rate': 1.4434543456482518e-07, 'rewards/chosen': -1.3213638067245483, 'rewards/rejected': -1.819370985031128, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.4980071187019348, 'logps/chosen': -227.41180419921875, 'logps/rejected': -295.16571044921875, 'logps/ref_chosen': -81.58352661132812, 'logps/ref_rejected': -93.87710571289062, 
'logits/chosen': -7.253688335418701, 'logits/rejected': -6.825348854064941, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.009093005210161209, 'kl/avg_steps': 0.46875, 'epoch': 0.68} 68%|████████████████████████████████████████████████████████████████████████████▍ | 461/681 [34:21<11:38, 3.18s/it] 68%|████████████████████████████████████████████████████████████████████████████▋ | 462/681 [34:24<11:21, 3.11s/it] {'loss': 0.9154, 'grad_norm': 10.546554565429688, 'learning_rate': 1.4318373944740484e-07, 'rewards/chosen': -1.00137460231781, 'rewards/rejected': -1.675006628036499, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.6736320853233337, 'logps/chosen': -205.47802734375, 'logps/rejected': -272.15692138671875, 'logps/ref_chosen': -94.19855499267578, 'logps/ref_rejected': -85.63162994384766, 'logits/chosen': -7.895207405090332, 'logits/rejected': -7.166084289550781, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.00905058067291975, 'kl/avg_steps': 0.71875, 'epoch': 0.68} 68%|████████████████████████████████████████████████████████████████████████████▋ | 462/681 [34:24<11:21, 3.11s/it] 68%|████████████████████████████████████████████████████████████████████████████▊ | 463/681 [34:27<11:07, 3.06s/it] {'loss': 0.8797, 'grad_norm': 12.333318710327148, 'learning_rate': 1.4202485903778976e-07, 'rewards/chosen': -0.9655375480651855, 'rewards/rejected': -1.7415142059326172, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7759765386581421, 'logps/chosen': -193.8429718017578, 'logps/rejected': -292.2165222167969, 'logps/ref_chosen': -85.92474365234375, 'logps/ref_rejected': -96.90184020996094, 'logits/chosen': -7.807669639587402, 'logits/rejected': -7.118095397949219, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.008985994383692741, 'kl/avg_steps': 0.65625, 'epoch': 0.68} 68%|████████████████████████████████████████████████████████████████████████████▊ | 463/681 [34:27<11:07, 
3.06s/it] 68%|████████████████████████████████████████████████████████████████████████████▉ | 464/681 [34:30<10:55, 3.02s/it] {'loss': 0.9686, 'grad_norm': 14.331416130065918, 'learning_rate': 1.4086882387355658e-07, 'rewards/chosen': -1.1799356937408447, 'rewards/rejected': -1.8003066778182983, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6203708648681641, 'logps/chosen': -212.67556762695312, 'logps/rejected': -310.6097717285156, 'logps/ref_chosen': -79.68920135498047, 'logps/ref_rejected': -107.29232025146484, 'logits/chosen': -7.688302993774414, 'logits/rejected': -7.788723945617676, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.008927407674491405, 'kl/avg_steps': 0.71875, 'epoch': 0.68} 68%|████████████████████████████████████████████████████████████████████████████▉ | 464/681 [34:30<10:55, 3.02s/it] 68%|█████████████████████████████████████████████████████████████████████████████▏ | 465/681 [34:33<11:11, 3.11s/it] {'loss': 0.8775, 'grad_norm': 11.430578231811523, 'learning_rate': 1.3971566441730714e-07, 'rewards/chosen': -1.073095440864563, 'rewards/rejected': -1.8162384033203125, 'rewards/accuracies': 0.875, 'rewards/margins': 0.7431429624557495, 'logps/chosen': -213.5987548828125, 'logps/rejected': -325.18511962890625, 'logps/ref_chosen': -91.8602294921875, 'logps/ref_rejected': -118.71000671386719, 'logits/chosen': -7.3866472244262695, 'logits/rejected': -7.172554016113281, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.008863699622452259, 'kl/avg_steps': 0.71875, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████▏ | 465/681 [34:33<11:11, 3.11s/it] 68%|█████████████████████████████████████████████████████████████████████████████▎ | 466/681 [34:36<11:08, 3.11s/it] {'loss': 0.8759, 'grad_norm': 8.964788436889648, 'learning_rate': 1.3856541105586545e-07, 'rewards/chosen': -0.9667361974716187, 'rewards/rejected': -1.7277427911758423, 
'rewards/accuracies': 0.859375, 'rewards/margins': 0.7610065937042236, 'logps/chosen': -195.26803588867188, 'logps/rejected': -294.0559387207031, 'logps/ref_chosen': -84.70140075683594, 'logps/ref_rejected': -96.05084228515625, 'logits/chosen': -7.547239780426025, 'logits/rejected': -7.197183609008789, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.0088004469871521, 'kl/avg_steps': 0.71875, 'epoch': 0.68} 68%|█████████████████████████████████████████████████████████████████████████████▎ | 466/681 [34:36<11:08, 3.11s/it] 69%|█████████████████████████████████████████████████████████████████████████████▍ | 467/681 [34:39<10:59, 3.08s/it] {'loss': 0.9698, 'grad_norm': 11.361374855041504, 'learning_rate': 1.3741809409947729e-07, 'rewards/chosen': -1.1772236824035645, 'rewards/rejected': -1.8375762701034546, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6603525876998901, 'logps/chosen': -244.66354370117188, 'logps/rejected': -320.74517822265625, 'logps/ref_chosen': -109.29832458496094, 'logps/ref_rejected': -108.8436508178711, 'logits/chosen': -8.130990028381348, 'logits/rejected': -7.5615644454956055, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.008737645111978054, 'kl/avg_steps': 0.5625, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████████████████████████████▍ | 467/681 [34:39<10:59, 3.08s/it] 69%|█████████████████████████████████████████████████████████████████████████████▋ | 468/681 [34:42<11:06, 3.13s/it] {'loss': 1.0436, 'grad_norm': 11.03893756866455, 'learning_rate': 1.362737437810114e-07, 'rewards/chosen': -0.9716976881027222, 'rewards/rejected': -1.518856167793274, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.5471584796905518, 'logps/chosen': -210.38668823242188, 'logps/rejected': -282.5191345214844, 'logps/ref_chosen': -98.32164764404297, 'logps/ref_rejected': -106.68048095703125, 'logits/chosen': -7.599516868591309, 'logits/rejected': -7.804562091827393, 
'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.008688771165907383, 'kl/avg_steps': 0.4375, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████████████████████████████▋ | 468/681 [34:42<11:06, 3.13s/it] 69%|█████████████████████████████████████████████████████████████████████████████▊ | 469/681 [34:46<11:11, 3.17s/it] {'loss': 0.9145, 'grad_norm': 12.710182189941406, 'learning_rate': 1.351323902551631e-07, 'rewards/chosen': -1.1398154497146606, 'rewards/rejected': -1.8454160690307617, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.7056005597114563, 'logps/chosen': -229.18951416015625, 'logps/rejected': -324.5375061035156, 'logps/ref_chosen': -96.76420593261719, 'logps/ref_rejected': -109.59500885009766, 'logits/chosen': -7.760175704956055, 'logits/rejected': -7.189078330993652, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.008650923147797585, 'kl/avg_steps': 0.6875, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████████████████████████████▊ | 469/681 [34:46<11:11, 3.17s/it] 69%|█████████████████████████████████████████████████████████████████████████████▉ | 470/681 [34:49<10:46, 3.06s/it] {'loss': 0.9357, 'grad_norm': 11.018048286437988, 'learning_rate': 1.339940635976592e-07, 'rewards/chosen': -1.0692801475524902, 'rewards/rejected': -1.7161200046539307, 'rewards/accuracies': 0.875, 'rewards/margins': 0.6468397378921509, 'logps/chosen': -208.847900390625, 'logps/rejected': -289.9870300292969, 'logps/ref_chosen': -83.49665832519531, 'logps/ref_rejected': -88.48578643798828, 'logits/chosen': -7.906304359436035, 'logits/rejected': -7.4226555824279785, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.008591854013502598, 'kl/avg_steps': 0.78125, 'epoch': 0.69} 69%|█████████████████████████████████████████████████████████████████████████████▉ | 470/681 [34:49<10:46, 3.06s/it] 
69%|██████████████████████████████████████████████████████████████████████████████▏ | 471/681 [34:51<10:28, 2.99s/it] {'loss': 0.9497, 'grad_norm': 10.446932792663574, 'learning_rate': 1.3285879380446563e-07, 'rewards/chosen': -1.2270324230194092, 'rewards/rejected': -1.8406031131744385, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.6135708093643188, 'logps/chosen': -233.13998413085938, 'logps/rejected': -307.9644470214844, 'logps/ref_chosen': -88.47430419921875, 'logps/ref_rejected': -90.48171997070312, 'logits/chosen': -7.087899208068848, 'logits/rejected': -6.836267948150635, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.008525250479578972, 'kl/avg_steps': 0.6875, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▏ | 471/681 [34:51<10:28, 2.99s/it] 69%|██████████████████████████████████████████████████████████████████████████████▎ | 472/681 [34:55<10:58, 3.15s/it] {'loss': 0.8455, 'grad_norm': 9.13286304473877, 'learning_rate': 1.317266107909975e-07, 'rewards/chosen': -1.0882880687713623, 'rewards/rejected': -1.8462860584259033, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.757997989654541, 'logps/chosen': -254.59384155273438, 'logps/rejected': -340.85113525390625, 'logps/ref_chosen': -125.23369598388672, 'logps/ref_rejected': -121.05349731445312, 'logits/chosen': -7.951723098754883, 'logits/rejected': -7.220186233520508, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.008467039093375206, 'kl/avg_steps': 0.71875, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▎ | 472/681 [34:55<10:58, 3.15s/it] 69%|██████████████████████████████████████████████████████████████████████████████▍ | 473/681 [34:58<11:09, 3.22s/it] {'loss': 1.0855, 'grad_norm': 11.645308494567871, 'learning_rate': 1.3059754439133002e-07, 'rewards/chosen': -1.2393798828125, 'rewards/rejected': -1.711702585220337, 'rewards/accuracies': 
0.71875, 'rewards/margins': 0.4723225235939026, 'logps/chosen': -243.3128662109375, 'logps/rejected': -292.8554992675781, 'logps/ref_chosen': -95.61137390136719, 'logps/ref_rejected': -88.15115356445312, 'logits/chosen': -7.586986064910889, 'logits/rejected': -7.300940990447998, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.008406616747379303, 'kl/avg_steps': 0.40625, 'epoch': 0.69} 69%|██████████████████████████████████████████████████████████████████████████████▍ | 473/681 [34:58<11:09, 3.22s/it] 70%|██████████████████████████████████████████████████████████████████████████████▋ | 474/681 [35:01<11:08, 3.23s/it] {'loss': 1.1116, 'grad_norm': 12.829643249511719, 'learning_rate': 1.2947162435741277e-07, 'rewards/chosen': -1.2031913995742798, 'rewards/rejected': -1.6386675834655762, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.43547606468200684, 'logps/chosen': -225.4574737548828, 'logps/rejected': -293.2532958984375, 'logps/ref_chosen': -81.47975158691406, 'logps/ref_rejected': -96.46562957763672, 'logits/chosen': -7.622153282165527, 'logits/rejected': -7.345100402832031, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.008372602984309196, 'kl/avg_steps': 0.4375, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████████████████████████████▋ | 474/681 [35:02<11:08, 3.23s/it] 70%|██████████████████████████████████████████████████████████████████████████████▊ | 475/681 [35:04<10:47, 3.14s/it] {'loss': 0.9232, 'grad_norm': 9.961170196533203, 'learning_rate': 1.2834888035828596e-07, 'rewards/chosen': -0.8945390582084656, 'rewards/rejected': -1.5759094953536987, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6813703775405884, 'logps/chosen': -182.11282348632812, 'logps/rejected': -285.2503662109375, 'logps/ref_chosen': -74.19598388671875, 'logps/ref_rejected': -94.69242095947266, 'logits/chosen': -7.716796875, 'logits/rejected': -7.385658264160156, 'kl/p_epsilon_steps': 0.8125, 
'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.008336132392287254, 'kl/avg_steps': 0.625, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████████████████████████████▊ | 475/681 [35:04<10:47, 3.14s/it] 70%|██████████████████████████████████████████████████████████████████████████████▉ | 476/681 [35:07<10:36, 3.11s/it] {'loss': 0.9888, 'grad_norm': 12.75778579711914, 'learning_rate': 1.2722934197929802e-07, 'rewards/chosen': -1.067098617553711, 'rewards/rejected': -1.656817078590393, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.5897184610366821, 'logps/chosen': -201.364013671875, 'logps/rejected': -281.6759033203125, 'logps/ref_chosen': -71.97109985351562, 'logps/ref_rejected': -80.26224517822266, 'logits/chosen': -7.788543224334717, 'logits/rejected': -7.032910346984863, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.008284355513751507, 'kl/avg_steps': 0.59375, 'epoch': 0.7} 70%|██████████████████████████████████████████████████████████████████████████████▉ | 476/681 [35:07<10:36, 3.11s/it] 70%|███████████████████████████████████████████████████████████████████████████████▏ | 477/681 [35:11<10:38, 3.13s/it] {'loss': 0.9673, 'grad_norm': 10.86681079864502, 'learning_rate': 1.2611303872132631e-07, 'rewards/chosen': -1.0373156070709229, 'rewards/rejected': -1.6613343954086304, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.6240187883377075, 'logps/chosen': -231.5677032470703, 'logps/rejected': -285.14166259765625, 'logps/ref_chosen': -105.00555419921875, 'logps/ref_rejected': -81.87843322753906, 'logits/chosen': -7.685525894165039, 'logits/rejected': -7.131807327270508, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.008235457353293896, 'kl/avg_steps': 0.6875, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▏ | 477/681 [35:11<10:38, 3.13s/it] 70%|███████████████████████████████████████████████████████████████████████████████▎ | 478/681 
[35:14<10:41, 3.16s/it] {'loss': 0.9417, 'grad_norm': 9.899713516235352, 'learning_rate': 1.2500000000000005e-07, 'rewards/chosen': -0.9919052124023438, 'rewards/rejected': -1.6340248584747314, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.6421196460723877, 'logps/chosen': -198.8499755859375, 'logps/rejected': -291.9015808105469, 'logps/ref_chosen': -76.7882080078125, 'logps/ref_rejected': -90.43994140625, 'logits/chosen': -7.471836090087891, 'logits/rejected': -7.25750732421875, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.008179225027561188, 'kl/avg_steps': 0.75, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▎ | 478/681 [35:14<10:41, 3.16s/it] 70%|███████████████████████████████████████████████████████████████████████████████▍ | 479/681 [35:17<10:28, 3.11s/it] {'loss': 0.9958, 'grad_norm': 12.096019744873047, 'learning_rate': 1.2389025514492456e-07, 'rewards/chosen': -1.0483417510986328, 'rewards/rejected': -1.6027498245239258, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5544079542160034, 'logps/chosen': -211.1195526123047, 'logps/rejected': -299.9999694824219, 'logps/ref_chosen': -81.3623046875, 'logps/ref_rejected': -101.09114074707031, 'logits/chosen': -7.843846321105957, 'logits/rejected': -7.429350852966309, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.008118337951600552, 'kl/avg_steps': 0.625, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▍ | 479/681 [35:17<10:28, 3.11s/it] 70%|███████████████████████████████████████████████████████████████████████████████▋ | 480/681 [35:20<10:24, 3.11s/it] {'loss': 1.0276, 'grad_norm': 12.593810081481934, 'learning_rate': 1.227838333989088e-07, 'rewards/chosen': -1.1555697917938232, 'rewards/rejected': -1.6818903684616089, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.5263204574584961, 'logps/chosen': -240.56443786621094, 'logps/rejected': 
-296.2643127441406, 'logps/ref_chosen': -96.7739028930664, 'logps/ref_rejected': -86.40473937988281, 'logits/chosen': -8.081748962402344, 'logits/rejected': -7.16196346282959, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.008067913353443146, 'kl/avg_steps': 0.5, 'epoch': 0.7} 70%|███████████████████████████████████████████████████████████████████████████████▋ | 480/681 [35:20<10:24, 3.11s/it] 71%|███████████████████████████████████████████████████████████████████████████████▊ | 481/681 [35:23<10:13, 3.07s/it] {'loss': 0.99, 'grad_norm': 11.46150016784668, 'learning_rate': 1.2168076391719489e-07, 'rewards/chosen': -1.1354894638061523, 'rewards/rejected': -1.721273422241211, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.5857840180397034, 'logps/chosen': -233.6854705810547, 'logps/rejected': -314.57574462890625, 'logps/ref_chosen': -91.670166015625, 'logps/ref_rejected': -98.69490051269531, 'logits/chosen': -7.622033596038818, 'logits/rejected': -6.947890281677246, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.008027774281799793, 'kl/avg_steps': 0.59375, 'epoch': 0.71} 71%|███████████████████████████████████████████████████████████████████████████████▊ | 481/681 [35:23<10:13, 3.07s/it] 71%|███████████████████████████████████████████████████████████████████████████████▉ | 482/681 [35:26<10:15, 3.10s/it] {'loss': 0.9871, 'grad_norm': 11.263471603393555, 'learning_rate': 1.2058107576668938e-07, 'rewards/chosen': -1.0913074016571045, 'rewards/rejected': -1.6596999168395996, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5683927536010742, 'logps/chosen': -235.95123291015625, 'logps/rejected': -304.33599853515625, 'logps/ref_chosen': -98.52011108398438, 'logps/ref_rejected': -94.8294448852539, 'logits/chosen': -7.860917091369629, 'logits/rejected': -7.225180625915527, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.007980390451848507, 'kl/avg_steps': 0.59375, 'epoch': 0.71} 
71%|███████████████████████████████████████████████████████████████████████████████▉ | 482/681 [35:26<10:15, 3.10s/it] 71%|████████████████████████████████████████████████████████████████████████████████▏ | 483/681 [35:29<10:07, 3.07s/it] {'loss': 0.9526, 'grad_norm': 9.014310836791992, 'learning_rate': 1.194847979251979e-07, 'rewards/chosen': -1.1138007640838623, 'rewards/rejected': -1.7216558456420898, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.6078551411628723, 'logps/chosen': -248.36256408691406, 'logps/rejected': -319.85467529296875, 'logps/ref_chosen': -107.11860656738281, 'logps/ref_rejected': -101.11499786376953, 'logits/chosen': -8.008612632751465, 'logits/rejected': -7.4816975593566895, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.00793328694999218, 'kl/avg_steps': 0.75, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▏ | 483/681 [35:29<10:07, 3.07s/it] 71%|████████████████████████████████████████████████████████████████████████████████▎ | 484/681 [35:32<09:46, 2.98s/it] {'loss': 0.9462, 'grad_norm': 8.998906135559082, 'learning_rate': 1.1839195928066101e-07, 'rewards/chosen': -1.049012303352356, 'rewards/rejected': -1.6971848011016846, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.6481724381446838, 'logps/chosen': -220.87344360351562, 'logps/rejected': -307.90826416015625, 'logps/ref_chosen': -86.97991943359375, 'logps/ref_rejected': -90.72367095947266, 'logits/chosen': -7.730357646942139, 'logits/rejected': -6.921565055847168, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.007874229922890663, 'kl/avg_steps': 0.6875, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▎ | 484/681 [35:32<09:46, 2.98s/it] 71%|████████████████████████████████████████████████████████████████████████████████▍ | 485/681 [35:35<09:51, 3.02s/it] {'loss': 0.8722, 'grad_norm': 9.495361328125, 'learning_rate': 
1.1730258863039347e-07, 'rewards/chosen': -0.9427545070648193, 'rewards/rejected': -1.6760468482971191, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.733292281627655, 'logps/chosen': -215.27700805664062, 'logps/rejected': -324.5331726074219, 'logps/ref_chosen': -94.05874633789062, 'logps/ref_rejected': -108.56297302246094, 'logits/chosen': -7.978307247161865, 'logits/rejected': -7.098984241485596, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.007820464670658112, 'kl/avg_steps': 0.6875, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▍ | 485/681 [35:35<09:51, 3.02s/it] 71%|████████████████████████████████████████████████████████████████████████████████▋ | 486/681 [35:37<09:17, 2.86s/it] {'loss': 0.9519, 'grad_norm': 11.05057144165039, 'learning_rate': 1.1621671468032493e-07, 'rewards/chosen': -1.0858381986618042, 'rewards/rejected': -1.7294615507125854, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.643623411655426, 'logps/chosen': -233.9942626953125, 'logps/rejected': -322.165771484375, 'logps/ref_chosen': -93.74588012695312, 'logps/ref_rejected': -98.07064819335938, 'logits/chosen': -7.851730823516846, 'logits/rejected': -7.302485466003418, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.007767065893858671, 'kl/avg_steps': 0.53125, 'epoch': 0.71} 71%|████████████████████████████████████████████████████████████████████████████████▋ | 486/681 [35:38<09:17, 2.86s/it] 72%|████████████████████████████████████████████████████████████████████████████████▊ | 487/681 [35:41<09:38, 2.98s/it] {'loss': 0.9663, 'grad_norm': 10.726838111877441, 'learning_rate': 1.1513436604424378e-07, 'rewards/chosen': -1.047347068786621, 'rewards/rejected': -1.6354316473007202, 'rewards/accuracies': 0.875, 'rewards/margins': 0.5880845785140991, 'logps/chosen': -224.526123046875, 'logps/rejected': -311.92681884765625, 'logps/ref_chosen': -88.0335693359375, 
'logps/ref_rejected': -98.47209930419922, 'logits/chosen': -7.636082649230957, 'logits/rejected': -7.332340240478516, 'kl/p_epsilon_steps': 0.890625, 'kl/n_epsilon_steps': 0.109375, 'kl/beta': 0.0077260215766727924, 'kl/avg_steps': 0.78125, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████████████████████████████▊ | 487/681 [35:41<09:38, 2.98s/it] 72%|████████████████████████████████████████████████████████████████████████████████▉ | 488/681 [35:44<09:57, 3.10s/it] {'loss': 0.935, 'grad_norm': 11.892740249633789, 'learning_rate': 1.1405557124304335e-07, 'rewards/chosen': -0.9786466360092163, 'rewards/rejected': -1.5881898403167725, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6095430850982666, 'logps/chosen': -213.1976776123047, 'logps/rejected': -299.0603942871094, 'logps/ref_chosen': -84.78964233398438, 'logps/ref_rejected': -90.2734603881836, 'logits/chosen': -7.434564590454102, 'logits/rejected': -6.9452362060546875, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.007666129618883133, 'kl/avg_steps': 0.65625, 'epoch': 0.72} 72%|████████████████████████████████████████████████████████████████████████████████▉ | 488/681 [35:44<09:57, 3.10s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▏ | 489/681 [35:47<09:45, 3.05s/it] {'loss': 0.982, 'grad_norm': 9.741364479064941, 'learning_rate': 1.1298035870396985e-07, 'rewards/chosen': -1.0582023859024048, 'rewards/rejected': -1.6288890838623047, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.5706866979598999, 'logps/chosen': -229.92735290527344, 'logps/rejected': -301.63934326171875, 'logps/ref_chosen': -90.46929931640625, 'logps/ref_rejected': -86.39761352539062, 'logits/chosen': -7.642593860626221, 'logits/rejected': -7.163801670074463, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.007616148795932531, 'kl/avg_steps': 0.53125, 'epoch': 0.72} 
72%|█████████████████████████████████████████████████████████████████████████████████▏ | 489/681 [35:47<09:45, 3.05s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▎ | 490/681 [35:50<09:54, 3.11s/it] {'loss': 1.094, 'grad_norm': 10.637678146362305, 'learning_rate': 1.1190875675987355e-07, 'rewards/chosen': -1.1003369092941284, 'rewards/rejected': -1.590024471282959, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.48968762159347534, 'logps/chosen': -230.93482971191406, 'logps/rejected': -327.2318115234375, 'logps/ref_chosen': -85.32012939453125, 'logps/ref_rejected': -115.99385070800781, 'logits/chosen': -7.2537336349487305, 'logits/rejected': -7.196831226348877, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.0075759016908705235, 'kl/avg_steps': 0.453125, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▎ | 490/681 [35:50<09:54, 3.11s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▍ | 491/681 [35:53<09:40, 3.05s/it] {'loss': 1.0378, 'grad_norm': 9.847221374511719, 'learning_rate': 1.1084079364846241e-07, 'rewards/chosen': -1.0794801712036133, 'rewards/rejected': -1.5843286514282227, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.5048484802246094, 'logps/chosen': -229.64122009277344, 'logps/rejected': -291.98773193359375, 'logps/ref_chosen': -86.14351654052734, 'logps/ref_rejected': -80.67945861816406, 'logits/chosen': -7.813044548034668, 'logits/rejected': -7.093764305114746, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.00754172820597887, 'kl/avg_steps': 0.5, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▍ | 491/681 [35:53<09:40, 3.05s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▋ | 492/681 [35:56<09:43, 3.09s/it] {'loss': 1.151, 'grad_norm': 13.724920272827148, 
'learning_rate': 1.097764975115576e-07, 'rewards/chosen': -0.9921280145645142, 'rewards/rejected': -1.3964853286743164, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.40435731410980225, 'logps/chosen': -213.5839080810547, 'logps/rejected': -268.00067138671875, 'logps/ref_chosen': -81.10757446289062, 'logps/ref_rejected': -80.75199890136719, 'logits/chosen': -7.806938171386719, 'logits/rejected': -7.273170471191406, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.007504207547754049, 'kl/avg_steps': 0.46875, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▋ | 492/681 [35:56<09:43, 3.09s/it] 72%|█████████████████████████████████████████████████████████████████████████████████▊ | 493/681 [36:00<09:47, 3.13s/it] {'loss': 1.0192, 'grad_norm': 11.94544506072998, 'learning_rate': 1.0871589639435203e-07, 'rewards/chosen': -0.9897406697273254, 'rewards/rejected': -1.4871938228607178, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.49745315313339233, 'logps/chosen': -245.13128662109375, 'logps/rejected': -293.9502868652344, 'logps/ref_chosen': -112.20733642578125, 'logps/ref_rejected': -93.60719299316406, 'logits/chosen': -7.833661079406738, 'logits/rejected': -6.987953186035156, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.007469195406883955, 'kl/avg_steps': 0.5625, 'epoch': 0.72} 72%|█████████████████████████████████████████████████████████████████████████████████▊ | 493/681 [36:00<09:47, 3.13s/it] 73%|█████████████████████████████████████████████████████████████████████████████████▉ | 494/681 [36:02<09:19, 2.99s/it] {'loss': 0.9054, 'grad_norm': 10.316105842590332, 'learning_rate': 1.0765901824467166e-07, 'rewards/chosen': -0.940959632396698, 'rewards/rejected': -1.6217553615570068, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.6807957291603088, 'logps/chosen': -200.50823974609375, 'logps/rejected': -312.24517822265625, 'logps/ref_chosen': 
-73.11489868164062, 'logps/ref_rejected': -92.16300201416016, 'logits/chosen': -7.544450759887695, 'logits/rejected': -7.313390731811523, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.007427416276186705, 'kl/avg_steps': 0.71875, 'epoch': 0.73} 73%|█████████████████████████████████████████████████████████████████████████████████▉ | 494/681 [36:02<09:19, 2.99s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▏ | 495/681 [36:05<09:28, 3.06s/it] {'loss': 1.0093, 'grad_norm': 10.488041877746582, 'learning_rate': 1.0660589091223854e-07, 'rewards/chosen': -0.950130820274353, 'rewards/rejected': -1.5005515813827515, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.5504207611083984, 'logps/chosen': -228.8547821044922, 'logps/rejected': -302.8748779296875, 'logps/ref_chosen': -99.52032470703125, 'logps/ref_rejected': -97.93089294433594, 'logits/chosen': -7.922597408294678, 'logits/rejected': -7.498479843139648, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0073744128458201885, 'kl/avg_steps': 0.625, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▏ | 495/681 [36:06<09:28, 3.06s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▎ | 496/681 [36:09<09:46, 3.17s/it] {'loss': 1.0384, 'grad_norm': 10.726445198059082, 'learning_rate': 1.0555654214793722e-07, 'rewards/chosen': -1.039567232131958, 'rewards/rejected': -1.520376443862915, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.48080918192863464, 'logps/chosen': -250.06234741210938, 'logps/rejected': -301.3846740722656, 'logps/ref_chosen': -107.85675048828125, 'logps/ref_rejected': -92.77056121826172, 'logits/chosen': -7.784862995147705, 'logits/rejected': -7.454352378845215, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.007328609004616737, 'kl/avg_steps': 0.4375, 'epoch': 0.73} 
73%|██████████████████████████████████████████████████████████████████████████████████▎ | 496/681 [36:09<09:46, 3.17s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▍ | 497/681 [36:12<09:39, 3.15s/it] {'loss': 1.1655, 'grad_norm': 10.815900802612305, 'learning_rate': 1.0451099960308374e-07, 'rewards/chosen': -1.1472631692886353, 'rewards/rejected': -1.483197808265686, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.33593475818634033, 'logps/chosen': -249.58370971679688, 'logps/rejected': -286.0672607421875, 'logps/ref_chosen': -92.08322143554688, 'logps/ref_rejected': -81.79503631591797, 'logits/chosen': -7.584544658660889, 'logits/rejected': -7.411238670349121, 'kl/p_epsilon_steps': 0.65625, 'kl/n_epsilon_steps': 0.34375, 'kl/beta': 0.007296686060726643, 'kl/avg_steps': 0.3125, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▍ | 497/681 [36:12<09:39, 3.15s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▋ | 498/681 [36:15<09:33, 3.13s/it] {'loss': 0.9923, 'grad_norm': 10.606035232543945, 'learning_rate': 1.0346929082869641e-07, 'rewards/chosen': -0.9259578585624695, 'rewards/rejected': -1.4618239402770996, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.5358661413192749, 'logps/chosen': -226.19503784179688, 'logps/rejected': -293.2069396972656, 'logps/ref_chosen': -98.19436645507812, 'logps/ref_rejected': -90.68746185302734, 'logits/chosen': -7.672981262207031, 'logits/rejected': -7.160597801208496, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.007273954804986715, 'kl/avg_steps': 0.6875, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▋ | 498/681 [36:15<09:33, 3.13s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▊ | 499/681 [36:18<09:23, 3.10s/it] {'loss': 1.0261, 'grad_norm': 13.12460708618164, 
'learning_rate': 1.0243144327477013e-07, 'rewards/chosen': -0.9939165711402893, 'rewards/rejected': -1.5242608785629272, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.5303443074226379, 'logps/chosen': -219.22235107421875, 'logps/rejected': -319.46099853515625, 'logps/ref_chosen': -81.0399169921875, 'logps/ref_rejected': -106.92170715332031, 'logits/chosen': -7.342049598693848, 'logits/rejected': -7.139578342437744, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.007224287837743759, 'kl/avg_steps': 0.625, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▊ | 499/681 [36:18<09:23, 3.10s/it] 73%|██████████████████████████████████████████████████████████████████████████████████▉ | 500/681 [36:21<09:15, 3.07s/it] {'loss': 1.0563, 'grad_norm': 11.147185325622559, 'learning_rate': 1.0139748428955333e-07, 'rewards/chosen': -1.0610756874084473, 'rewards/rejected': -1.5483863353729248, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.48731058835983276, 'logps/chosen': -237.38864135742188, 'logps/rejected': -317.2994079589844, 'logps/ref_chosen': -89.248046875, 'logps/ref_rejected': -100.41021728515625, 'logits/chosen': -7.811801433563232, 'logits/rejected': -7.598600387573242, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.007179416250437498, 'kl/avg_steps': 0.4375, 'epoch': 0.73} 73%|██████████████████████████████████████████████████████████████████████████████████▉ | 500/681 [36:21<09:15, 3.07s/it][INFO|trainer.py:4307] 2026-04-24 04:52:28,739 >> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-24 04:52:28,739 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-24 04:52:28,739 >> Batch size = 8 0%| | 0/73 [00:00> ***** Running Evaluation ***** [INFO|trainer.py:4309] 2026-04-24 04:58:23,933 >> Num examples = 2339 [INFO|trainer.py:4312] 2026-04-24 04:58:23,933 >> Batch size = 8 0%| | 0/73 [00:00> Saving model checkpoint to 
/scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-600 [INFO|configuration_utils.py:419] 2026-04-24 04:59:26,775 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-600/config.json [INFO|configuration_utils.py:911] 2026-04-24 04:59:26,778 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-600/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-24 05:00:06,997 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-600/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-04-24 05:00:07,002 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-600/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-04-24 05:00:07,005 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-600/special_tokens_map.json [INFO|trainer.py:4083] 2026-04-24 05:03:05,410 >> Deleting older checkpoint [/scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-200] due to args.save_total_limit 88%|█████████████████████████████████████████████████████████████████████████████████████████████████▉ | 601/681 [47:03<1:57:28, 88.10s/it] {'loss': 1.1196, 'grad_norm': 6.323831081390381, 'learning_rate': 2.1301532877994742e-08, 'rewards/chosen': -0.6223502159118652, 'rewards/rejected': -0.9426168203353882, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.32026663422584534, 'logps/chosen': 
-244.5936279296875, 'logps/rejected': -336.119873046875, 'logps/ref_chosen': -89.21614074707031, 'logps/ref_rejected': -100.17054748535156, 'logits/chosen': -7.606431007385254, 'logits/rejected': -7.300946235656738, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.004020956344902515, 'kl/avg_steps': 0.59375, 'epoch': 0.88} 88%|█████████████████████████████████████████████████████████████████████████████████████████████████▉ | 601/681 [47:03<1:57:28, 88.10s/it] 88%|██████████████████████████████████████████████████████████████████████████████████████████████████ | 602/681 [47:06<1:22:30, 62.66s/it] {'loss': 1.0711, 'grad_norm': 7.207840919494629, 'learning_rate': 2.0786184285784298e-08, 'rewards/chosen': -0.48817095160484314, 'rewards/rejected': -0.8820826411247253, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.3939117193222046, 'logps/chosen': -202.654296875, 'logps/rejected': -315.4969482421875, 'logps/ref_chosen': -80.05760192871094, 'logps/ref_rejected': -93.197509765625, 'logits/chosen': -7.25759744644165, 'logits/rejected': -7.028449058532715, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.003997222986072302, 'kl/avg_steps': 0.65625, 'epoch': 0.88} 88%|██████████████████████████████████████████████████████████████████████████████████████████████████ | 602/681 [47:06<1:22:30, 62.66s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████ | 603/681 [47:09<58:13, 44.78s/it] {'loss': 1.1139, 'grad_norm': 6.062160968780518, 'learning_rate': 2.0276875690788204e-08, 'rewards/chosen': -0.526463508605957, 'rewards/rejected': -0.8603225946426392, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.33385905623435974, 'logps/chosen': -235.39010620117188, 'logps/rejected': -326.241455078125, 'logps/ref_chosen': -102.30957794189453, 'logps/ref_rejected': -108.06884765625, 'logits/chosen': -7.96243953704834, 'logits/rejected': -7.082343101501465, 
'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.003971162252128124, 'kl/avg_steps': 0.5625, 'epoch': 0.89} 89%|████████████████████████████████████████████████████████████████████████████████████████████████████ | 603/681 [47:09<58:13, 44.78s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 604/681 [47:12<41:26, 32.29s/it] {'loss': 1.1058, 'grad_norm': 6.03358268737793, 'learning_rate': 1.977362051376158e-08, 'rewards/chosen': -0.5420645475387573, 'rewards/rejected': -0.8781407475471497, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.33607620000839233, 'logps/chosen': -216.16348266601562, 'logps/rejected': -323.61181640625, 'logps/ref_chosen': -78.17408752441406, 'logps/ref_rejected': -99.4961166381836, 'logits/chosen': -7.427624702453613, 'logits/rejected': -7.069514274597168, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.0039489492774009705, 'kl/avg_steps': 0.6875, 'epoch': 0.89} 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 604/681 [47:12<41:26, 32.29s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 605/681 [47:16<29:49, 23.55s/it] {'loss': 1.1539, 'grad_norm': 6.287881374359131, 'learning_rate': 1.9276432015946446e-08, 'rewards/chosen': -0.5785641074180603, 'rewards/rejected': -0.8597608208656311, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.2811967134475708, 'logps/chosen': -242.73826599121094, 'logps/rejected': -327.88031005859375, 'logps/ref_chosen': -94.77333068847656, 'logps/ref_rejected': -107.30490112304688, 'logits/chosen': -7.781911373138428, 'logits/rejected': -7.073343276977539, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.003921985626220703, 'kl/avg_steps': 0.46875, 'epoch': 0.89} 
89%|████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 605/681 [47:16<29:49, 23.55s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 606/681 [47:18<21:39, 17.33s/it] {'loss': 1.1412, 'grad_norm': 6.576826572418213, 'learning_rate': 1.8785323298722093e-08, 'rewards/chosen': -0.5615659952163696, 'rewards/rejected': -0.8598177433013916, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.298251748085022, 'logps/chosen': -232.45693969726562, 'logps/rejected': -326.3280944824219, 'logps/ref_chosen': -87.7533950805664, 'logps/ref_rejected': -104.2422103881836, 'logits/chosen': -7.757646083831787, 'logits/rejected': -6.858328342437744, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.003903687233105302, 'kl/avg_steps': 0.65625, 'epoch': 0.89} 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 606/681 [47:18<21:39, 17.33s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 607/681 [47:21<16:06, 13.07s/it] {'loss': 1.1314, 'grad_norm': 6.907278060913086, 'learning_rate': 1.8300307303259904e-08, 'rewards/chosen': -0.5482698082923889, 'rewards/rejected': -0.8778215646743774, 'rewards/accuracies': 0.75, 'rewards/margins': 0.3295517861843109, 'logps/chosen': -229.9647216796875, 'logps/rejected': -314.5696716308594, 'logps/ref_chosen': -88.32904815673828, 'logps/ref_rejected': -86.76811218261719, 'logits/chosen': -7.828334808349609, 'logits/rejected': -6.711248397827148, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.0038782362826168537, 'kl/avg_steps': 0.5, 'epoch': 0.89} 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 607/681 [47:22<16:06, 13.07s/it] 
89%|████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 608/681 [47:25<12:14, 10.06s/it] {'loss': 1.1084, 'grad_norm': 7.614426612854004, 'learning_rate': 1.7821396810182437e-08, 'rewards/chosen': -0.4974094033241272, 'rewards/rejected': -0.8393654823303223, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.34195607900619507, 'logps/chosen': -215.21920776367188, 'logps/rejected': -319.41949462890625, 'logps/ref_chosen': -85.76937103271484, 'logps/ref_rejected': -100.23281860351562, 'logits/chosen': -7.646144866943359, 'logits/rejected': -7.090588569641113, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.0038589416071772575, 'kl/avg_steps': 0.65625, 'epoch': 0.89} 89%|████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 608/681 [47:25<12:14, 10.06s/it] 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████ | 609/681 [47:27<09:27, 7.88s/it] {'loss': 1.1163, 'grad_norm': 6.746026992797852, 'learning_rate': 1.7348604439226617e-08, 'rewards/chosen': -0.531781792640686, 'rewards/rejected': -0.8633979558944702, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.3316161632537842, 'logps/chosen': -232.28611755371094, 'logps/rejected': -322.77618408203125, 'logps/ref_chosen': -92.96656799316406, 'logps/ref_rejected': -95.91818237304688, 'logits/chosen': -7.802424907684326, 'logits/rejected': -7.451755523681641, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0038337823934853077, 'kl/avg_steps': 0.625, 'epoch': 0.89} 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████ | 609/681 [47:27<09:27, 7.88s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 610/681 [47:30<07:37, 6.44s/it] {'loss': 1.1011, 'grad_norm': 6.536015510559082, 'learning_rate': 
1.6881942648911074e-08, 'rewards/chosen': -0.5073626637458801, 'rewards/rejected': -0.8506356477737427, 'rewards/accuracies': 0.875, 'rewards/margins': 0.3432729244232178, 'logps/chosen': -228.5191650390625, 'logps/rejected': -314.64813232421875, 'logps/ref_chosen': -94.70028686523438, 'logps/ref_rejected': -89.68739318847656, 'logits/chosen': -7.566076278686523, 'logits/rejected': -7.088051795959473, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.0038099701050668955, 'kl/avg_steps': 0.65625, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 610/681 [47:30<07:37, 6.44s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 611/681 [47:33<06:14, 5.35s/it] {'loss': 1.1105, 'grad_norm': 7.131064414978027, 'learning_rate': 1.6421423736208e-08, 'rewards/chosen': -0.52030348777771, 'rewards/rejected': -0.8540895581245422, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.3337860703468323, 'logps/chosen': -224.7576446533203, 'logps/rejected': -316.98431396484375, 'logps/ref_chosen': -86.78334045410156, 'logps/ref_rejected': -89.84307861328125, 'logits/chosen': -7.605221748352051, 'logits/rejected': -7.5947184562683105, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.003785130102187395, 'kl/avg_steps': 0.5625, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 611/681 [47:33<06:14, 5.35s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 612/681 [47:36<05:16, 4.58s/it] {'loss': 1.1023, 'grad_norm': 6.337128639221191, 'learning_rate': 1.5967059836219042e-08, 'rewards/chosen': -0.5628792643547058, 'rewards/rejected': -0.9013561010360718, 'rewards/accuracies': 0.90625, 'rewards/margins': 0.33847683668136597, 'logps/chosen': -251.38430786132812, 
'logps/rejected': -335.0631103515625, 'logps/ref_chosen': -101.02015686035156, 'logps/ref_rejected': -93.78302764892578, 'logits/chosen': -7.775234699249268, 'logits/rejected': -6.849664688110352, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.003763957880437374, 'kl/avg_steps': 0.6875, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 612/681 [47:36<05:16, 4.58s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 613/681 [47:39<04:39, 4.12s/it] {'loss': 1.0762, 'grad_norm': 6.324493885040283, 'learning_rate': 1.551886292185553e-08, 'rewards/chosen': -0.46113917231559753, 'rewards/rejected': -0.8275177478790283, 'rewards/accuracies': 0.921875, 'rewards/margins': 0.3663785457611084, 'logps/chosen': -213.39401245117188, 'logps/rejected': -333.4338073730469, 'logps/ref_chosen': -88.9886245727539, 'logps/ref_rejected': -109.99551391601562, 'logits/chosen': -8.078614234924316, 'logits/rejected': -7.592404365539551, 'kl/p_epsilon_steps': 0.953125, 'kl/n_epsilon_steps': 0.046875, 'kl/beta': 0.003738257335498929, 'kl/avg_steps': 0.90625, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 613/681 [47:39<04:39, 4.12s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 614/681 [47:42<04:12, 3.77s/it] {'loss': 1.1461, 'grad_norm': 7.8051652908325195, 'learning_rate': 1.507684480352292e-08, 'rewards/chosen': -0.5560883283615112, 'rewards/rejected': -0.8495379686355591, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.29344964027404785, 'logps/chosen': -230.80239868164062, 'logps/rejected': -340.70416259765625, 'logps/ref_chosen': -80.20005798339844, 'logps/ref_rejected': -109.86239624023438, 'logits/chosen': -7.316326141357422, 'logits/rejected': -7.080374240875244, 
'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0037046836223453283, 'kl/avg_steps': 0.5625, 'epoch': 0.9} 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 614/681 [47:42<04:12, 3.77s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████ | 615/681 [47:45<03:56, 3.58s/it] {'loss': 1.1669, 'grad_norm': 6.5305585861206055, 'learning_rate': 1.4641017128809801e-08, 'rewards/chosen': -0.525856077671051, 'rewards/rejected': -0.7832653522491455, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.2574092149734497, 'logps/chosen': -243.8656463623047, 'logps/rejected': -315.3383483886719, 'logps/ref_chosen': -100.43526458740234, 'logps/ref_rejected': -101.1800537109375, 'logits/chosen': -7.641972541809082, 'logits/rejected': -6.816801071166992, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0036839614622294903, 'kl/avg_steps': 0.59375, 'epoch': 0.9} 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████ | 615/681 [47:45<03:56, 3.58s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 616/681 [47:48<03:45, 3.47s/it] {'loss': 1.1124, 'grad_norm': 5.881459712982178, 'learning_rate': 1.4211391382180637e-08, 'rewards/chosen': -0.5267339944839478, 'rewards/rejected': -0.8600709438323975, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.3333369195461273, 'logps/chosen': -236.98391723632812, 'logps/rejected': -318.6792297363281, 'logps/ref_chosen': -92.49292755126953, 'logps/ref_rejected': -82.06065368652344, 'logits/chosen': -7.507961273193359, 'logits/rejected': -7.136435508728027, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.003662216942757368, 'kl/avg_steps': 0.625, 'epoch': 0.9} 
90%|██████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 616/681 [47:48<03:45, 3.47s/it] 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 617/681 [47:52<03:38, 3.42s/it] {'loss': 1.1664, 'grad_norm': 6.169250011444092, 'learning_rate': 1.378797888467345e-08, 'rewards/chosen': -0.5471093654632568, 'rewards/rejected': -0.8117498755455017, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.26464053988456726, 'logps/chosen': -241.96456909179688, 'logps/rejected': -294.9502868652344, 'logps/ref_chosen': -91.09699249267578, 'logps/ref_rejected': -70.41004943847656, 'logits/chosen': -7.898214817047119, 'logits/rejected': -7.082752227783203, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0036394703201949596, 'kl/avg_steps': 0.59375, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 617/681 [47:52<03:38, 3.42s/it] 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 618/681 [47:55<03:33, 3.39s/it] {'loss': 1.1462, 'grad_norm': 7.049670696258545, 'learning_rate': 1.3370790793601371e-08, 'rewards/chosen': -0.5415487289428711, 'rewards/rejected': -0.8314208388328552, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.2898721694946289, 'logps/chosen': -252.33189392089844, 'logps/rejected': -331.26397705078125, 'logps/ref_chosen': -102.02059936523438, 'logps/ref_rejected': -99.80119323730469, 'logits/chosen': -7.613478660583496, 'logits/rejected': -7.167911052703857, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.003617988433688879, 'kl/avg_steps': 0.5625, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 618/681 [47:55<03:33, 3.39s/it] 
91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 619/681 [47:58<03:24, 3.30s/it] {'loss': 1.1708, 'grad_norm': 6.142784118652344, 'learning_rate': 1.2959838102258535e-08, 'rewards/chosen': -0.5201643705368042, 'rewards/rejected': -0.7863855361938477, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.26622116565704346, 'logps/chosen': -234.5894775390625, 'logps/rejected': -319.714111328125, 'logps/ref_chosen': -89.74136352539062, 'logps/ref_rejected': -99.90138244628906, 'logits/chosen': -7.600045204162598, 'logits/rejected': -7.366483688354492, 'kl/p_epsilon_steps': 0.6875, 'kl/n_epsilon_steps': 0.3125, 'kl/beta': 0.003597751259803772, 'kl/avg_steps': 0.375, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 619/681 [47:58<03:24, 3.30s/it] 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 620/681 [48:01<03:15, 3.20s/it] {'loss': 1.1526, 'grad_norm': 5.838797092437744, 'learning_rate': 1.2555131639630567e-08, 'rewards/chosen': -0.5036029815673828, 'rewards/rejected': -0.7929166555404663, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.28931373357772827, 'logps/chosen': -225.7647247314453, 'logps/rejected': -307.8377380371094, 'logps/ref_chosen': -85.12431335449219, 'logps/ref_rejected': -85.41253662109375, 'logits/chosen': -8.057722091674805, 'logits/rejected': -7.399764060974121, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.003584309946745634, 'kl/avg_steps': 0.40625, 'epoch': 0.91} 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 620/681 [48:01<03:15, 3.20s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████ | 621/681 [48:04<03:10, 3.17s/it] {'loss': 1.1604, 'grad_norm': 5.572709083557129, 'learning_rate': 
1.2156682070109086e-08, 'rewards/chosen': -0.49918726086616516, 'rewards/rejected': -0.7818341851234436, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.28264692425727844, 'logps/chosen': -229.43209838867188, 'logps/rejected': -315.90655517578125, 'logps/ref_chosen': -89.24842071533203, 'logps/ref_rejected': -95.46463775634766, 'logits/chosen': -7.758251190185547, 'logits/rejected': -7.50982666015625, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.0035698076244443655, 'kl/avg_steps': 0.46875, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████ | 621/681 [48:04<03:10, 3.17s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 622/681 [48:07<03:04, 3.13s/it] {'loss': 1.1355, 'grad_norm': 6.386410713195801, 'learning_rate': 1.1764499893210878e-08, 'rewards/chosen': -0.5042383670806885, 'rewards/rejected': -0.8008482456207275, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.29660987854003906, 'logps/chosen': -242.18194580078125, 'logps/rejected': -317.67083740234375, 'logps/ref_chosen': -99.79413604736328, 'logps/ref_rejected': -90.82821655273438, 'logits/chosen': -7.883831977844238, 'logits/rejected': -6.893963813781738, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0035531523171812296, 'kl/avg_steps': 0.5625, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 622/681 [48:07<03:04, 3.13s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 623/681 [48:10<02:51, 2.96s/it] {'loss': 1.2219, 'grad_norm': 7.438137531280518, 'learning_rate': 1.1378595443300998e-08, 'rewards/chosen': -0.5164909958839417, 'rewards/rejected': -0.7255691289901733, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.20907816290855408, 'logps/chosen': 
-236.72573852539062, 'logps/rejected': -297.7119445800781, 'logps/ref_chosen': -90.45555114746094, 'logps/ref_rejected': -91.32276916503906, 'logits/chosen': -7.771790504455566, 'logits/rejected': -7.093353748321533, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.0035332776606082916, 'kl/avg_steps': 0.40625, 'epoch': 0.91} 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 623/681 [48:10<02:51, 2.96s/it] 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 624/681 [48:13<02:50, 2.99s/it] {'loss': 1.1213, 'grad_norm': 6.896797180175781, 'learning_rate': 1.0998978889320582e-08, 'rewards/chosen': -0.5218902826309204, 'rewards/rejected': -0.8390946388244629, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.3172043263912201, 'logps/chosen': -258.7445373535156, 'logps/rejected': -344.84759521484375, 'logps/ref_chosen': -109.87522888183594, 'logps/ref_rejected': -104.77320861816406, 'logits/chosen': -8.146064758300781, 'logits/rejected': -7.289480209350586, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0035189816262573004, 'kl/avg_steps': 0.5625, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 624/681 [48:13<02:50, 2.99s/it] 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 625/681 [48:16<02:52, 3.08s/it] {'loss': 1.1839, 'grad_norm': 5.683692455291748, 'learning_rate': 1.0625660234518913e-08, 'rewards/chosen': -0.5464307069778442, 'rewards/rejected': -0.786613941192627, 'rewards/accuracies': 0.75, 'rewards/margins': 0.24018320441246033, 'logps/chosen': -243.88046264648438, 'logps/rejected': -318.0606994628906, 'logps/ref_chosen': -87.16815948486328, 'logps/ref_rejected': -91.86148071289062, 'logits/chosen': -7.770914554595947, 'logits/rejected': 
-7.289626121520996, 'kl/p_epsilon_steps': 0.75, 'kl/n_epsilon_steps': 0.25, 'kl/beta': 0.0034992981236428022, 'kl/avg_steps': 0.5, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 625/681 [48:16<02:52, 3.08s/it] 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 626/681 [48:19<02:54, 3.17s/it] {'loss': 1.2005, 'grad_norm': 7.084156513214111, 'learning_rate': 1.0258649316189721e-08, 'rewards/chosen': -0.5199881792068481, 'rewards/rejected': -0.7504914402961731, 'rewards/accuracies': 0.75, 'rewards/margins': 0.23050320148468018, 'logps/chosen': -254.01075744628906, 'logps/rejected': -321.1068115234375, 'logps/ref_chosen': -104.22421264648438, 'logps/ref_rejected': -104.1774673461914, 'logits/chosen': -7.743939399719238, 'logits/rejected': -7.482439994812012, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.0034818886779248714, 'kl/avg_steps': 0.40625, 'epoch': 0.92} 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 626/681 [48:19<02:54, 3.17s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████ | 627/681 [48:23<02:49, 3.15s/it] {'loss': 1.1507, 'grad_norm': 5.33841609954834, 'learning_rate': 9.897955805412e-09, 'rewards/chosen': -0.46203625202178955, 'rewards/rejected': -0.7497316002845764, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.2876952886581421, 'logps/chosen': -208.89431762695312, 'logps/rejected': -330.51983642578125, 'logps/ref_chosen': -74.93461608886719, 'logps/ref_rejected': -112.57289123535156, 'logits/chosen': -7.591939926147461, 'logits/rejected': -7.3537187576293945, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0034678007941693068, 'kl/avg_steps': 0.625, 'epoch': 0.92} 
92%|████████████████████████████████████████████████████████████████████████████████████████████████████████ | 627/681 [48:23<02:49, 3.15s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 628/681 [48:26<02:45, 3.12s/it] {'loss': 1.1251, 'grad_norm': 6.301426410675049, 'learning_rate': 9.543589206795238e-09, 'rewards/chosen': -0.4717589020729065, 'rewards/rejected': -0.7812816500663757, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.30952274799346924, 'logps/chosen': -231.39364624023438, 'logps/rejected': -335.9172668457031, 'logps/ref_chosen': -93.69107818603516, 'logps/ref_rejected': -107.34395599365234, 'logits/chosen': -7.962158679962158, 'logits/rejected': -7.421542644500732, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.0034462616313248873, 'kl/avg_steps': 0.6875, 'epoch': 0.92} 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 628/681 [48:26<02:45, 3.12s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 629/681 [48:29<02:41, 3.10s/it] {'loss': 1.1605, 'grad_norm': 5.8255295753479, 'learning_rate': 9.19555885822887e-09, 'rewards/chosen': -0.5139710903167725, 'rewards/rejected': -0.7815043926239014, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.2675333023071289, 'logps/chosen': -253.8419952392578, 'logps/rejected': -326.94293212890625, 'logps/ref_chosen': -103.23037719726562, 'logps/ref_rejected': -97.16841888427734, 'logits/chosen': -7.680126190185547, 'logits/rejected': -7.379844665527344, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0034227303694933653, 'kl/avg_steps': 0.53125, 'epoch': 0.92} 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 629/681 [48:29<02:41, 3.10s/it] 
93%|████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 630/681 [48:32<02:38, 3.10s/it] {'loss': 1.2339, 'grad_norm': 7.221403121948242, 'learning_rate': 8.85387393063622e-09, 'rewards/chosen': -0.4714937210083008, 'rewards/rejected': -0.6566429138183594, 'rewards/accuracies': 0.75, 'rewards/margins': 0.1851492077112198, 'logps/chosen': -232.68309020996094, 'logps/rejected': -283.4253234863281, 'logps/ref_chosen': -93.89755249023438, 'logps/ref_rejected': -89.3743896484375, 'logits/chosen': -7.831258773803711, 'logits/rejected': -7.15199089050293, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.003404643153771758, 'kl/avg_steps': 0.46875, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 630/681 [48:32<02:38, 3.10s/it] 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 631/681 [48:35<02:32, 3.05s/it] {'loss': 1.1612, 'grad_norm': 5.580920219421387, 'learning_rate': 8.518543427732949e-09, 'rewards/chosen': -0.4492769241333008, 'rewards/rejected': -0.7134263515472412, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.2641494572162628, 'logps/chosen': -220.7677764892578, 'logps/rejected': -300.57403564453125, 'logps/ref_chosen': -87.77082061767578, 'logps/ref_rejected': -88.68241882324219, 'logits/chosen': -7.815197944641113, 'logits/rejected': -7.3377485275268555, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.003388758283108473, 'kl/avg_steps': 0.53125, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 631/681 [48:35<02:32, 3.05s/it] 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 632/681 [48:38<02:27, 3.00s/it] {'loss': 1.1516, 'grad_norm': 6.467636585235596, 'learning_rate': 
8.189576185789637e-09, 'rewards/chosen': -0.44607216119766235, 'rewards/rejected': -0.7322953343391418, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2862231731414795, 'logps/chosen': -221.4307861328125, 'logps/rejected': -310.26220703125, 'logps/ref_chosen': -88.62652587890625, 'logps/ref_rejected': -91.45091247558594, 'logits/chosen': -7.605412483215332, 'logits/rejected': -7.2681169509887695, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0033708508126437664, 'kl/avg_steps': 0.53125, 'epoch': 0.93} 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 632/681 [48:38<02:27, 3.00s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████ | 633/681 [48:41<02:27, 3.07s/it] {'loss': 1.2589, 'grad_norm': 6.414995193481445, 'learning_rate': 7.866980873399015e-09, 'rewards/chosen': -0.5550001859664917, 'rewards/rejected': -0.7260842323303223, 'rewards/accuracies': 0.671875, 'rewards/margins': 0.17108407616615295, 'logps/chosen': -247.08786010742188, 'logps/rejected': -316.35614013671875, 'logps/ref_chosen': -81.37442016601562, 'logps/ref_rejected': -98.62571716308594, 'logits/chosen': -7.696202278137207, 'logits/rejected': -7.264307498931885, 'kl/p_epsilon_steps': 0.640625, 'kl/n_epsilon_steps': 0.359375, 'kl/beta': 0.00335303763858974, 'kl/avg_steps': 0.28125, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████ | 633/681 [48:41<02:27, 3.07s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 634/681 [48:44<02:26, 3.12s/it] {'loss': 1.1935, 'grad_norm': 6.248257637023926, 'learning_rate': 7.550765991247654e-09, 'rewards/chosen': -0.5128281712532043, 'rewards/rejected': -0.7364822030067444, 'rewards/accuracies': 0.75, 'rewards/margins': 0.22365406155586243, 'logps/chosen': 
-250.07144165039062, 'logps/rejected': -334.539306640625, 'logps/ref_chosen': -96.12284851074219, 'logps/ref_rejected': -112.84780883789062, 'logits/chosen': -7.881155490875244, 'logits/rejected': -7.148515701293945, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0033436338417232037, 'kl/avg_steps': 0.53125, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 634/681 [48:44<02:26, 3.12s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 635/681 [48:47<02:20, 3.06s/it] {'loss': 1.2457, 'grad_norm': 6.869739055633545, 'learning_rate': 7.240939871891699e-09, 'rewards/chosen': -0.5283612012863159, 'rewards/rejected': -0.7009052634239197, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.17254400253295898, 'logps/chosen': -258.06280517578125, 'logps/rejected': -301.9739990234375, 'logps/ref_chosen': -98.68411254882812, 'logps/ref_rejected': -89.8991928100586, 'logits/chosen': -7.332803249359131, 'logits/rejected': -6.858902931213379, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.003325964557006955, 'kl/avg_steps': 0.46875, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 635/681 [48:47<02:20, 3.06s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 636/681 [48:50<02:19, 3.10s/it] {'loss': 1.1499, 'grad_norm': 5.351519584655762, 'learning_rate': 6.937510679537628e-09, 'rewards/chosen': -0.4755138158798218, 'rewards/rejected': -0.7524248361587524, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.27691105008125305, 'logps/chosen': -234.77163696289062, 'logps/rejected': -316.6680603027344, 'logps/ref_chosen': -90.41796112060547, 'logps/ref_rejected': -87.70687866210938, 'logits/chosen': -7.650508880615234, 'logits/rejected': 
-6.78857946395874, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.003310446860268712, 'kl/avg_steps': 0.625, 'epoch': 0.93} 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 636/681 [48:50<02:19, 3.10s/it] 94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 637/681 [48:53<02:15, 3.09s/it] {'loss': 1.1395, 'grad_norm': 5.5197577476501465, 'learning_rate': 6.640486409826785e-09, 'rewards/chosen': -0.4761905372142792, 'rewards/rejected': -0.7680450677871704, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.29185453057289124, 'logps/chosen': -227.92446899414062, 'logps/rejected': -339.26043701171875, 'logps/ref_chosen': -82.44971466064453, 'logps/ref_rejected': -104.02860260009766, 'logits/chosen': -7.380791664123535, 'logits/rejected': -7.242560386657715, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0032898851204663515, 'kl/avg_steps': 0.59375, 'epoch': 0.94} 94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 637/681 [48:53<02:15, 3.09s/it] 94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 638/681 [48:56<02:11, 3.06s/it] {'loss': 1.1318, 'grad_norm': 5.237280368804932, 'learning_rate': 6.349874889624962e-09, 'rewards/chosen': -0.43678027391433716, 'rewards/rejected': -0.7282345294952393, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.2914542555809021, 'logps/chosen': -226.06134033203125, 'logps/rejected': -310.4959716796875, 'logps/ref_chosen': -91.92498779296875, 'logps/ref_rejected': -86.28703308105469, 'logits/chosen': -7.648863792419434, 'logits/rejected': -6.924067497253418, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0032704665791243315, 'kl/avg_steps': 0.59375, 'epoch': 0.94} 
94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 638/681 [48:56<02:11, 3.06s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████ | 639/681 [48:59<02:09, 3.09s/it] {'loss': 1.2695, 'grad_norm': 5.765613079071045, 'learning_rate': 6.065683776815933e-09, 'rewards/chosen': -0.5761454105377197, 'rewards/rejected': -0.7252421975135803, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.1490968018770218, 'logps/chosen': -281.9426574707031, 'logps/rejected': -305.64599609375, 'logps/ref_chosen': -104.52755737304688, 'logps/ref_rejected': -81.4803466796875, 'logits/chosen': -7.779665470123291, 'logits/rejected': -6.977321624755859, 'kl/p_epsilon_steps': 0.671875, 'kl/n_epsilon_steps': 0.328125, 'kl/beta': 0.003251162823289633, 'kl/avg_steps': 0.34375, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████ | 639/681 [48:59<02:09, 3.09s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 640/681 [49:02<02:06, 3.09s/it] {'loss': 1.1573, 'grad_norm': 5.791350364685059, 'learning_rate': 5.7879205600998296e-09, 'rewards/chosen': -0.47469913959503174, 'rewards/rejected': -0.7432272434234619, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.26852816343307495, 'logps/chosen': -245.22933959960938, 'logps/rejected': -343.93792724609375, 'logps/ref_chosen': -97.88526916503906, 'logps/ref_rejected': -112.70501708984375, 'logits/chosen': -7.959763526916504, 'logits/rejected': -7.383049964904785, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.00324002536945045, 'kl/avg_steps': 0.71875, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 640/681 [49:02<02:06, 3.09s/it] 
94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 641/681 [49:06<02:05, 3.14s/it] {'loss': 1.1614, 'grad_norm': 5.238857269287109, 'learning_rate': 5.516592558795746e-09, 'rewards/chosen': -0.4384271800518036, 'rewards/rejected': -0.7108187079429626, 'rewards/accuracies': 0.71875, 'rewards/margins': 0.27239149808883667, 'logps/chosen': -232.91995239257812, 'logps/rejected': -317.35809326171875, 'logps/ref_chosen': -96.4456787109375, 'logps/ref_rejected': -95.13568878173828, 'logits/chosen': -7.708554267883301, 'logits/rejected': -7.541186332702637, 'kl/p_epsilon_steps': 0.65625, 'kl/n_epsilon_steps': 0.34375, 'kl/beta': 0.0032169038895517588, 'kl/avg_steps': 0.3125, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 641/681 [49:06<02:05, 3.14s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 642/681 [49:09<02:01, 3.12s/it] {'loss': 1.1825, 'grad_norm': 5.874085903167725, 'learning_rate': 5.251706922648868e-09, 'rewards/chosen': -0.497159868478775, 'rewards/rejected': -0.7433220148086548, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.24616217613220215, 'logps/chosen': -256.21673583984375, 'logps/rejected': -347.91558837890625, 'logps/ref_chosen': -100.75984954833984, 'logps/ref_rejected': -114.70763397216797, 'logits/chosen': -7.545126914978027, 'logits/rejected': -7.102571487426758, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.003206882392987609, 'kl/avg_steps': 0.53125, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 642/681 [49:09<02:01, 3.12s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 643/681 [49:12<01:59, 3.14s/it] {'loss': 1.1822, 'grad_norm': 6.670008659362793, 
'learning_rate': 4.993270631642038e-09, 'rewards/chosen': -0.4609188437461853, 'rewards/rejected': -0.7005389928817749, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.2396201640367508, 'logps/chosen': -229.69638061523438, 'logps/rejected': -315.35858154296875, 'logps/ref_chosen': -84.74365997314453, 'logps/ref_rejected': -94.31842041015625, 'logits/chosen': -7.875416278839111, 'logits/rejected': -7.243396759033203, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.003189935814589262, 'kl/avg_steps': 0.5625, 'epoch': 0.94} 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 643/681 [49:12<01:59, 3.14s/it] 95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 644/681 [49:15<01:56, 3.14s/it] {'loss': 1.2111, 'grad_norm': 6.133624076843262, 'learning_rate': 4.741290495811873e-09, 'rewards/chosen': -0.45997491478919983, 'rewards/rejected': -0.6619677543640137, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.20199283957481384, 'logps/chosen': -230.85415649414062, 'logps/rejected': -304.661865234375, 'logps/ref_chosen': -85.32275390625, 'logps/ref_rejected': -94.60861206054688, 'logits/chosen': -7.913561820983887, 'logits/rejected': -7.103449821472168, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0031720928382128477, 'kl/avg_steps': 0.5625, 'epoch': 0.95} 95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 644/681 [49:15<01:56, 3.14s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████ | 645/681 [49:18<01:52, 3.12s/it] {'loss': 1.2771, 'grad_norm': 5.844942092895508, 'learning_rate': 4.495773155069299e-09, 'rewards/chosen': -0.5023009777069092, 'rewards/rejected': -0.6405566930770874, 'rewards/accuracies': 0.625, 'rewards/margins': 0.13825571537017822, 
'logps/chosen': -241.77206420898438, 'logps/rejected': -306.9005432128906, 'logps/ref_chosen': -82.59024047851562, 'logps/ref_rejected': -103.00375366210938, 'logits/chosen': -7.571622371673584, 'logits/rejected': -7.209532737731934, 'kl/p_epsilon_steps': 0.59375, 'kl/n_epsilon_steps': 0.40625, 'kl/beta': 0.0031543495133519173, 'kl/avg_steps': 0.1875, 'epoch': 0.95} 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████ | 645/681 [49:18<01:52, 3.12s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 646/681 [49:21<01:46, 3.05s/it] {'loss': 1.1663, 'grad_norm': 5.194578647613525, 'learning_rate': 4.256725079024553e-09, 'rewards/chosen': -0.43383896350860596, 'rewards/rejected': -0.6858773231506348, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.2520383596420288, 'logps/chosen': -229.54640197753906, 'logps/rejected': -303.084716796875, 'logps/ref_chosen': -91.1920394897461, 'logps/ref_rejected': -83.77833557128906, 'logits/chosen': -7.935370922088623, 'logits/rejected': -7.05921745300293, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0031484460923820734, 'kl/avg_steps': 0.59375, 'epoch': 0.95} 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 646/681 [49:21<01:46, 3.05s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 647/681 [49:24<01:46, 3.12s/it] {'loss': 1.2089, 'grad_norm': 5.19912576675415, 'learning_rate': 4.024152566816791e-09, 'rewards/chosen': -0.47964316606521606, 'rewards/rejected': -0.694766104221344, 'rewards/accuracies': 0.703125, 'rewards/margins': 0.21512295305728912, 'logps/chosen': -242.48236083984375, 'logps/rejected': -322.8554382324219, 'logps/ref_chosen': -88.84446716308594, 'logps/ref_rejected': -99.49832916259766, 'logits/chosen': -7.209627151489258, 
'logits/rejected': -7.173065185546875, 'kl/p_epsilon_steps': 0.703125, 'kl/n_epsilon_steps': 0.296875, 'kl/beta': 0.00312986271455884, 'kl/avg_steps': 0.40625, 'epoch': 0.95} 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 647/681 [49:24<01:46, 3.12s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 648/681 [49:27<01:41, 3.07s/it] {'loss': 1.127, 'grad_norm': 5.339207172393799, 'learning_rate': 3.798061746947995e-09, 'rewards/chosen': -0.4329220652580261, 'rewards/rejected': -0.7390461564064026, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.30612409114837646, 'logps/chosen': -227.42025756835938, 'logps/rejected': -343.5663146972656, 'logps/ref_chosen': -87.84810638427734, 'logps/ref_rejected': -104.67005920410156, 'logits/chosen': -7.534468173980713, 'logits/rejected': -7.395105838775635, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.003117199055850506, 'kl/avg_steps': 0.6875, 'epoch': 0.95} 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 648/681 [49:27<01:41, 3.07s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 649/681 [49:31<01:41, 3.17s/it] {'loss': 1.1676, 'grad_norm': 5.596925258636475, 'learning_rate': 3.5784585771215235e-09, 'rewards/chosen': -0.39361506700515747, 'rewards/rejected': -0.6433808207511902, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.24976572394371033, 'logps/chosen': -217.3938446044922, 'logps/rejected': -297.97711181640625, 'logps/ref_chosen': -89.6925048828125, 'logps/ref_rejected': -88.70658111572266, 'logits/chosen': -7.8632001876831055, 'logits/rejected': -7.287814140319824, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0030959146097302437, 'kl/avg_steps': 0.5625, 'epoch': 0.95} 
95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 649/681 [49:31<01:41, 3.17s/it] 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 650/681 [49:34<01:35, 3.08s/it] {'loss': 1.1869, 'grad_norm': 5.370362281799316, 'learning_rate': 3.3653488440851253e-09, 'rewards/chosen': -0.463270366191864, 'rewards/rejected': -0.7010980844497681, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.23782768845558167, 'logps/chosen': -241.04432678222656, 'logps/rejected': -332.0096435546875, 'logps/ref_chosen': -89.93060302734375, 'logps/ref_rejected': -102.61282348632812, 'logits/chosen': -7.673471450805664, 'logits/rejected': -7.417351722717285, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0030785975977778435, 'kl/avg_steps': 0.625, 'epoch': 0.95} 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 650/681 [49:34<01:35, 3.08s/it] 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 651/681 [49:37<01:32, 3.07s/it] {'loss': 1.1412, 'grad_norm': 5.453701972961426, 'learning_rate': 3.158738163478475e-09, 'rewards/chosen': -0.3932613134384155, 'rewards/rejected': -0.675163745880127, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.2819024920463562, 'logps/chosen': -208.45730590820312, 'logps/rejected': -328.37841796875, 'logps/ref_chosen': -79.18731689453125, 'logps/ref_rejected': -105.93333435058594, 'logits/chosen': -7.594956398010254, 'logits/rejected': -6.981936454772949, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.0030594756826758385, 'kl/avg_steps': 0.65625, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 651/681 [49:37<01:32, 3.07s/it] 
96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 652/681 [49:40<01:28, 3.06s/it] {'loss': 1.1713, 'grad_norm': 4.904377460479736, 'learning_rate': 2.9586319796851555e-09, 'rewards/chosen': -0.40263664722442627, 'rewards/rejected': -0.6452191472053528, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.24258247017860413, 'logps/chosen': -234.99612426757812, 'logps/rejected': -330.2041015625, 'logps/ref_chosen': -101.79022979736328, 'logps/ref_rejected': -116.3245849609375, 'logits/chosen': -7.993418216705322, 'logits/rejected': -7.6948065757751465, 'kl/p_epsilon_steps': 0.84375, 'kl/n_epsilon_steps': 0.15625, 'kl/beta': 0.003039528848603368, 'kl/avg_steps': 0.6875, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 652/681 [49:40<01:28, 3.06s/it] 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 653/681 [49:43<01:25, 3.04s/it] {'loss': 1.1885, 'grad_norm': 5.360039234161377, 'learning_rate': 2.7650355656892166e-09, 'rewards/chosen': -0.45035725831985474, 'rewards/rejected': -0.678409218788147, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.22805194556713104, 'logps/chosen': -243.21240234375, 'logps/rejected': -335.51324462890625, 'logps/ref_chosen': -93.35359191894531, 'logps/ref_rejected': -109.12324523925781, 'logits/chosen': -7.770810127258301, 'logits/rejected': -7.401587963104248, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.003018774790689349, 'kl/avg_steps': 0.59375, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 653/681 [49:43<01:25, 3.04s/it] 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 654/681 [49:46<01:22, 3.04s/it] {'loss': 1.2262, 'grad_norm': 
7.331425666809082, 'learning_rate': 2.577954022936174e-09, 'rewards/chosen': -0.45708924531936646, 'rewards/rejected': -0.6516642570495605, 'rewards/accuracies': 0.75, 'rewards/margins': 0.1945749968290329, 'logps/chosen': -241.98892211914062, 'logps/rejected': -323.5856628417969, 'logps/ref_chosen': -89.11553955078125, 'logps/ref_rejected': -104.91995239257812, 'logits/chosen': -7.537060260772705, 'logits/rejected': -7.259842872619629, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.003000956494361162, 'kl/avg_steps': 0.53125, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 654/681 [49:46<01:22, 3.04s/it] 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 655/681 [49:49<01:19, 3.05s/it] {'loss': 1.1856, 'grad_norm': 5.144998550415039, 'learning_rate': 2.397392281198729e-09, 'rewards/chosen': -0.4156780540943146, 'rewards/rejected': -0.65021151304245, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.23453345894813538, 'logps/chosen': -220.708984375, 'logps/rejected': -322.0316162109375, 'logps/ref_chosen': -81.03610229492188, 'logps/ref_rejected': -102.80233764648438, 'logits/chosen': -7.6725006103515625, 'logits/rejected': -7.398637771606445, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0029850981663912535, 'kl/avg_steps': 0.53125, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 655/681 [49:49<01:19, 3.05s/it] 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 656/681 [49:52<01:19, 3.18s/it] {'loss': 1.1311, 'grad_norm': 5.95790958404541, 'learning_rate': 2.223355098446622e-09, 'rewards/chosen': -0.4093048572540283, 'rewards/rejected': -0.7035213708877563, 'rewards/accuracies': 0.84375, 'rewards/margins': 
0.294216513633728, 'logps/chosen': -223.98876953125, 'logps/rejected': -357.14013671875, 'logps/ref_chosen': -85.32534790039062, 'logps/ref_rejected': -118.33866882324219, 'logits/chosen': -7.680576324462891, 'logits/rejected': -7.717093467712402, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.00296932365745306, 'kl/avg_steps': 0.71875, 'epoch': 0.96} 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 656/681 [49:52<01:19, 3.18s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 657/681 [49:55<01:13, 3.07s/it] {'loss': 1.1867, 'grad_norm': 6.30833625793457, 'learning_rate': 2.055847060721566e-09, 'rewards/chosen': -0.4302343726158142, 'rewards/rejected': -0.666432797908783, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.23619841039180756, 'logps/chosen': -226.81549072265625, 'logps/rejected': -330.3614501953125, 'logps/ref_chosen': -80.19772338867188, 'logps/ref_rejected': -102.581298828125, 'logits/chosen': -7.752594947814941, 'logits/rejected': -7.609277725219727, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.002948133973404765, 'kl/avg_steps': 0.59375, 'epoch': 0.96} 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 657/681 [49:55<01:13, 3.07s/it] 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 658/681 [49:58<01:08, 2.99s/it] {'loss': 1.2098, 'grad_norm': 5.188457012176514, 'learning_rate': 1.8948725820160662e-09, 'rewards/chosen': -0.44880080223083496, 'rewards/rejected': -0.6514099836349487, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.202609121799469, 'logps/chosen': -248.58145141601562, 'logps/rejected': -325.5663146972656, 'logps/ref_chosen': -94.634521484375, 'logps/ref_rejected': -101.63162231445312, 'logits/chosen': 
-7.566948890686035, 'logits/rejected': -7.049898147583008, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.002930732909590006, 'kl/avg_steps': 0.65625, 'epoch': 0.97} 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 658/681 [49:58<01:08, 2.99s/it] 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 659/681 [50:01<01:06, 3.01s/it] {'loss': 1.1774, 'grad_norm': 4.891656875610352, 'learning_rate': 1.7404359041573723e-09, 'rewards/chosen': -0.3891860246658325, 'rewards/rejected': -0.6415092945098877, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.25232329964637756, 'logps/chosen': -246.71299743652344, 'logps/rejected': -315.93914794921875, 'logps/ref_chosen': -112.55587005615234, 'logps/ref_rejected': -93.9216079711914, 'logits/chosen': -8.355998039245605, 'logits/rejected': -7.4629597663879395, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 0.0029116251971572638, 'kl/avg_steps': 0.5625, 'epoch': 0.97} 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 659/681 [50:01<01:06, 3.01s/it] 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 660/681 [50:04<01:02, 2.98s/it] {'loss': 1.1675, 'grad_norm': 7.11164665222168, 'learning_rate': 1.592541096695571e-09, 'rewards/chosen': -0.4076111614704132, 'rewards/rejected': -0.6608085632324219, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.25319740176200867, 'logps/chosen': -234.69586181640625, 'logps/rejected': -311.2170715332031, 'logps/ref_chosen': -93.37742614746094, 'logps/ref_rejected': -81.39482116699219, 'logits/chosen': -7.641197681427002, 'logits/rejected': -7.153841972351074, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0028953389264643192, 'kl/avg_steps': 0.59375, 
'epoch': 0.97} 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 660/681 [50:04<01:02, 2.98s/it] 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 661/681 [50:06<00:57, 2.86s/it] {'loss': 1.1572, 'grad_norm': 4.977914333343506, 'learning_rate': 1.4511920567963908e-09, 'rewards/chosen': -0.41780516505241394, 'rewards/rejected': -0.6864430904388428, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.26863789558410645, 'logps/chosen': -233.6283416748047, 'logps/rejected': -332.63751220703125, 'logps/ref_chosen': -87.85516357421875, 'logps/ref_rejected': -92.40330505371094, 'logits/chosen': -7.581234931945801, 'logits/rejected': -6.8354692459106445, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.002878249390050769, 'kl/avg_steps': 0.65625, 'epoch': 0.97} 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 661/681 [50:06<00:57, 2.86s/it] 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 662/681 [50:10<00:57, 3.02s/it] {'loss': 1.1704, 'grad_norm': 4.641286373138428, 'learning_rate': 1.3163925091384532e-09, 'rewards/chosen': -0.4002068340778351, 'rewards/rejected': -0.6452619433403015, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.24505510926246643, 'logps/chosen': -243.25115966796875, 'logps/rejected': -322.36285400390625, 'logps/ref_chosen': -102.77980041503906, 'logps/ref_rejected': -95.22531127929688, 'logits/chosen': -7.912067413330078, 'logits/rejected': -7.321425437927246, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.002859483938664198, 'kl/avg_steps': 0.59375, 'epoch': 0.97} 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 662/681 [50:10<00:57, 3.02s/it] 
97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 663/681 [50:13<00:56, 3.14s/it] {'loss': 1.1945, 'grad_norm': 5.062544822692871, 'learning_rate': 1.1881460058152382e-09, 'rewards/chosen': -0.39547061920166016, 'rewards/rejected': -0.6204517483711243, 'rewards/accuracies': 0.75, 'rewards/margins': 0.2249811291694641, 'logps/chosen': -235.74203491210938, 'logps/rejected': -340.11309814453125, 'logps/ref_chosen': -96.34658813476562, 'logps/ref_rejected': -120.52645111083984, 'logits/chosen': -7.826132774353027, 'logits/rejected': -7.58699369430542, 'kl/p_epsilon_steps': 0.734375, 'kl/n_epsilon_steps': 0.265625, 'kl/beta': 0.002842606045305729, 'kl/avg_steps': 0.46875, 'epoch': 0.97} 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 663/681 [50:13<00:56, 3.14s/it] 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 664/681 [50:17<00:54, 3.20s/it] {'loss': 1.2061, 'grad_norm': 5.462507724761963, 'learning_rate': 1.066455926241383e-09, 'rewards/chosen': -0.43186432123184204, 'rewards/rejected': -0.6376165151596069, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.20575222373008728, 'logps/chosen': -245.00120544433594, 'logps/rejected': -338.63543701171875, 'logps/ref_chosen': -91.84242248535156, 'logps/ref_rejected': -111.83668518066406, 'logits/chosen': -7.908196449279785, 'logits/rejected': -7.449959754943848, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.002829343546181917, 'kl/avg_steps': 0.53125, 'epoch': 0.98} 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 664/681 [50:17<00:54, 3.20s/it] 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 665/681 [50:19<00:49, 3.07s/it] {'loss': 1.1848, 'grad_norm': 
4.550708293914795, 'learning_rate': 9.513254770636137e-10, 'rewards/chosen': -0.36321356892585754, 'rewards/rejected': -0.5911097526550293, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.22789615392684937, 'logps/chosen': -217.97030639648438, 'logps/rejected': -303.5909423828125, 'logps/ref_chosen': -88.18618774414062, 'logps/ref_rejected': -91.9120101928711, 'logits/chosen': -7.470752716064453, 'logits/rejected': -7.1154465675354, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.0028143920935690403, 'kl/avg_steps': 0.71875, 'epoch': 0.98} 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 665/681 [50:19<00:49, 3.07s/it] 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 666/681 [50:23<00:46, 3.11s/it] {'loss': 1.2025, 'grad_norm': 6.450153350830078, 'learning_rate': 8.427576920763956e-10, 'rewards/chosen': -0.41387584805488586, 'rewards/rejected': -0.628452479839325, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2145766168832779, 'logps/chosen': -249.63475036621094, 'logps/rejected': -327.71697998046875, 'logps/ref_chosen': -100.97460174560547, 'logps/ref_rejected': -101.24992370605469, 'logits/chosen': -7.702958583831787, 'logits/rejected': -7.515384674072266, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.002794307889416814, 'kl/avg_steps': 0.53125, 'epoch': 0.98} 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 666/681 [50:23<00:46, 3.11s/it] 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 667/681 [50:26<00:43, 3.13s/it] {'loss': 1.1737, 'grad_norm': 4.641995429992676, 'learning_rate': 7.407554321417764e-10, 'rewards/chosen': -0.4399784505367279, 'rewards/rejected': -0.6796703934669495, 'rewards/accuracies': 0.8125, 
'rewards/margins': 0.23969195783138275, 'logps/chosen': -256.4490661621094, 'logps/rejected': -339.6527404785156, 'logps/ref_chosen': -97.5711669921875, 'logps/ref_rejected': -93.58476257324219, 'logits/chosen': -7.869577884674072, 'logits/rejected': -7.049417495727539, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.002779541537165642, 'kl/avg_steps': 0.59375, 'epoch': 0.98} 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 667/681 [50:26<00:43, 3.13s/it] 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 668/681 [50:29<00:40, 3.14s/it] {'loss': 1.1573, 'grad_norm': 6.144704341888428, 'learning_rate': 6.453213851142225e-10, 'rewards/chosen': -0.38403064012527466, 'rewards/rejected': -0.6493211388587952, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.2652904689311981, 'logps/chosen': -242.12380981445312, 'logps/rejected': -345.40899658203125, 'logps/ref_chosen': -102.5750503540039, 'logps/ref_rejected': -108.81768798828125, 'logits/chosen': -8.091211318969727, 'logits/rejected': -7.701225280761719, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.002763135591521859, 'kl/avg_steps': 0.53125, 'epoch': 0.98} 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 668/681 [50:29<00:40, 3.14s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 669/681 [50:32<00:37, 3.16s/it] {'loss': 1.1609, 'grad_norm': 5.321188926696777, 'learning_rate': 5.564580657695939e-10, 'rewards/chosen': -0.3517535924911499, 'rewards/rejected': -0.6092487573623657, 'rewards/accuracies': 0.8125, 'rewards/margins': 0.2574951946735382, 'logps/chosen': -217.85406494140625, 'logps/rejected': -305.6316223144531, 'logps/ref_chosen': -89.49478149414062, 'logps/ref_rejected': 
-82.51950073242188, 'logits/chosen': -7.704500198364258, 'logits/rejected': -6.9445672035217285, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0027485338505357504, 'kl/avg_steps': 0.59375, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 669/681 [50:32<00:37, 3.16s/it] 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 670/681 [50:35<00:35, 3.19s/it] {'loss': 1.1435, 'grad_norm': 4.645162582397461, 'learning_rate': 4.741678157389739e-10, 'rewards/chosen': -0.3478962182998657, 'rewards/rejected': -0.6254346370697021, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.2775384485721588, 'logps/chosen': -223.52005004882812, 'logps/rejected': -332.3464050292969, 'logps/ref_chosen': -95.45459747314453, 'logps/ref_rejected': -101.53292846679688, 'logits/chosen': -7.779049873352051, 'logits/rejected': -7.034740447998047, 'kl/p_epsilon_steps': 0.875, 'kl/n_epsilon_steps': 0.125, 'kl/beta': 0.0027323109097778797, 'kl/avg_steps': 0.75, 'epoch': 0.98} 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 670/681 [50:35<00:35, 3.19s/it] 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 671/681 [50:38<00:31, 3.17s/it] {'loss': 1.2244, 'grad_norm': 4.930266380310059, 'learning_rate': 3.9845280344705245e-10, 'rewards/chosen': -0.42587053775787354, 'rewards/rejected': -0.6129598617553711, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.18708932399749756, 'logps/chosen': -239.85293579101562, 'logps/rejected': -317.8214111328125, 'logps/ref_chosen': -82.12312316894531, 'logps/ref_rejected': -90.21969604492188, 'logits/chosen': -7.417923927307129, 'logits/rejected': -7.285045623779297, 'kl/p_epsilon_steps': 0.78125, 'kl/n_epsilon_steps': 0.21875, 'kl/beta': 
0.0027119710575789213, 'kl/avg_steps': 0.5625, 'epoch': 0.99} 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 671/681 [50:39<00:31, 3.17s/it] 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 672/681 [50:42<00:28, 3.14s/it] {'loss': 1.2405, 'grad_norm': 5.901381969451904, 'learning_rate': 3.293150240547549e-10, 'rewards/chosen': -0.423230916261673, 'rewards/rejected': -0.5902296304702759, 'rewards/accuracies': 0.734375, 'rewards/margins': 0.1669987291097641, 'logps/chosen': -247.38026428222656, 'logps/rejected': -320.5558166503906, 'logps/ref_chosen': -90.0619125366211, 'logps/ref_rejected': -100.45323181152344, 'logits/chosen': -7.819128513336182, 'logits/rejected': -7.268359184265137, 'kl/p_epsilon_steps': 0.71875, 'kl/n_epsilon_steps': 0.28125, 'kl/beta': 0.0026968014426529408, 'kl/avg_steps': 0.4375, 'epoch': 0.99} 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 672/681 [50:42<00:28, 3.14s/it] 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 673/681 [50:44<00:24, 3.02s/it] {'loss': 1.2187, 'grad_norm': 5.085114479064941, 'learning_rate': 2.6675629940689504e-10, 'rewards/chosen': -0.3727704882621765, 'rewards/rejected': -0.562617301940918, 'rewards/accuracies': 0.78125, 'rewards/margins': 0.18984678387641907, 'logps/chosen': -218.5710906982422, 'logps/rejected': -302.2262268066406, 'logps/ref_chosen': -79.26315307617188, 'logps/ref_rejected': -91.34925079345703, 'logits/chosen': -7.903676986694336, 'logits/rejected': -7.309889793395996, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.002685054438188672, 'kl/avg_steps': 0.53125, 'epoch': 0.99} 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ 
| 673/681 [50:44<00:24, 3.02s/it] 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 674/681 [50:47<00:21, 3.06s/it] {'loss': 1.1642, 'grad_norm': 4.563065052032471, 'learning_rate': 2.1077827798404725e-10, 'rewards/chosen': -0.34820300340652466, 'rewards/rejected': -0.600648045539856, 'rewards/accuracies': 0.890625, 'rewards/margins': 0.2524449825286865, 'logps/chosen': -206.57798767089844, 'logps/rejected': -302.875, 'logps/ref_chosen': -75.45831298828125, 'logps/ref_rejected': -76.20362854003906, 'logits/chosen': -7.53197717666626, 'logits/rejected': -7.05259895324707, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.0026708655059337616, 'kl/avg_steps': 0.71875, 'epoch': 0.99} 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 674/681 [50:47<00:21, 3.06s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 675/681 [50:50<00:17, 2.98s/it] {'loss': 1.1993, 'grad_norm': 5.335831165313721, 'learning_rate': 1.6138243485910863e-10, 'rewards/chosen': -0.3907131552696228, 'rewards/rejected': -0.6111031174659729, 'rewards/accuracies': 0.765625, 'rewards/margins': 0.2203899621963501, 'logps/chosen': -227.656494140625, 'logps/rejected': -313.17236328125, 'logps/ref_chosen': -79.90953063964844, 'logps/ref_rejected': -81.21824645996094, 'logits/chosen': -7.707300186157227, 'logits/rejected': -7.3889970779418945, 'kl/p_epsilon_steps': 0.765625, 'kl/n_epsilon_steps': 0.234375, 'kl/beta': 0.0026518055237829685, 'kl/avg_steps': 0.53125, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 675/681 [50:50<00:17, 2.98s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 676/681 [50:53<00:15, 3.04s/it] 
{'loss': 1.1933, 'grad_norm': 4.61472749710083, 'learning_rate': 1.1857007165852472e-10, 'rewards/chosen': -0.41136687994003296, 'rewards/rejected': -0.632805585861206, 'rewards/accuracies': 0.796875, 'rewards/margins': 0.2214387059211731, 'logps/chosen': -254.79995727539062, 'logps/rejected': -336.66851806640625, 'logps/ref_chosen': -98.17111206054688, 'logps/ref_rejected': -95.024658203125, 'logits/chosen': -8.043817520141602, 'logits/rejected': -7.283336639404297, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.0026377923786640167, 'kl/avg_steps': 0.59375, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 676/681 [50:53<00:15, 3.04s/it] 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎| 677/681 [50:56<00:11, 2.96s/it] {'loss': 1.186, 'grad_norm': 4.470887660980225, 'learning_rate': 8.23423165278725e-11, 'rewards/chosen': -0.3634149134159088, 'rewards/rejected': -0.5882839560508728, 'rewards/accuracies': 0.84375, 'rewards/margins': 0.22486907243728638, 'logps/chosen': -230.60610961914062, 'logps/rejected': -308.8211669921875, 'logps/ref_chosen': -91.37928009033203, 'logps/ref_rejected': -82.87776947021484, 'logits/chosen': -7.727552890777588, 'logits/rejected': -6.984161376953125, 'kl/p_epsilon_steps': 0.796875, 'kl/n_epsilon_steps': 0.203125, 'kl/beta': 0.002622222760692239, 'kl/avg_steps': 0.59375, 'epoch': 0.99} 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎| 677/681 [50:56<00:11, 2.96s/it] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌| 678/681 [50:59<00:09, 3.00s/it] {'loss': 1.1519, 'grad_norm': 4.6202616691589355, 'learning_rate': 5.270012410216185e-11, 'rewards/chosen': -0.3453512489795685, 'rewards/rejected': -0.6112858057022095, 
'rewards/accuracies': 0.859375, 'rewards/margins': 0.265934556722641, 'logps/chosen': -208.89874267578125, 'logps/rejected': -323.3565979003906, 'logps/ref_chosen': -75.64586639404297, 'logps/ref_rejected': -86.96611022949219, 'logits/chosen': -7.629189491271973, 'logits/rejected': -6.984808921813965, 'kl/p_epsilon_steps': 0.859375, 'kl/n_epsilon_steps': 0.140625, 'kl/beta': 0.002606745343655348, 'kl/avg_steps': 0.71875, 'epoch': 1.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌| 678/681 [50:59<00:09, 3.00s/it] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 679/681 [51:02<00:06, 3.01s/it] {'loss': 1.1954, 'grad_norm': 4.485384464263916, 'learning_rate': 2.9644275480772416e-11, 'rewards/chosen': -0.39101141691207886, 'rewards/rejected': -0.6033735871315002, 'rewards/accuracies': 0.875, 'rewards/margins': 0.2123621702194214, 'logps/chosen': -232.62774658203125, 'logps/rejected': -317.65594482421875, 'logps/ref_chosen': -80.77344512939453, 'logps/ref_rejected': -82.87850189208984, 'logits/chosen': -7.64573860168457, 'logits/rejected': -6.9826436042785645, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.0025881431065499783, 'kl/avg_steps': 0.65625, 'epoch': 1.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 679/681 [51:02<00:06, 3.01s/it] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 680/681 [51:05<00:03, 3.03s/it] {'loss': 1.1887, 'grad_norm': 4.077805042266846, 'learning_rate': 1.31753782067201e-11, 'rewards/chosen': -0.3943979740142822, 'rewards/rejected': -0.6201244592666626, 'rewards/accuracies': 0.859375, 'rewards/margins': 0.22572645545005798, 'logps/chosen': -261.8695373535156, 'logps/rejected': -359.0973815917969, 'logps/ref_chosen': 
-107.68292999267578, 'logps/ref_rejected': -116.09486389160156, 'logits/chosen': -7.682188987731934, 'logits/rejected': -7.54879093170166, 'kl/p_epsilon_steps': 0.8125, 'kl/n_epsilon_steps': 0.1875, 'kl/beta': 0.0025712691713124514, 'kl/avg_steps': 0.625, 'epoch': 1.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 680/681 [51:05<00:03, 3.03s/it] 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [51:08<00:00, 3.02s/it] {'loss': 1.2117, 'grad_norm': 4.475795269012451, 'learning_rate': 3.2938662507808745e-12, 'rewards/chosen': -0.3517065942287445, 'rewards/rejected': -0.5499787330627441, 'rewards/accuracies': 0.828125, 'rewards/margins': 0.19827213883399963, 'logps/chosen': -231.38394165039062, 'logps/rejected': -311.75775146484375, 'logps/ref_chosen': -93.01106262207031, 'logps/ref_rejected': -94.82217407226562, 'logits/chosen': -7.837874412536621, 'logits/rejected': -7.355884552001953, 'kl/p_epsilon_steps': 0.828125, 'kl/n_epsilon_steps': 0.171875, 'kl/beta': 0.002555298386141658, 'kl/avg_steps': 0.65625, 'epoch': 1.0} 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [51:08<00:00, 3.02s/it][INFO|trainer.py:3984] 2026-04-24 05:07:30,523 >> Saving model checkpoint to /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-681 [INFO|configuration_utils.py:419] 2026-04-24 05:07:30,528 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-681/config.json [INFO|configuration_utils.py:911] 2026-04-24 05:07:30,533 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-681/generation_config.json [INFO|modeling_utils.py:3580] 2026-04-24 
05:08:10,118 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-681/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-24 05:08:10,123 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-681/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-24 05:08:10,142 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-681/special_tokens_map.json
[INFO|trainer.py:4083] 2026-04-24 05:11:08,116 >> Deleting older checkpoint [/scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/checkpoint-400] due to args.save_total_limit
[INFO|trainer.py:2681] 2026-04-24 05:11:10,478 >> Training completed.
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 3308.3388, 'train_samples_per_second': 13.178, 'train_steps_per_second': 0.206, 'train_loss': 1.0212064078200755, 'epoch': 1.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [55:03<00:00, 3.02s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 681/681 [55:03<00:00, 4.85s/it]
***** train metrics *****
  epoch                    = 1.0
  total_flos               = 0GF
  train_loss               = 1.0212
  train_runtime            = 0:55:08.33
  train_samples            = 43598
  train_samples_per_second = 13.178
  train_steps_per_second   = 0.206
2026-04-24 05:11:10 - INFO - __main__ - *** Training complete ***
2026-04-24 05:11:10 - INFO - __main__ - *** Save model ***
[INFO|configuration_utils.py:419] 2026-04-24 05:11:26,900 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/config.json
[INFO|configuration_utils.py:911] 2026-04-24 05:11:26,907 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/generation_config.json
[INFO|modeling_utils.py:3580] 2026-04-24 05:12:11,170 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2026-04-24 05:12:11,174 >> tokenizer config file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2026-04-24 05:12:11,177 >> Special tokens file saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/special_tokens_map.json
2026-04-24 05:12:11 - INFO - __main__ - Saved HF-compatible model artifacts to /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306
[INFO|modelcard.py:450] 2026-04-24 05:12:11,571 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'Anthropic/hh-rlhf', 'type': 'Anthropic/hh-rlhf'}}
[INFO|configuration_utils.py:419] 2026-04-24 05:12:11,579 >> Configuration saved in /scratch/qu.yang1/outputs/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306/config.json
2026-04-24 05:12:11 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:4307] 2026-04-24 05:12:11,587 >> ***** Running Evaluation *****
[INFO|trainer.py:4309] 2026-04-24 05:12:11,587 >>   Num examples = 2339
[INFO|trainer.py:4312] 2026-04-24 05:12:11,587 >>   Batch size = 8
0%| | 0/73 [00:00