Initialize project; model provided by the ModelHub XC community

Model: ali-elganzory/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M-long_sft_16k Source: Original Platform
2026-05-14 20:28:26 +08:00
commit 42efe026c5
17 changed files with 254832 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,61 @@
+---
+library_name: transformers
+license: other
+base_model: open-sci/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M
+tags:
+- llama-factory
+- full
+- generated_from_trainer
+model-index:
+- name: 1.7b-Nemotron-cc-2024-HQ-real-synth-mix-16k
+  results: []
+---
+
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
+# 1.7b-Nemotron-cc-2024-HQ-real-synth-mix-16k
+
+This model is a fine-tuned version of [open-sci/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M](https://huggingface.co/open-sci/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M) on the long_sft dataset.
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 0.0002
+- train_batch_size: 2
+- eval_batch_size: 8
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- gradient_accumulation_steps: 2
+- total_train_batch_size: 32
+- total_eval_batch_size: 64
+- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: cosine
+- lr_scheduler_warmup_ratio: 0.05
+- num_epochs: 1.0
+
+### Training results
+
+
+
+### Framework versions
+
+- Transformers 4.57.0
+- Pytorch 2.6.0+cu124
+- Datasets 4.0.0
+- Tokenizers 0.22.1
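The totals in the README hyperparameter list follow from the per-device values. A minimal sketch of that arithmetic, plus a cosine schedule with linear warmup, is shown here. The helper name `lr_at` and the exact schedule formula are assumptions modeled on the common warmup-then-cosine shape; they are not taken from the training code.

```python
import math

# Values copied from the README hyperparameter list.
train_batch_size = 2              # per device
num_devices = 8
gradient_accumulation_steps = 2
learning_rate = 2e-4
warmup_ratio = 0.05

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps


def lr_at(step: int, total_steps: int) -> float:
    """Hypothetical helper: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size)
```

The product reproduces the `total_train_batch_size: 32` entry; the schedule peaks at `learning_rate` once warmup ends.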
--- a/all_results.json
+++ b/all_results.json
@@ -0,0 +1,8 @@
+{
+    "epoch": 1.0,
+    "total_flos": 889480391426048.0,
+    "train_loss": 0.9303844255645997,
+    "train_runtime": 18614.4314,
+    "train_samples_per_second": 2.79,
+    "train_steps_per_second": 0.087
+}
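The throughput figures in all_results.json can be cross-checked against each other. The sketch below derives approximate step and sample counts from the reported rates. Because the rates are rounded in the file, the results are estimates only, and the variable names are illustrative.

```python
# Values copied from all_results.json.
train_runtime = 18614.4314          # seconds
train_samples_per_second = 2.79
train_steps_per_second = 0.087

# Approximate totals; the reported rates are rounded, so these are estimates.
approx_samples = train_runtime * train_samples_per_second
approx_steps = train_runtime * train_steps_per_second

# Samples consumed per optimizer step, expected to be close to the
# effective batch size listed in the README (32).
samples_per_step = train_samples_per_second / train_steps_per_second

runtime_hours = train_runtime / 3600

print(round(approx_steps), round(approx_samples), round(samples_per_step, 1))
```

The ratio of the two rates lands near 32, which is consistent with the effective batch size reported in the README.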
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,7 @@
+{{ '<|endoftext|>' }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 and system_message is defined %}{% set content = system_message + '
+
+' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<start_of_turn>user
+' + content + '<end_of_turn>
+<start_of_turn>model
+' }}{% elif message['role'] == 'assistant' %}{{ content + '<end_of_turn>
+' }}{% endif %}{% endfor %}
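To make the prompt layout produced by chat_template.jinja easier to see, here is a pure-Python mirror of the template logic. The function `render_chat` is hypothetical and written only for illustration; in practice the tokenizer applies the Jinja template itself.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]]) -> str:
    """Hypothetical mirror of chat_template.jinja for illustration only."""
    out = "<|endoftext|>"
    system_message = None
    # The template indexes messages[0] directly; the emptiness guard here is
    # an addition so that the sketch does not raise on an empty list.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]
    for index, message in enumerate(messages):
        content = message["content"]
        # The system message is folded into the first remaining message.
        if index == 0 and system_message is not None:
            content = system_message + "\n\n" + content
        if message["role"] == "user":
            out += "<start_of_turn>user\n" + content + "<end_of_turn>\n<start_of_turn>model\n"
        elif message["role"] == "assistant":
            out += content + "<end_of_turn>\n"
    return out


example = render_chat(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
)
print(example)
```

Note that every user turn already ends with the `<start_of_turn>model` header, so the rendered prompt is ready for generation without a separate generation-prompt flag.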
--- a/config.json
+++ b/config.json
@@ -0,0 +1,37 @@
+{
+  "architectures": [
+    "OpensciForCausalLM"
+  ],
+  "attention_bias": true,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_opensci.OpensciConfig",
+    "AutoModel": "modeling_opensci.OpensciModel",
+    "AutoModelForCausalLM": "modeling_opensci.OpensciForCausalLM"
+  },
+  "bos_token_id": 0,
+  "dtype": "bfloat16",
+  "eos_token_id": 50277,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 8192,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 16384,
+  "mlp_bias": true,
+  "model_type": "opensci",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 24,
+  "num_key_value_heads": 32,
+  "pad_token_id": 50277,
+  "pretraining_tp": 1,
+  "qk_layernorm": true,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.0",
+  "use_cache": false,
+  "vocab_size": 50304
+}
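The values in config.json are enough for a rough parameter count. The sketch below assumes a Llama-style layout: biased attention and MLP projections, per-head q/k RMSNorm weights of size `head_dim`, two RMSNorm weights per layer, one final RMSNorm, and a tied embedding matrix. The final norm is an assumption, since that part of the modeling code is not shown here.

```python
# Values copied from config.json.
vocab_size = 50304
hidden_size = 2048
intermediate_size = 8192
num_hidden_layers = 24
num_attention_heads = 32
num_key_value_heads = 32
head_dim = 64


def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


attention = (
    linear(hidden_size, num_attention_heads * head_dim)        # q_proj
    + 2 * linear(hidden_size, num_key_value_heads * head_dim)  # k_proj, v_proj
    + linear(num_attention_heads * head_dim, hidden_size)      # o_proj
    + 2 * head_dim                                             # q/k layernorm weights
)
mlp = 2 * linear(hidden_size, intermediate_size) + linear(intermediate_size, hidden_size)
norms = 2 * hidden_size  # input and post-attention RMSNorm weights

per_layer = attention + mlp + norms
embeddings = vocab_size * hidden_size  # tied, so counted once
final_norm = hidden_size               # assumed, see the note above

total_params = num_hidden_layers * per_layer + embeddings + final_norm
bf16_bytes = 2 * total_params

print(total_params, bf16_bytes)
```

Under these assumptions the estimate is about 1.71 billion parameters, or roughly 3.43 GB in bfloat16, which is close to the 3428804400-byte size recorded for model.safetensors.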
--- a/configuration_opensci.py
+++ b/configuration_opensci.py
@@ -0,0 +1,204 @@
+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""OpenSci model configuration"""
+
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+
+
+class OpensciConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of an [`OpensciModel`]. It is used to instantiate an Opensci
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the Opensci-7B.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 32000):
+            Vocabulary size of the Opensci model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`OpensciModel`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 11008):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by mean-pooling all the original heads within that group. For more details, check out [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+            `num_attention_heads`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 2048):
+            The maximum sequence length that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        pad_token_id (`int`, *optional*):
+            Padding token id.
+        bos_token_id (`int`, *optional*, defaults to 1):
+            Beginning of stream token id.
+        eos_token_id (`int`, *optional*, defaults to 2):
+            End of stream token id.
+        pretraining_tp (`int`, *optional*, defaults to 1):
+            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
+            understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
+            results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether to tie weight embeddings
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
+            and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
+            accordingly.
+            Expected contents:
+                `rope_type` (`str`):
+                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+                    'llama3'], with 'default' being the original RoPE implementation.
+                `factor` (`float`, *optional*):
+                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+                    original maximum pre-trained length.
+                `original_max_position_embeddings` (`int`, *optional*):
+                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+                    pretraining.
+                `attention_factor` (`float`, *optional*):
+                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+                    computation. If unspecified, it defaults to value recommended by the implementation, using the
+                    `factor` field to infer the suggested value.
+                `beta_fast` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 32.
+                `beta_slow` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 1.
+                `short_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `long_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `low_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+                `high_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers during self-attention.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        mlp_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
+        head_dim (`int`, *optional*):
+            The attention head dimension. If None, it will default to hidden_size // num_attention_heads
+
+    ```python
+    >>> from transformers import OpensciModel, OpensciConfig
+
+    >>> # Initializing an Opensci-7b style configuration
+    >>> configuration = OpensciConfig()
+
+    >>> # Initializing a model from the Opensci-7b style configuration
+    >>> model = OpensciModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "opensci"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    def __init__(
+        self,
+        vocab_size=32000,
+        hidden_size=4096,
+        intermediate_size=11008,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=None,
+        hidden_act="silu",
+        max_position_embeddings=2048,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        pad_token_id=None,
+        bos_token_id=1,
+        eos_token_id=2,
+        pretraining_tp=1,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        attention_bias=False,
+        attention_dropout=0.0,
+        mlp_bias=False,
+        head_dim=None,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.pretraining_tp = pretraining_tp
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        self.mlp_bias = mlp_bias
+        self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
+        # Validate the correctness of rotary position embeddings parameters
+        # BC: if there is a 'type' field, copy it to 'rope_type'.
+        if self.rope_scaling is not None and "type" in self.rope_scaling:
+            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+        rope_config_validation(self)
+
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
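The `rope_theta` argument sets the base of the rotary frequencies, and `head_dim` falls back to `hidden_size // num_attention_heads` when it is not given. The sketch below computes the standard default RoPE inverse frequencies in plain Python to show the effect of raising the base from the class default of 10000.0 to the 1000000 used in this checkpoint's config.json. The helper names are hypothetical, and the formula is the usual `theta ** (-2i / d)` definition rather than code lifted from this repository.

```python
import math
from typing import List

# head_dim fallback as in OpensciConfig.__init__, using the config.json values.
hidden_size = 2048
num_attention_heads = 32
head_dim = hidden_size // num_attention_heads


def rope_inv_freq(dim: int, rope_theta: float) -> List[float]:
    """Hypothetical helper: default RoPE inverse frequencies theta ** (-2i / dim)."""
    return [rope_theta ** (-(2 * i) / dim) for i in range(dim // 2)]


def longest_wavelength(inv_freq: List[float]) -> float:
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / inv_freq[-1]


inv_default = rope_inv_freq(head_dim, 10000.0)
inv_long = rope_inv_freq(head_dim, 1_000_000.0)

print(head_dim, len(inv_long))
print(longest_wavelength(inv_default), longest_wavelength(inv_long))
```

A larger base lowers every frequency after the first, so the slowest rotary pairs take far more positions to wrap around, which is the usual motivation for raising `rope_theta` in long-context training.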
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,11 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 0,
+  "eos_token_id": [
+    50277,
+    0
+  ],
+  "pad_token_id": 50277,
+  "transformers_version": "4.57.0",
+  "use_cache": false
+}
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b33ff330599bdffb7fa5ea4a838213ad22edefc28ebdb3248a978fb096334cde
+size 3428804400
--- a/modeling_opensci.py
+++ b/modeling_opensci.py
@@ -0,0 +1,984 @@
+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Callable, List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache, StaticCache
+from transformers.generation import GenerationMixin
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPast,
+    CausalLMOutputWithPast,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutputWithPast,
+    TokenClassifierOutput,
+)
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.processing_utils import Unpack
+from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
+from transformers.utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+    replace_return_docstrings,
+)
+from transformers.utils.deprecation import deprecate_kwarg
+from .configuration_opensci import OpensciConfig
+
+
+logger = logging.get_logger(__name__)
+
+_CONFIG_FOR_DOC = "OpensciConfig"
+
+
+class OpensciRMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        OpensciRMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+
+
+ALL_LAYERNORM_LAYERS.append(OpensciRMSNorm)
+
+
+class OpensciRotaryEmbedding(nn.Module):
+    def __init__(self, config: OpensciConfig, device=None):
+        super().__init__()
+        # BC: "rope_type" was originally "type"
+        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+        else:
+            self.rope_type = "default"
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+
+        self.config = config
+        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = self.inv_freq
+
+    def _dynamic_frequency_update(self, position_ids, device):
+        """
+        dynamic RoPE layers should recompute `inv_freq` in the following situations:
+        1 - growing beyond the cached sequence length (allow scaling)
+        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
+        """
+        seq_len = torch.max(position_ids) + 1
+        if seq_len > self.max_seq_len_cached:  # growth
+            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
+            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
+            self.max_seq_len_cached = seq_len
+
+        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
+            # This .to() is needed if the model has been moved to a device after being initialized (because
+            # the buffer is automatically moved, but not the original copy)
+            self.original_inv_freq = self.original_inv_freq.to(device)
+            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
+            self.max_seq_len_cached = self.original_max_seq_len
+
+    @torch.no_grad()
+    def forward(self, x, position_ids):
+        if "dynamic" in self.rope_type:
+            self._dynamic_frequency_update(position_ids, device=x.device)
+
+        # Core RoPE block
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
+        device_type = x.device.type
+        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+
+        # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
+        cos = cos * self.attention_scaling
+        sin = sin * self.attention_scaling
+
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+class OpensciMLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
+        self.act_fn = ACT2FN[config.hidden_act]
+
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+
+
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs,
+):
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+
+    return attn_output, attn_weights
+
+
+class OpensciAttention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config: OpensciConfig, layer_idx: int):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.scaling = self.head_dim**-0.5
+        self.attention_dropout = config.attention_dropout
+        self.is_causal = True
+
+        self.q_proj = nn.Linear(
+            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.k_proj = nn.Linear(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.v_proj = nn.Linear(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.o_proj = nn.Linear(
+            config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
+        )
+        self.qk_layernorm = config.qk_layernorm
+        if self.qk_layernorm:
+            self.q_layernorm = OpensciRMSNorm(config.head_dim, eps=config.rms_norm_eps)
+            self.k_layernorm = OpensciRMSNorm(config.head_dim, eps=config.rms_norm_eps)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_value: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+        if self.qk_layernorm:
+            query_states = self.q_layernorm(query_states)
+            key_states = self.k_layernorm(key_states)
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
+                logger.warning_once(
+                    "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
+                    'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+                )
+            else:
+                attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.attention_dropout,
+            scaling=self.scaling,
+            **kwargs,
+        )
+
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+
+
+class OpensciDecoderLayer(nn.Module):
+    def __init__(self, config: OpensciConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+
+        self.self_attn = OpensciAttention(config=config, layer_idx=layer_idx)
+
+        self.mlp = OpensciMLP(config)
+        self.input_layernorm = OpensciRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = OpensciRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        hidden_states, self_attn_weights = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+        if output_attentions:
+            outputs += (self_attn_weights,)
+
+        return outputs
+
+
+Opensci_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`OpensciConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+@add_start_docstrings(
+    "The bare Opensci Model outputting raw hidden-states without any specific head on top.",
+    Opensci_START_DOCSTRING,
+)
+class OpensciPreTrainedModel(PreTrainedModel):
+    config_class = OpensciConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["OpensciDecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _supports_cache_class = True
+    _supports_quantized_cache = True
+    _supports_static_cache = True
+    _supports_attention_backend = True
+
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+
+
+Opensci_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            [What are attention masks?](../glossary#attention-mask)
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+            `past_key_values`).
+
+            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+            information on the default strategy.
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+            config.n_positions - 1]`.
+
+            [What are position IDs?](../glossary#position-ids)
+        past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
+            Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+            blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
+            returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+
+            Two formats are allowed:
+            - a [`~cache_utils.Cache`] instance, see our
+            [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
+            - Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of
+            shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
+            cache format.
+
+            The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
+            legacy cache format will be returned.
+
+            If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+            have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+            of shape `(batch_size, sequence_length)`.
+        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+            model's internal embedding lookup matrix.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+        cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+            Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
+            this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
+            the complete sequence length.
+"""
+
+
+@add_start_docstrings(
+    "The bare Opensci Model outputting raw hidden-states without any specific head on top.",
+    Opensci_START_DOCSTRING,
+)
+class OpensciModel(OpensciPreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OpensciDecoderLayer`]
+
+    Args:
+        config: OpensciConfig
+    """
+
+    def __init__(self, config: OpensciConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [OpensciDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.norm = OpensciRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = OpensciRotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(Opensci_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+        if self.gradient_checkpointing and self.training and use_cache:
+            logger.warning_once(
+                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+            )
+            use_cache = False
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        if use_cache and past_key_values is None:
+            past_key_values = DynamicCache()
+
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+
+        causal_mask = self._update_causal_mask(
+            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+        )
+
+        hidden_states = inputs_embeds
+
+        # create position embeddings to be shared across the decoder layers
+        position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+
+        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
+                    hidden_states,
+                    causal_mask,
+                    position_ids,
+                    past_key_values,
+                    output_attentions,
+                    use_cache,
+                    cache_position,
+                    position_embeddings,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=causal_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_values,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                    cache_position=cache_position,
+                    position_embeddings=position_embeddings,
+                    **flash_attn_kwargs,
+                )
+
+            hidden_states = layer_outputs[0]
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        output = BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values if use_cache else None,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+        return output if return_dict else output.to_tuple()
+
+    def _update_causal_mask(
+        self,
+        attention_mask: torch.Tensor,
+        input_tensor: torch.Tensor,
+        cache_position: torch.Tensor,
+        past_key_values: Cache,
+        output_attentions: bool,
+    ):
+        if self.config._attn_implementation == "flash_attention_2":
+            if attention_mask is not None and (attention_mask == 0.0).any():
+                return attention_mask
+            return None
+
+        # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+        # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+        # to infer the attention mask.
+        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+        using_static_cache = isinstance(past_key_values, StaticCache)
+
+        # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
+        if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
+            if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                attention_mask,
+                inputs_embeds=input_tensor,
+                past_key_values_length=past_seen_tokens,
+                is_training=self.training,
+            ):
+                return None
+
+        dtype, device = input_tensor.dtype, input_tensor.device
+        sequence_length = input_tensor.shape[1]
+        if using_static_cache:
+            target_length = past_key_values.get_max_cache_shape()
+        else:
+            target_length = (
+                attention_mask.shape[-1]
+                if isinstance(attention_mask, torch.Tensor)
+                else past_seen_tokens + sequence_length + 1
+            )
+
+        # In case the provided `attention_mask` is 2D, we generate a causal mask here (4D).
+        causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
+            attention_mask,
+            sequence_length=sequence_length,
+            target_length=target_length,
+            dtype=dtype,
+            device=device,
+            cache_position=cache_position,
+            batch_size=input_tensor.shape[0],
+        )
+
+        if (
+            self.config._attn_implementation == "sdpa"
+            and attention_mask is not None
+            and attention_mask.device.type in ["cuda", "xpu"]
+            and not output_attentions
+        ):
+            # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+            # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+            # Details: https://github.com/pytorch/pytorch/issues/110213
+            min_dtype = torch.finfo(dtype).min
+            causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+
+        return causal_mask
+
+    @staticmethod
+    def _prepare_4d_causal_attention_mask_with_cache_position(
+        attention_mask: torch.Tensor,
+        sequence_length: int,
+        target_length: int,
+        dtype: torch.dtype,
+        device: torch.device,
+        cache_position: torch.Tensor,
+        batch_size: int,
+        **kwargs,
+    ):
+        """
+        Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+        `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
+
+        Args:
+            attention_mask (`torch.Tensor`):
+                A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
+                `(batch_size, 1, query_length, key_value_length)`.
+            sequence_length (`int`):
+                The sequence length being processed.
+            target_length (`int`):
+                The target length: when generating with static cache, the mask should be as long as the static cache,
+                to account for the 0 padding, the part of the cache that is not filled yet.
+            dtype (`torch.dtype`):
+                The dtype to use for the 4D attention mask.
+            device (`torch.device`):
+                The device to place the 4D attention mask on.
+            cache_position (`torch.Tensor`):
+                Indices depicting the position of the input sequence tokens in the sequence.
+            batch_size (`int`):
+                Batch size.
+        """
+        if attention_mask is not None and attention_mask.dim() == 4:
+            # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+            causal_mask = attention_mask
+        else:
+            min_dtype = torch.finfo(dtype).min
+            causal_mask = torch.full(
+                (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
+            )
+            if sequence_length != 1:
+                causal_mask = torch.triu(causal_mask, diagonal=1)
+            causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
+            causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+            if attention_mask is not None:
+                causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+                mask_length = attention_mask.shape[-1]
+                padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
+                padding_mask = padding_mask == 0
+                causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                    padding_mask, min_dtype
+                )
+
+        return causal_mask
+
+
+class KwargsForCausalLM(FlashAttentionKwargs): ...
+
+
+class OpensciForCausalLM(OpensciPreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = OpensciModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model = decoder
+
+    def get_decoder(self):
+        return self.model
+
+    @deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
+    @add_start_docstrings_to_model_forward(Opensci_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[KwargsForCausalLM],
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are
+                ignored (masked); the loss is only computed for tokens with labels in `[0, ..., config.vocab_size - 1]`.
+
+            logits_to_keep (`int` or `torch.Tensor`, *optional*):
+                If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
+                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
+                This is useful when using packed tensor format (single dimension for batch and sequence length).
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+        >>> # Checkpoint name taken from this repository's model id. `trust_remote_code=True` is assumed to be
+        >>> # required because the Opensci classes ship with the checkpoint rather than with `transformers`.
+        >>> checkpoint = "ali-elganzory/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M-long_sft_16k"
+        >>> model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
+        >>> tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            cache_position=cache_position,
+            **kwargs,
+        )
+
+        hidden_states = outputs[0]
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    The Opensci Model transformer with a sequence classification head on top (linear layer).
+
+    [`OpensciForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+    (e.g. GPT-2) do.
+
+    Since it does classification on the last token, it needs to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
+    each row of the batch).
+    """,
+    Opensci_START_DOCSTRING,
+)
+class OpensciForSequenceClassification(OpensciPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.model = OpensciModel(config)
+        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(Opensci_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+        if self.config.pad_token_id is None:
+            last_non_pad_token = -1
+        elif input_ids is not None:
+            # To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
+            non_pad_mask = (input_ids != self.config.pad_token_id).to(logits.device, torch.int32)
+            token_indices = torch.arange(input_ids.shape[-1], device=logits.device)
+            last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
+        else:
+            last_non_pad_token = -1
+            logger.warning_once(
+                f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
+                "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
+            )
+
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), last_non_pad_token]
+
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
+
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
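Reviewer sketch (not part of the committed file): two index rules in the modeling code above are easier to check on small inputs. The pure-Python functions below restate, for a single batch row, the visibility rule built by `_prepare_4d_causal_attention_mask_with_cache_position` and the rightmost non-pad lookup used by `OpensciForSequenceClassification`. The names `allowed_attention` and `last_non_pad_index` are hypothetical, booleans stand in for the `0` / `min_dtype` fill values, and the sketch assumes `cache_position[i] >= i`, which holds when `cache_position` is derived from `past_seen_tokens` as in `OpensciModel.forward`.

```python
from typing import List, Optional


def allowed_attention(
    query_len: int,
    target_len: int,
    cache_position: List[int],
    attention_mask: Optional[List[int]] = None,
) -> List[List[bool]]:
    """Return allowed[i][j] == True when query i may attend to key j (one batch row)."""
    allowed = []
    for i in range(query_len):
        row = []
        for j in range(target_len):
            # Causal rule: a key is visible only at or before the query's absolute position.
            visible = j <= cache_position[i]
            # Padding rule: keys flagged 0 in the 2D mask are hidden from every query.
            if attention_mask is not None and j < len(attention_mask):
                visible = visible and attention_mask[j] == 1
            row.append(visible)
        allowed.append(row)
    return allowed


def last_non_pad_index(input_ids: List[int], pad_token_id: int) -> int:
    """Rightmost index whose token is not padding (0 if the row is all padding)."""
    # Same trick as `(token_indices * non_pad_mask).argmax(-1)`: padding positions score 0.
    scores = [idx * int(tok != pad_token_id) for idx, tok in enumerate(input_ids)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Left-padded prompt of length 3: position 0 is padding.
    for row in allowed_attention(3, 3, [0, 1, 2], attention_mask=[0, 1, 1]):
        print(row)
    print(last_non_pad_index([0, 0, 5, 6], pad_token_id=0))
```

With left padding, the first query row ends up with no visible key at all. That is the fully masked row that the source comment in `_update_causal_mask` says must be unmasked via `AttentionMaskConverter._unmask_unattended` on the SDPA path.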
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,24 @@
+{
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<end_of_turn>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<end_of_turn>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
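Reviewer sketch (not part of the committed file): in the map above, `pad_token` is a bare string while the other three entries are objects with a `content` field, so code that reads this file directly has to accept both shapes. The snippet below is a minimal illustration under that assumption. `token_strings` is a hypothetical helper, not a tokenizer-library function, and the JSON is copied inline from the diff so the example is self-contained.

```python
import json
from typing import Dict

# Inline copy of special_tokens_map.json as added in this commit.
SPECIAL_TOKENS_MAP = """
{
  "bos_token": {"content": "<|endoftext|>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "eos_token": {"content": "<end_of_turn>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "pad_token": "<end_of_turn>",
  "unk_token": {"content": "<|endoftext|>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false}
}
"""


def token_strings(raw_json: str) -> Dict[str, str]:
    """Map each special-token name to its literal string, whatever the entry's shape."""
    entries = json.loads(raw_json)
    return {
        name: (value["content"] if isinstance(value, dict) else value)
        for name, value in entries.items()
    }


if __name__ == "__main__":
    for name, text in token_strings(SPECIAL_TOKENS_MAP).items():
        print(f"{name}: {text}")
```

Read this way, the padding token is the same string as the end-of-sequence token, and the beginning-of-sequence and unknown tokens also share one string. A padding lookup that compares `input_ids` against `pad_token_id` will therefore also match genuine end-of-turn tokens.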
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,225 @@
+{
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|padding|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50254": {
+      "content": "                        ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50255": {
+      "content": "                       ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50256": {
+      "content": "                      ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50257": {
+      "content": "                     ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50258": {
+      "content": "                    ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50259": {
+      "content": "                   ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50260": {
+      "content": "                  ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50261": {
+      "content": "                 ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50262": {
+      "content": "                ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50263": {
+      "content": "               ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50264": {
+      "content": "              ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50265": {
+      "content": "             ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50266": {
+      "content": "            ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50267": {
+      "content": "           ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50268": {
+      "content": "          ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50269": {
+      "content": "         ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50270": {
+      "content": "        ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50271": {
+      "content": "       ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50272": {
+      "content": "      ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50273": {
+      "content": "     ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50274": {
+      "content": "    ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50275": {
+      "content": "   ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50276": {
+      "content": "  ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50277": {
+      "content": "<end_of_turn>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<end_of_turn>",
+  "extra_special_tokens": {},
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<end_of_turn>",
+  "padding_side": "right",
+  "split_special_tokens": false,
+  "tokenizer_class": "GPTNeoXTokenizer",
+  "unk_token": "<|endoftext|>"
+}
--- a/train_results.json
+++ b/train_results.json
@@ -0,0 +1,8 @@
+{
+    "epoch": 1.0,
+    "total_flos": 889480391426048.0,
+    "train_loss": 0.9303844255645997,
+    "train_runtime": 18614.4314,
+    "train_samples_per_second": 2.79,
+    "train_steps_per_second": 0.087
+}
--- a/trainer_log.jsonl
+++ b/trainer_log.jsonl
@@ -0,0 +1,325 @@
+{"current_steps": 5, "total_steps": 1623, "loss": 1.1518, "lr": 9.756097560975611e-06, "epoch": 0.0030807147258163892, "percentage": 0.31, "elapsed_time": "0:01:05", "remaining_time": "5:53:08"}
+{"current_steps": 10, "total_steps": 1623, "loss": 1.2475, "lr": 2.1951219512195124e-05, "epoch": 0.0061614294516327784, "percentage": 0.62, "elapsed_time": "0:02:03", "remaining_time": "5:31:49"}
+{"current_steps": 15, "total_steps": 1623, "loss": 1.2543, "lr": 3.414634146341464e-05, "epoch": 0.009242144177449169, "percentage": 0.92, "elapsed_time": "0:03:03", "remaining_time": "5:26:59"}
+{"current_steps": 20, "total_steps": 1623, "loss": 1.2007, "lr": 4.634146341463415e-05, "epoch": 0.012322858903265557, "percentage": 1.23, "elapsed_time": "0:04:00", "remaining_time": "5:20:42"}
+{"current_steps": 25, "total_steps": 1623, "loss": 1.0335, "lr": 5.853658536585366e-05, "epoch": 0.015403573629081947, "percentage": 1.54, "elapsed_time": "0:04:38", "remaining_time": "4:56:40"}
+{"current_steps": 30, "total_steps": 1623, "loss": 1.0887, "lr": 7.073170731707317e-05, "epoch": 0.018484288354898338, "percentage": 1.85, "elapsed_time": "0:05:34", "remaining_time": "4:55:56"}
+{"current_steps": 35, "total_steps": 1623, "loss": 1.1421, "lr": 8.292682926829268e-05, "epoch": 0.021565003080714726, "percentage": 2.16, "elapsed_time": "0:06:33", "remaining_time": "4:57:35"}
+{"current_steps": 40, "total_steps": 1623, "loss": 1.1532, "lr": 9.51219512195122e-05, "epoch": 0.024645717806531114, "percentage": 2.46, "elapsed_time": "0:07:31", "remaining_time": "4:57:46"}
+{"current_steps": 45, "total_steps": 1623, "loss": 1.1098, "lr": 0.00010731707317073172, "epoch": 0.027726432532347505, "percentage": 2.77, "elapsed_time": "0:08:27", "remaining_time": "4:56:23"}
+{"current_steps": 50, "total_steps": 1623, "loss": 0.9678, "lr": 0.00011951219512195122, "epoch": 0.030807147258163893, "percentage": 3.08, "elapsed_time": "0:09:07", "remaining_time": "4:47:16"}
+{"current_steps": 55, "total_steps": 1623, "loss": 1.0811, "lr": 0.00013170731707317076, "epoch": 0.033887861983980284, "percentage": 3.39, "elapsed_time": "0:10:10", "remaining_time": "4:50:08"}
+{"current_steps": 60, "total_steps": 1623, "loss": 1.1093, "lr": 0.00014390243902439025, "epoch": 0.036968576709796676, "percentage": 3.7, "elapsed_time": "0:11:13", "remaining_time": "4:52:14"}
+{"current_steps": 65, "total_steps": 1623, "loss": 1.0733, "lr": 0.00015609756097560978, "epoch": 0.04004929143561306, "percentage": 4.0, "elapsed_time": "0:12:13", "remaining_time": "4:52:56"}
+{"current_steps": 70, "total_steps": 1623, "loss": 1.1062, "lr": 0.00016829268292682927, "epoch": 0.04313000616142945, "percentage": 4.31, "elapsed_time": "0:13:09", "remaining_time": "4:51:50"}
+{"current_steps": 75, "total_steps": 1623, "loss": 0.9404, "lr": 0.0001804878048780488, "epoch": 0.04621072088724584, "percentage": 4.62, "elapsed_time": "0:13:47", "remaining_time": "4:44:46"}
+{"current_steps": 80, "total_steps": 1623, "loss": 0.9997, "lr": 0.0001926829268292683, "epoch": 0.04929143561306223, "percentage": 4.93, "elapsed_time": "0:14:41", "remaining_time": "4:43:21"}
+{"current_steps": 85, "total_steps": 1623, "loss": 1.038, "lr": 0.0001999991687649223, "epoch": 0.05237215033887862, "percentage": 5.24, "elapsed_time": "0:15:37", "remaining_time": "4:42:50"}
+{"current_steps": 90, "total_steps": 1623, "loss": 1.1127, "lr": 0.00019998981752900036, "epoch": 0.05545286506469501, "percentage": 5.55, "elapsed_time": "0:16:36", "remaining_time": "4:42:50"}
+{"current_steps": 95, "total_steps": 1623, "loss": 1.1212, "lr": 0.00019997007698817557, "epoch": 0.0585335797905114, "percentage": 5.85, "elapsed_time": "0:17:34", "remaining_time": "4:42:42"}
+{"current_steps": 100, "total_steps": 1623, "loss": 0.8807, "lr": 0.00019993994919356167, "epoch": 0.061614294516327786, "percentage": 6.16, "elapsed_time": "0:18:16", "remaining_time": "4:38:12"}
+{"current_steps": 105, "total_steps": 1623, "loss": 0.9791, "lr": 0.00019989943727554598, "epoch": 0.06469500924214418, "percentage": 6.47, "elapsed_time": "0:19:16", "remaining_time": "4:38:38"}
+{"current_steps": 110, "total_steps": 1623, "loss": 1.0313, "lr": 0.00019984854544346367, "epoch": 0.06777572396796057, "percentage": 6.78, "elapsed_time": "0:20:22", "remaining_time": "4:40:18"}
+{"current_steps": 115, "total_steps": 1623, "loss": 1.073, "lr": 0.00019978727898516086, "epoch": 0.07085643869377696, "percentage": 7.09, "elapsed_time": "0:21:18", "remaining_time": "4:39:20"}
+{"current_steps": 120, "total_steps": 1623, "loss": 0.9766, "lr": 0.0001997156442664449, "epoch": 0.07393715341959335, "percentage": 7.39, "elapsed_time": "0:22:18", "remaining_time": "4:39:30"}
+{"current_steps": 125, "total_steps": 1623, "loss": 0.8606, "lr": 0.00019963364873042298, "epoch": 0.07701786814540973, "percentage": 7.7, "elapsed_time": "0:23:00", "remaining_time": "4:35:48"}
+{"current_steps": 130, "total_steps": 1623, "loss": 1.0665, "lr": 0.0001995413008967289, "epoch": 0.08009858287122612, "percentage": 8.01, "elapsed_time": "0:24:02", "remaining_time": "4:36:02"}
+{"current_steps": 135, "total_steps": 1623, "loss": 1.0262, "lr": 0.00019943861036063768, "epoch": 0.08317929759704251, "percentage": 8.32, "elapsed_time": "0:25:03", "remaining_time": "4:36:09"}
+{"current_steps": 140, "total_steps": 1623, "loss": 1.0675, "lr": 0.00019932558779206874, "epoch": 0.0862600123228589, "percentage": 8.63, "elapsed_time": "0:26:02", "remaining_time": "4:35:55"}
+{"current_steps": 145, "total_steps": 1623, "loss": 1.069, "lr": 0.00019920224493447702, "epoch": 0.0893407270486753, "percentage": 8.93, "elapsed_time": "0:27:02", "remaining_time": "4:35:41"}
+{"current_steps": 150, "total_steps": 1623, "loss": 0.8611, "lr": 0.00019906859460363307, "epoch": 0.09242144177449169, "percentage": 9.24, "elapsed_time": "0:27:43", "remaining_time": "4:32:16"}
+{"current_steps": 155, "total_steps": 1623, "loss": 0.9906, "lr": 0.00019892465068629131, "epoch": 0.09550215650030808, "percentage": 9.55, "elapsed_time": "0:28:39", "remaining_time": "4:31:20"}
+{"current_steps": 160, "total_steps": 1623, "loss": 1.1415, "lr": 0.0001987704281387471, "epoch": 0.09858287122612445, "percentage": 9.86, "elapsed_time": "0:29:39", "remaining_time": "4:31:12"}
+{"current_steps": 165, "total_steps": 1623, "loss": 1.1192, "lr": 0.00019860594298528282, "epoch": 0.10166358595194085, "percentage": 10.17, "elapsed_time": "0:30:40", "remaining_time": "4:31:07"}
+{"current_steps": 170, "total_steps": 1623, "loss": 1.1348, "lr": 0.0001984312123165028, "epoch": 0.10474430067775724, "percentage": 10.47, "elapsed_time": "0:31:37", "remaining_time": "4:30:14"}
+{"current_steps": 175, "total_steps": 1623, "loss": 0.8411, "lr": 0.0001982462542875576, "epoch": 0.10782501540357363, "percentage": 10.78, "elapsed_time": "0:32:18", "remaining_time": "4:27:23"}
+{"current_steps": 180, "total_steps": 1623, "loss": 1.0013, "lr": 0.00019805108811625773, "epoch": 0.11090573012939002, "percentage": 11.09, "elapsed_time": "0:33:22", "remaining_time": "4:27:35"}
+{"current_steps": 185, "total_steps": 1623, "loss": 0.9905, "lr": 0.00019784573408107657, "epoch": 0.11398644485520641, "percentage": 11.4, "elapsed_time": "0:34:29", "remaining_time": "4:28:07"}
+{"current_steps": 190, "total_steps": 1623, "loss": 0.9899, "lr": 0.00019763021351904358, "epoch": 0.1170671595810228, "percentage": 11.71, "elapsed_time": "0:35:27", "remaining_time": "4:27:23"}
+{"current_steps": 195, "total_steps": 1623, "loss": 1.0253, "lr": 0.00019740454882352732, "epoch": 0.12014787430683918, "percentage": 12.01, "elapsed_time": "0:36:29", "remaining_time": "4:27:13"}
+{"current_steps": 200, "total_steps": 1623, "loss": 0.8803, "lr": 0.0001971687634419086, "epoch": 0.12322858903265557, "percentage": 12.32, "elapsed_time": "0:37:10", "remaining_time": "4:24:33"}
+{"current_steps": 205, "total_steps": 1623, "loss": 0.9944, "lr": 0.0001969228818731442, "epoch": 0.12630930375847196, "percentage": 12.63, "elapsed_time": "0:38:12", "remaining_time": "4:24:16"}
+{"current_steps": 210, "total_steps": 1623, "loss": 1.0408, "lr": 0.00019666692966522145, "epoch": 0.12939001848428835, "percentage": 12.94, "elapsed_time": "0:39:18", "remaining_time": "4:24:29"}
+{"current_steps": 215, "total_steps": 1623, "loss": 0.9837, "lr": 0.00019640093341250357, "epoch": 0.13247073321010475, "percentage": 13.25, "elapsed_time": "0:40:13", "remaining_time": "4:23:24"}
+{"current_steps": 220, "total_steps": 1623, "loss": 1.0473, "lr": 0.0001961249207529665, "epoch": 0.13555144793592114, "percentage": 13.56, "elapsed_time": "0:41:09", "remaining_time": "4:22:29"}
+{"current_steps": 225, "total_steps": 1623, "loss": 0.872, "lr": 0.00019583892036532726, "epoch": 0.13863216266173753, "percentage": 13.86, "elapsed_time": "0:41:55", "remaining_time": "4:20:29"}
+{"current_steps": 230, "total_steps": 1623, "loss": 0.9982, "lr": 0.00019554296196606395, "epoch": 0.14171287738755392, "percentage": 14.17, "elapsed_time": "0:42:55", "remaining_time": "4:19:59"}
+{"current_steps": 235, "total_steps": 1623, "loss": 0.9808, "lr": 0.00019523707630632835, "epoch": 0.1447935921133703, "percentage": 14.48, "elapsed_time": "0:44:01", "remaining_time": "4:19:59"}
+{"current_steps": 240, "total_steps": 1623, "loss": 1.045, "lr": 0.00019492129516875055, "epoch": 0.1478743068391867, "percentage": 14.79, "elapsed_time": "0:45:03", "remaining_time": "4:19:40"}
+{"current_steps": 245, "total_steps": 1623, "loss": 1.0869, "lr": 0.00019459565136413666, "epoch": 0.15095502156500307, "percentage": 15.1, "elapsed_time": "0:46:02", "remaining_time": "4:18:58"}
+{"current_steps": 250, "total_steps": 1623, "loss": 0.8503, "lr": 0.0001942601787280598, "epoch": 0.15403573629081946, "percentage": 15.4, "elapsed_time": "0:46:45", "remaining_time": "4:16:49"}
+{"current_steps": 255, "total_steps": 1623, "loss": 0.9952, "lr": 0.00019391491211734425, "epoch": 0.15711645101663585, "percentage": 15.71, "elapsed_time": "0:47:49", "remaining_time": "4:16:35"}
+{"current_steps": 260, "total_steps": 1623, "loss": 0.9793, "lr": 0.0001935598874064438, "epoch": 0.16019716574245224, "percentage": 16.02, "elapsed_time": "0:48:47", "remaining_time": "4:15:48"}
+{"current_steps": 265, "total_steps": 1623, "loss": 0.9427, "lr": 0.00019319514148371435, "epoch": 0.16327788046826863, "percentage": 16.33, "elapsed_time": "0:49:42", "remaining_time": "4:14:44"}
+{"current_steps": 270, "total_steps": 1623, "loss": 1.0259, "lr": 0.00019282071224758091, "epoch": 0.16635859519408502, "percentage": 16.64, "elapsed_time": "0:50:44", "remaining_time": "4:14:13"}
+{"current_steps": 275, "total_steps": 1623, "loss": 0.8559, "lr": 0.00019243663860259993, "epoch": 0.16943930991990142, "percentage": 16.94, "elapsed_time": "0:51:26", "remaining_time": "4:12:10"}
+{"current_steps": 280, "total_steps": 1623, "loss": 0.9995, "lr": 0.00019204296045541685, "epoch": 0.1725200246457178, "percentage": 17.25, "elapsed_time": "0:52:30", "remaining_time": "4:11:49"}
+{"current_steps": 285, "total_steps": 1623, "loss": 0.9409, "lr": 0.0001916397187106199, "epoch": 0.1756007393715342, "percentage": 17.56, "elapsed_time": "0:53:29", "remaining_time": "4:11:09"}
+{"current_steps": 290, "total_steps": 1623, "loss": 0.9847, "lr": 0.00019122695526648968, "epoch": 0.1786814540973506, "percentage": 17.87, "elapsed_time": "0:54:31", "remaining_time": "4:10:38"}
+{"current_steps": 295, "total_steps": 1623, "loss": 1.0982, "lr": 0.00019080471301064598, "epoch": 0.18176216882316698, "percentage": 18.18, "elapsed_time": "0:55:37", "remaining_time": "4:10:24"}
+{"current_steps": 300, "total_steps": 1623, "loss": 0.8299, "lr": 0.00019037303581559143, "epoch": 0.18484288354898337, "percentage": 18.48, "elapsed_time": "0:56:19", "remaining_time": "4:08:22"}
+{"current_steps": 305, "total_steps": 1623, "loss": 0.9737, "lr": 0.00018993196853415317, "epoch": 0.18792359827479976, "percentage": 18.79, "elapsed_time": "0:57:17", "remaining_time": "4:07:34"}
+{"current_steps": 310, "total_steps": 1623, "loss": 0.9551, "lr": 0.00018948155699482244, "epoch": 0.19100431300061615, "percentage": 19.1, "elapsed_time": "0:58:16", "remaining_time": "4:06:51"}
+{"current_steps": 315, "total_steps": 1623, "loss": 1.057, "lr": 0.00018902184799699263, "epoch": 0.19408502772643252, "percentage": 19.41, "elapsed_time": "0:59:13", "remaining_time": "4:05:55"}
+{"current_steps": 320, "total_steps": 1623, "loss": 1.0065, "lr": 0.00018855288930609692, "epoch": 0.1971657424522489, "percentage": 19.72, "elapsed_time": "1:00:12", "remaining_time": "4:05:09"}
+{"current_steps": 325, "total_steps": 1623, "loss": 0.8492, "lr": 0.00018807472964864515, "epoch": 0.2002464571780653, "percentage": 20.02, "elapsed_time": "1:00:52", "remaining_time": "4:03:05"}
+{"current_steps": 330, "total_steps": 1623, "loss": 1.0248, "lr": 0.00018758741870716092, "epoch": 0.2033271719038817, "percentage": 20.33, "elapsed_time": "1:01:50", "remaining_time": "4:02:17"}
+{"current_steps": 335, "total_steps": 1623, "loss": 1.0095, "lr": 0.00018709100711501955, "epoch": 0.20640788662969808, "percentage": 20.64, "elapsed_time": "1:02:57", "remaining_time": "4:02:03"}
+{"current_steps": 340, "total_steps": 1623, "loss": 0.9469, "lr": 0.0001865855464511869, "epoch": 0.20948860135551448, "percentage": 20.95, "elapsed_time": "1:03:56", "remaining_time": "4:01:15"}
+{"current_steps": 345, "total_steps": 1623, "loss": 0.8772, "lr": 0.00018607108923486025, "epoch": 0.21256931608133087, "percentage": 21.26, "elapsed_time": "1:04:57", "remaining_time": "4:00:39"}
+{"current_steps": 350, "total_steps": 1623, "loss": 0.8309, "lr": 0.00018554768892001136, "epoch": 0.21565003080714726, "percentage": 21.57, "elapsed_time": "1:05:38", "remaining_time": "3:58:43"}
+{"current_steps": 355, "total_steps": 1623, "loss": 0.8526, "lr": 0.00018501539988983234, "epoch": 0.21873074553296365, "percentage": 21.87, "elapsed_time": "1:06:36", "remaining_time": "3:57:56"}
+{"current_steps": 360, "total_steps": 1623, "loss": 0.9808, "lr": 0.0001844742774510851, "epoch": 0.22181146025878004, "percentage": 22.18, "elapsed_time": "1:07:44", "remaining_time": "3:57:39"}
+{"current_steps": 365, "total_steps": 1623, "loss": 0.9952, "lr": 0.00018392437782835475, "epoch": 0.22489217498459643, "percentage": 22.49, "elapsed_time": "1:08:44", "remaining_time": "3:56:54"}
+{"current_steps": 370, "total_steps": 1623, "loss": 1.011, "lr": 0.00018336575815820766, "epoch": 0.22797288971041282, "percentage": 22.8, "elapsed_time": "1:09:42", "remaining_time": "3:56:03"}
+{"current_steps": 375, "total_steps": 1623, "loss": 0.8487, "lr": 0.00018279847648325478, "epoch": 0.23105360443622922, "percentage": 23.11, "elapsed_time": "1:10:25", "remaining_time": "3:54:23"}
+{"current_steps": 380, "total_steps": 1623, "loss": 0.9038, "lr": 0.0001822225917461208, "epoch": 0.2341343191620456, "percentage": 23.41, "elapsed_time": "1:11:25", "remaining_time": "3:53:37"}
+{"current_steps": 385, "total_steps": 1623, "loss": 0.9601, "lr": 0.0001816381637833198, "epoch": 0.23721503388786197, "percentage": 23.72, "elapsed_time": "1:12:29", "remaining_time": "3:53:07"}
+{"current_steps": 390, "total_steps": 1623, "loss": 1.0495, "lr": 0.00018104525331903799, "epoch": 0.24029574861367836, "percentage": 24.03, "elapsed_time": "1:13:34", "remaining_time": "3:52:36"}
+{"current_steps": 395, "total_steps": 1623, "loss": 1.0792, "lr": 0.00018044392195882427, "epoch": 0.24337646333949475, "percentage": 24.34, "elapsed_time": "1:14:33", "remaining_time": "3:51:46"}
+{"current_steps": 400, "total_steps": 1623, "loss": 0.8639, "lr": 0.00017983423218318918, "epoch": 0.24645717806531114, "percentage": 24.65, "elapsed_time": "1:15:15", "remaining_time": "3:50:06"}
+{"current_steps": 405, "total_steps": 1623, "loss": 0.9426, "lr": 0.00017921624734111292, "epoch": 0.24953789279112754, "percentage": 24.95, "elapsed_time": "1:16:14", "remaining_time": "3:49:18"}
+{"current_steps": 410, "total_steps": 1623, "loss": 0.9744, "lr": 0.00017859003164346336, "epoch": 0.2526186075169439, "percentage": 25.26, "elapsed_time": "1:17:21", "remaining_time": "3:48:52"}
+{"current_steps": 415, "total_steps": 1623, "loss": 0.9612, "lr": 0.0001779556501563239, "epoch": 0.2556993222427603, "percentage": 25.57, "elapsed_time": "1:18:24", "remaining_time": "3:48:14"}
+{"current_steps": 420, "total_steps": 1623, "loss": 1.034, "lr": 0.00017731316879423327, "epoch": 0.2587800369685767, "percentage": 25.88, "elapsed_time": "1:19:22", "remaining_time": "3:47:20"}
+{"current_steps": 425, "total_steps": 1623, "loss": 0.8632, "lr": 0.00017666265431333654, "epoch": 0.2618607516943931, "percentage": 26.19, "elapsed_time": "1:20:05", "remaining_time": "3:45:45"}
+{"current_steps": 430, "total_steps": 1623, "loss": 0.9842, "lr": 0.000176004174304449, "epoch": 0.2649414664202095, "percentage": 26.49, "elapsed_time": "1:21:05", "remaining_time": "3:44:57"}
+{"current_steps": 435, "total_steps": 1623, "loss": 0.9874, "lr": 0.00017533779718603313, "epoch": 0.2680221811460259, "percentage": 26.8, "elapsed_time": "1:22:13", "remaining_time": "3:44:32"}
+{"current_steps": 440, "total_steps": 1623, "loss": 0.9457, "lr": 0.00017466359219708985, "epoch": 0.2711028958718423, "percentage": 27.11, "elapsed_time": "1:23:09", "remaining_time": "3:43:35"}
+{"current_steps": 445, "total_steps": 1623, "loss": 0.9501, "lr": 0.00017398162938996422, "epoch": 0.27418361059765867, "percentage": 27.42, "elapsed_time": "1:24:09", "remaining_time": "3:42:46"}
+{"current_steps": 450, "total_steps": 1623, "loss": 0.8123, "lr": 0.00017329197962306664, "epoch": 0.27726432532347506, "percentage": 27.73, "elapsed_time": "1:24:48", "remaining_time": "3:41:03"}
+{"current_steps": 455, "total_steps": 1623, "loss": 0.9048, "lr": 0.00017259471455351072, "epoch": 0.28034504004929145, "percentage": 28.03, "elapsed_time": "1:25:47", "remaining_time": "3:40:12"}
+{"current_steps": 460, "total_steps": 1623, "loss": 0.9711, "lr": 0.0001718899066296675, "epoch": 0.28342575477510784, "percentage": 28.34, "elapsed_time": "1:26:50", "remaining_time": "3:39:32"}
+{"current_steps": 465, "total_steps": 1623, "loss": 0.9762, "lr": 0.000171177629083638, "epoch": 0.28650646950092423, "percentage": 28.65, "elapsed_time": "1:27:54", "remaining_time": "3:38:54"}
+{"current_steps": 470, "total_steps": 1623, "loss": 1.0148, "lr": 0.0001704579559236441, "epoch": 0.2895871842267406, "percentage": 28.96, "elapsed_time": "1:28:57", "remaining_time": "3:38:12"}
+{"current_steps": 475, "total_steps": 1623, "loss": 0.786, "lr": 0.00016973096192633884, "epoch": 0.292667898952557, "percentage": 29.27, "elapsed_time": "1:29:38", "remaining_time": "3:36:39"}
+{"current_steps": 480, "total_steps": 1623, "loss": 0.9034, "lr": 0.00016899672262903677, "epoch": 0.2957486136783734, "percentage": 29.57, "elapsed_time": "1:30:38", "remaining_time": "3:35:50"}
+{"current_steps": 485, "total_steps": 1623, "loss": 0.9694, "lr": 0.00016825531432186543, "epoch": 0.2988293284041898, "percentage": 29.88, "elapsed_time": "1:31:36", "remaining_time": "3:34:57"}
+{"current_steps": 490, "total_steps": 1623, "loss": 1.0684, "lr": 0.00016750681403983846, "epoch": 0.30191004313000613, "percentage": 30.19, "elapsed_time": "1:32:34", "remaining_time": "3:34:03"}
+{"current_steps": 495, "total_steps": 1623, "loss": 0.9935, "lr": 0.00016675129955485152, "epoch": 0.3049907578558225, "percentage": 30.5, "elapsed_time": "1:33:32", "remaining_time": "3:33:08"}
+{"current_steps": 500, "total_steps": 1623, "loss": 0.8232, "lr": 0.00016598884936760131, "epoch": 0.3080714725816389, "percentage": 30.81, "elapsed_time": "1:34:19", "remaining_time": "3:31:50"}
+{"current_steps": 505, "total_steps": 1623, "loss": 0.989, "lr": 0.00016521954269942918, "epoch": 0.3111521873074553, "percentage": 31.12, "elapsed_time": "1:35:30", "remaining_time": "3:31:27"}
+{"current_steps": 510, "total_steps": 1623, "loss": 0.9521, "lr": 0.00016444345948408984, "epoch": 0.3142329020332717, "percentage": 31.42, "elapsed_time": "1:36:36", "remaining_time": "3:30:49"}
+{"current_steps": 515, "total_steps": 1623, "loss": 1.0013, "lr": 0.0001636606803594457, "epoch": 0.3173136167590881, "percentage": 31.73, "elapsed_time": "1:37:34", "remaining_time": "3:29:56"}
+{"current_steps": 520, "total_steps": 1623, "loss": 0.9773, "lr": 0.0001628712866590885, "epoch": 0.3203943314849045, "percentage": 32.04, "elapsed_time": "1:38:32", "remaining_time": "3:29:00"}
+{"current_steps": 525, "total_steps": 1623, "loss": 0.8414, "lr": 0.00016207536040388845, "epoch": 0.3234750462107209, "percentage": 32.35, "elapsed_time": "1:39:19", "remaining_time": "3:27:42"}
+{"current_steps": 530, "total_steps": 1623, "loss": 0.9793, "lr": 0.0001612729842934718, "epoch": 0.32655576093653726, "percentage": 32.66, "elapsed_time": "1:40:19", "remaining_time": "3:26:52"}
+{"current_steps": 535, "total_steps": 1623, "loss": 1.0042, "lr": 0.00016046424169762827, "epoch": 0.32963647566235366, "percentage": 32.96, "elapsed_time": "1:41:19", "remaining_time": "3:26:03"}
+{"current_steps": 540, "total_steps": 1623, "loss": 1.0067, "lr": 0.0001596492166476485, "epoch": 0.33271719038817005, "percentage": 33.27, "elapsed_time": "1:42:28", "remaining_time": "3:25:31"}
+{"current_steps": 545, "total_steps": 1623, "loss": 0.9971, "lr": 0.0001588279938275929, "epoch": 0.33579790511398644, "percentage": 33.58, "elapsed_time": "1:43:28", "remaining_time": "3:24:41"}
+{"current_steps": 550, "total_steps": 1623, "loss": 0.7794, "lr": 0.00015800065856549269, "epoch": 0.33887861983980283, "percentage": 33.89, "elapsed_time": "1:44:12", "remaining_time": "3:23:18"}
+{"current_steps": 555, "total_steps": 1623, "loss": 0.9553, "lr": 0.00015716729682448393, "epoch": 0.3419593345656192, "percentage": 34.2, "elapsed_time": "1:45:14", "remaining_time": "3:22:31"}
+{"current_steps": 560, "total_steps": 1623, "loss": 0.9601, "lr": 0.0001563279951938758, "epoch": 0.3450400492914356, "percentage": 34.5, "elapsed_time": "1:46:16", "remaining_time": "3:21:43"}
+{"current_steps": 565, "total_steps": 1623, "loss": 1.0177, "lr": 0.00015548284088015354, "epoch": 0.348120764017252, "percentage": 34.81, "elapsed_time": "1:47:17", "remaining_time": "3:20:54"}
+{"current_steps": 570, "total_steps": 1623, "loss": 0.9958, "lr": 0.00015463192169791741, "epoch": 0.3512014787430684, "percentage": 35.12, "elapsed_time": "1:48:17", "remaining_time": "3:20:02"}
+{"current_steps": 575, "total_steps": 1623, "loss": 0.8352, "lr": 0.0001537753260607584, "epoch": 0.3542821934688848, "percentage": 35.43, "elapsed_time": "1:48:57", "remaining_time": "3:18:35"}
+{"current_steps": 580, "total_steps": 1623, "loss": 0.9472, "lr": 0.00015291314297207175, "epoch": 0.3573629081947012, "percentage": 35.74, "elapsed_time": "1:50:02", "remaining_time": "3:17:53"}
+{"current_steps": 585, "total_steps": 1623, "loss": 0.9853, "lr": 0.0001520454620158093, "epoch": 0.36044362292051757, "percentage": 36.04, "elapsed_time": "1:51:04", "remaining_time": "3:17:05"}
+{"current_steps": 590, "total_steps": 1623, "loss": 0.9141, "lr": 0.00015117237334717117, "epoch": 0.36352433764633396, "percentage": 36.35, "elapsed_time": "1:52:02", "remaining_time": "3:16:10"}
+{"current_steps": 595, "total_steps": 1623, "loss": 1.0516, "lr": 0.00015029396768323846, "epoch": 0.36660505237215035, "percentage": 36.66, "elapsed_time": "1:53:03", "remaining_time": "3:15:20"}
+{"current_steps": 600, "total_steps": 1623, "loss": 0.8681, "lr": 0.00014941033629354734, "epoch": 0.36968576709796674, "percentage": 36.97, "elapsed_time": "1:53:46", "remaining_time": "3:13:59"}
+{"current_steps": 605, "total_steps": 1623, "loss": 0.9942, "lr": 0.00014852157099060596, "epoch": 0.37276648182378314, "percentage": 37.28, "elapsed_time": "1:54:52", "remaining_time": "3:13:17"}
+{"current_steps": 610, "total_steps": 1623, "loss": 1.0202, "lr": 0.00014762776412035456, "epoch": 0.3758471965495995, "percentage": 37.58, "elapsed_time": "1:55:56", "remaining_time": "3:12:31"}
+{"current_steps": 615, "total_steps": 1623, "loss": 0.9941, "lr": 0.00014672900855257056, "epoch": 0.3789279112754159, "percentage": 37.89, "elapsed_time": "1:56:54", "remaining_time": "3:11:37"}
+{"current_steps": 620, "total_steps": 1623, "loss": 0.9866, "lr": 0.00014582539767121904, "epoch": 0.3820086260012323, "percentage": 38.2, "elapsed_time": "1:57:53", "remaining_time": "3:10:43"}
+{"current_steps": 625, "total_steps": 1623, "loss": 0.741, "lr": 0.0001449170253647498, "epoch": 0.3850893407270487, "percentage": 38.51, "elapsed_time": "1:58:35", "remaining_time": "3:09:22"}
+{"current_steps": 630, "total_steps": 1623, "loss": 0.9465, "lr": 0.0001440039860163419, "epoch": 0.38817005545286504, "percentage": 38.82, "elapsed_time": "1:59:38", "remaining_time": "3:08:34"}
+{"current_steps": 635, "total_steps": 1623, "loss": 0.9403, "lr": 0.00014308637449409706, "epoch": 0.39125077017868143, "percentage": 39.13, "elapsed_time": "2:00:39", "remaining_time": "3:07:44"}
+{"current_steps": 640, "total_steps": 1623, "loss": 1.0146, "lr": 0.00014216428614118243, "epoch": 0.3943314849044978, "percentage": 39.43, "elapsed_time": "2:01:45", "remaining_time": "3:07:00"}
+{"current_steps": 645, "total_steps": 1623, "loss": 0.9778, "lr": 0.00014123781676592418, "epoch": 0.3974121996303142, "percentage": 39.74, "elapsed_time": "2:02:50", "remaining_time": "3:06:15"}
+{"current_steps": 650, "total_steps": 1623, "loss": 0.8311, "lr": 0.00014030706263185247, "epoch": 0.4004929143561306, "percentage": 40.05, "elapsed_time": "2:03:34", "remaining_time": "3:04:58"}
+{"current_steps": 655, "total_steps": 1623, "loss": 0.9141, "lr": 0.00013937212044769955, "epoch": 0.403573629081947, "percentage": 40.36, "elapsed_time": "2:04:41", "remaining_time": "3:04:16"}
+{"current_steps": 660, "total_steps": 1623, "loss": 0.9867, "lr": 0.0001384330873573513, "epoch": 0.4066543438077634, "percentage": 40.67, "elapsed_time": "2:05:46", "remaining_time": "3:03:31"}
+{"current_steps": 665, "total_steps": 1623, "loss": 1.0004, "lr": 0.00013749006092975347, "epoch": 0.4097350585335798, "percentage": 40.97, "elapsed_time": "2:06:44", "remaining_time": "3:02:34"}
+{"current_steps": 670, "total_steps": 1623, "loss": 0.9771, "lr": 0.00013654313914877414, "epoch": 0.41281577325939617, "percentage": 41.28, "elapsed_time": "2:07:42", "remaining_time": "3:01:39"}
+{"current_steps": 675, "total_steps": 1623, "loss": 0.7806, "lr": 0.00013559242040302272, "epoch": 0.41589648798521256, "percentage": 41.59, "elapsed_time": "2:08:27", "remaining_time": "3:00:24"}
+{"current_steps": 680, "total_steps": 1623, "loss": 0.9531, "lr": 0.00013463800347562706, "epoch": 0.41897720271102895, "percentage": 41.9, "elapsed_time": "2:09:33", "remaining_time": "2:59:40"}
+{"current_steps": 685, "total_steps": 1623, "loss": 0.8862, "lr": 0.00013367998753396944, "epoch": 0.42205791743684534, "percentage": 42.21, "elapsed_time": "2:10:37", "remaining_time": "2:58:52"}
+{"current_steps": 690, "total_steps": 1623, "loss": 0.978, "lr": 0.00013271847211938285, "epoch": 0.42513863216266173, "percentage": 42.51, "elapsed_time": "2:11:45", "remaining_time": "2:58:08"}
+{"current_steps": 695, "total_steps": 1623, "loss": 1.0035, "lr": 0.0001317535571368082, "epoch": 0.4282193468884781, "percentage": 42.82, "elapsed_time": "2:12:44", "remaining_time": "2:57:14"}
+{"current_steps": 700, "total_steps": 1623, "loss": 0.8737, "lr": 0.00013078534284441382, "epoch": 0.4313000616142945, "percentage": 43.13, "elapsed_time": "2:13:29", "remaining_time": "2:56:01"}
+{"current_steps": 705, "total_steps": 1623, "loss": 0.9117, "lr": 0.00012981392984317834, "epoch": 0.4343807763401109, "percentage": 43.44, "elapsed_time": "2:14:28", "remaining_time": "2:55:06"}
+{"current_steps": 710, "total_steps": 1623, "loss": 0.9657, "lr": 0.00012883941906643786, "epoch": 0.4374614910659273, "percentage": 43.75, "elapsed_time": "2:15:31", "remaining_time": "2:54:16"}
+{"current_steps": 715, "total_steps": 1623, "loss": 0.9081, "lr": 0.00012786191176939848, "epoch": 0.4405422057917437, "percentage": 44.05, "elapsed_time": "2:16:28", "remaining_time": "2:53:19"}
+{"current_steps": 720, "total_steps": 1623, "loss": 0.9299, "lr": 0.00012688150951861582, "epoch": 0.4436229205175601, "percentage": 44.36, "elapsed_time": "2:17:30", "remaining_time": "2:52:28"}
+{"current_steps": 725, "total_steps": 1623, "loss": 0.8259, "lr": 0.00012589831418144154, "epoch": 0.4467036352433765, "percentage": 44.67, "elapsed_time": "2:18:13", "remaining_time": "2:51:13"}
+{"current_steps": 730, "total_steps": 1623, "loss": 0.9407, "lr": 0.00012491242791543922, "epoch": 0.44978434996919286, "percentage": 44.98, "elapsed_time": "2:19:18", "remaining_time": "2:50:24"}
+{"current_steps": 735, "total_steps": 1623, "loss": 0.9092, "lr": 0.00012392395315776963, "epoch": 0.45286506469500926, "percentage": 45.29, "elapsed_time": "2:20:18", "remaining_time": "2:49:30"}
+{"current_steps": 740, "total_steps": 1623, "loss": 0.9285, "lr": 0.00012293299261454725, "epoch": 0.45594577942082565, "percentage": 45.59, "elapsed_time": "2:21:18", "remaining_time": "2:48:36"}
+{"current_steps": 745, "total_steps": 1623, "loss": 0.9379, "lr": 0.00012193964925016872, "epoch": 0.45902649414664204, "percentage": 45.9, "elapsed_time": "2:22:13", "remaining_time": "2:47:37"}
+{"current_steps": 750, "total_steps": 1623, "loss": 0.7754, "lr": 0.00012094402627661447, "epoch": 0.46210720887245843, "percentage": 46.21, "elapsed_time": "2:22:56", "remaining_time": "2:46:23"}
+{"current_steps": 755, "total_steps": 1623, "loss": 0.9358, "lr": 0.00011994622714272448, "epoch": 0.4651879235982748, "percentage": 46.52, "elapsed_time": "2:24:02", "remaining_time": "2:45:36"}
+{"current_steps": 760, "total_steps": 1623, "loss": 0.9574, "lr": 0.00011894635552344975, "epoch": 0.4682686383240912, "percentage": 46.83, "elapsed_time": "2:25:04", "remaining_time": "2:44:44"}
+{"current_steps": 765, "total_steps": 1623, "loss": 0.9345, "lr": 0.00011794451530908011, "epoch": 0.4713493530499076, "percentage": 47.13, "elapsed_time": "2:26:07", "remaining_time": "2:43:53"}
+{"current_steps": 770, "total_steps": 1623, "loss": 0.9837, "lr": 0.00011694081059444946, "epoch": 0.47443006777572394, "percentage": 47.44, "elapsed_time": "2:27:11", "remaining_time": "2:43:03"}
+{"current_steps": 775, "total_steps": 1623, "loss": 0.816, "lr": 0.0001159353456681201, "epoch": 0.47751078250154033, "percentage": 47.75, "elapsed_time": "2:27:54", "remaining_time": "2:41:50"}
+{"current_steps": 780, "total_steps": 1623, "loss": 0.9001, "lr": 0.00011492822500154667, "epoch": 0.4805914972273567, "percentage": 48.06, "elapsed_time": "2:28:55", "remaining_time": "2:40:57"}
+{"current_steps": 785, "total_steps": 1623, "loss": 0.8926, "lr": 0.00011391955323822126, "epoch": 0.4836722119531731, "percentage": 48.37, "elapsed_time": "2:29:54", "remaining_time": "2:40:01"}
+{"current_steps": 790, "total_steps": 1623, "loss": 1.0207, "lr": 0.00011290943518280057, "epoch": 0.4867529266789895, "percentage": 48.68, "elapsed_time": "2:30:54", "remaining_time": "2:39:07"}
+{"current_steps": 795, "total_steps": 1623, "loss": 0.9285, "lr": 0.0001118979757902162, "epoch": 0.4898336414048059, "percentage": 48.98, "elapsed_time": "2:31:58", "remaining_time": "2:38:16"}
+{"current_steps": 800, "total_steps": 1623, "loss": 0.8541, "lr": 0.00011088528015476964, "epoch": 0.4929143561306223, "percentage": 49.29, "elapsed_time": "2:32:41", "remaining_time": "2:37:04"}
+{"current_steps": 805, "total_steps": 1623, "loss": 0.9033, "lr": 0.00010987145349921251, "epoch": 0.4959950708564387, "percentage": 49.6, "elapsed_time": "2:33:41", "remaining_time": "2:36:10"}
+{"current_steps": 810, "total_steps": 1623, "loss": 0.9413, "lr": 0.0001088566011638134, "epoch": 0.49907578558225507, "percentage": 49.91, "elapsed_time": "2:34:43", "remaining_time": "2:35:17"}
+{"current_steps": 815, "total_steps": 1623, "loss": 0.9315, "lr": 0.00010784082859541292, "epoch": 0.5021565003080715, "percentage": 50.22, "elapsed_time": "2:35:43", "remaining_time": "2:34:23"}
+{"current_steps": 820, "total_steps": 1623, "loss": 0.9527, "lr": 0.0001068242413364671, "epoch": 0.5052372150338879, "percentage": 50.52, "elapsed_time": "2:36:48", "remaining_time": "2:33:33"}
+{"current_steps": 825, "total_steps": 1623, "loss": 0.8284, "lr": 0.00010580694501408138, "epoch": 0.5083179297597042, "percentage": 50.83, "elapsed_time": "2:37:32", "remaining_time": "2:32:23"}
+{"current_steps": 830, "total_steps": 1623, "loss": 0.8648, "lr": 0.00010478904532903535, "epoch": 0.5113986444855206, "percentage": 51.14, "elapsed_time": "2:38:31", "remaining_time": "2:31:27"}
+{"current_steps": 835, "total_steps": 1623, "loss": 1.0178, "lr": 0.00010377064804480025, "epoch": 0.514479359211337, "percentage": 51.45, "elapsed_time": "2:39:36", "remaining_time": "2:30:37"}
+{"current_steps": 840, "total_steps": 1623, "loss": 0.8944, "lr": 0.00010275185897654971, "epoch": 0.5175600739371534, "percentage": 51.76, "elapsed_time": "2:40:36", "remaining_time": "2:29:42"}
+{"current_steps": 845, "total_steps": 1623, "loss": 0.922, "lr": 0.00010173278398016501, "epoch": 0.5206407886629698, "percentage": 52.06, "elapsed_time": "2:41:40", "remaining_time": "2:28:51"}
+{"current_steps": 850, "total_steps": 1623, "loss": 0.7921, "lr": 0.00010071352894123654, "epoch": 0.5237215033887862, "percentage": 52.37, "elapsed_time": "2:42:24", "remaining_time": "2:27:41"}
+{"current_steps": 855, "total_steps": 1623, "loss": 0.9301, "lr": 9.969419976406165e-05, "epoch": 0.5268022181146026, "percentage": 52.68, "elapsed_time": "2:43:29", "remaining_time": "2:26:51"}
+{"current_steps": 860, "total_steps": 1623, "loss": 0.9367, "lr": 9.867490236064108e-05, "epoch": 0.529882932840419, "percentage": 52.99, "elapsed_time": "2:44:34", "remaining_time": "2:26:00"}
+{"current_steps": 865, "total_steps": 1623, "loss": 1.0116, "lr": 9.765574263967396e-05, "epoch": 0.5329636475662354, "percentage": 53.3, "elapsed_time": "2:45:47", "remaining_time": "2:25:16"}
+{"current_steps": 870, "total_steps": 1623, "loss": 0.915, "lr": 9.66368264955539e-05, "epoch": 0.5360443622920518, "percentage": 53.6, "elapsed_time": "2:46:49", "remaining_time": "2:24:23"}
+{"current_steps": 875, "total_steps": 1623, "loss": 0.8123, "lr": 9.56182597973658e-05, "epoch": 0.5391250770178682, "percentage": 53.91, "elapsed_time": "2:47:32", "remaining_time": "2:23:13"}
+{"current_steps": 880, "total_steps": 1623, "loss": 0.9215, "lr": 9.460014837788605e-05, "epoch": 0.5422057917436846, "percentage": 54.22, "elapsed_time": "2:48:39", "remaining_time": "2:22:24"}
+{"current_steps": 885, "total_steps": 1623, "loss": 0.9195, "lr": 9.358259802258581e-05, "epoch": 0.5452865064695009, "percentage": 54.53, "elapsed_time": "2:49:48", "remaining_time": "2:21:35"}
+{"current_steps": 890, "total_steps": 1623, "loss": 0.9105, "lr": 9.256571445863972e-05, "epoch": 0.5483672211953173, "percentage": 54.84, "elapsed_time": "2:50:47", "remaining_time": "2:20:39"}
+{"current_steps": 895, "total_steps": 1623, "loss": 0.965, "lr": 9.154960334394027e-05, "epoch": 0.5514479359211337, "percentage": 55.14, "elapsed_time": "2:51:46", "remaining_time": "2:19:42"}
+{"current_steps": 900, "total_steps": 1623, "loss": 0.7986, "lr": 9.053437025611973e-05, "epoch": 0.5545286506469501, "percentage": 55.45, "elapsed_time": "2:52:31", "remaining_time": "2:18:36"}
+{"current_steps": 905, "total_steps": 1623, "loss": 0.9545, "lr": 8.952012068158027e-05, "epoch": 0.5576093653727665, "percentage": 55.76, "elapsed_time": "2:53:32", "remaining_time": "2:17:40"}
+{"current_steps": 910, "total_steps": 1623, "loss": 0.9846, "lr": 8.850696000453326e-05, "epoch": 0.5606900800985829, "percentage": 56.07, "elapsed_time": "2:54:36", "remaining_time": "2:16:48"}
+{"current_steps": 915, "total_steps": 1623, "loss": 0.9375, "lr": 8.749499349604993e-05, "epoch": 0.5637707948243993, "percentage": 56.38, "elapsed_time": "2:55:42", "remaining_time": "2:15:57"}
+{"current_steps": 920, "total_steps": 1623, "loss": 0.8851, "lr": 8.64843263031228e-05, "epoch": 0.5668515095502157, "percentage": 56.69, "elapsed_time": "2:56:42", "remaining_time": "2:15:01"}
+{"current_steps": 925, "total_steps": 1623, "loss": 0.7475, "lr": 8.547506343774097e-05, "epoch": 0.5699322242760321, "percentage": 56.99, "elapsed_time": "2:57:23", "remaining_time": "2:13:51"}
+{"current_steps": 930, "total_steps": 1623, "loss": 1.0023, "lr": 8.446730976597878e-05, "epoch": 0.5730129390018485, "percentage": 57.3, "elapsed_time": "2:58:28", "remaining_time": "2:12:59"}
+{"current_steps": 935, "total_steps": 1623, "loss": 0.9047, "lr": 8.346116999709975e-05, "epoch": 0.5760936537276649, "percentage": 57.61, "elapsed_time": "2:59:31", "remaining_time": "2:12:06"}
+{"current_steps": 940, "total_steps": 1623, "loss": 0.9262, "lr": 8.245674867267724e-05, "epoch": 0.5791743684534812, "percentage": 57.92, "elapsed_time": "3:00:36", "remaining_time": "2:11:14"}
+{"current_steps": 945, "total_steps": 1623, "loss": 0.9537, "lr": 8.145415015573183e-05, "epoch": 0.5822550831792976, "percentage": 58.23, "elapsed_time": "3:01:35", "remaining_time": "2:10:17"}
+{"current_steps": 950, "total_steps": 1623, "loss": 0.7926, "lr": 8.045347861988789e-05, "epoch": 0.585335797905114, "percentage": 58.53, "elapsed_time": "3:02:20", "remaining_time": "2:09:10"}
+{"current_steps": 955, "total_steps": 1623, "loss": 0.9144, "lr": 7.945483803854936e-05, "epoch": 0.5884165126309304, "percentage": 58.84, "elapsed_time": "3:03:17", "remaining_time": "2:08:12"}
+{"current_steps": 960, "total_steps": 1623, "loss": 1.0055, "lr": 7.845833217409675e-05, "epoch": 0.5914972273567468, "percentage": 59.15, "elapsed_time": "3:04:25", "remaining_time": "2:07:22"}
+{"current_steps": 965, "total_steps": 1623, "loss": 0.9012, "lr": 7.746406456710564e-05, "epoch": 0.5945779420825632, "percentage": 59.46, "elapsed_time": "3:05:24", "remaining_time": "2:06:25"}
+{"current_steps": 970, "total_steps": 1623, "loss": 0.9128, "lr": 7.64721385255886e-05, "epoch": 0.5976586568083796, "percentage": 59.77, "elapsed_time": "3:06:23", "remaining_time": "2:05:28"}
+{"current_steps": 975, "total_steps": 1623, "loss": 0.7712, "lr": 7.548265711426104e-05, "epoch": 0.600739371534196, "percentage": 60.07, "elapsed_time": "3:07:07", "remaining_time": "2:04:21"}
+{"current_steps": 980, "total_steps": 1623, "loss": 0.9942, "lr": 7.449572314383237e-05, "epoch": 0.6038200862600123, "percentage": 60.38, "elapsed_time": "3:08:10", "remaining_time": "2:03:28"}
+{"current_steps": 985, "total_steps": 1623, "loss": 0.9889, "lr": 7.351143916032374e-05, "epoch": 0.6069008009858287, "percentage": 60.69, "elapsed_time": "3:09:13", "remaining_time": "2:02:34"}
+{"current_steps": 990, "total_steps": 1623, "loss": 0.9398, "lr": 7.252990743441293e-05, "epoch": 0.609981515711645, "percentage": 61.0, "elapsed_time": "3:10:14", "remaining_time": "2:01:38"}
+{"current_steps": 995, "total_steps": 1623, "loss": 1.0196, "lr": 7.155122995080827e-05, "epoch": 0.6130622304374614, "percentage": 61.31, "elapsed_time": "3:11:15", "remaining_time": "2:00:42"}
+{"current_steps": 1000, "total_steps": 1623, "loss": 0.803, "lr": 7.057550839765188e-05, "epoch": 0.6161429451632778, "percentage": 61.61, "elapsed_time": "3:11:58", "remaining_time": "1:59:36"}
+{"current_steps": 1005, "total_steps": 1623, "loss": 0.9066, "lr": 6.960284415595407e-05, "epoch": 0.6192236598890942, "percentage": 61.92, "elapsed_time": "3:13:06", "remaining_time": "1:58:44"}
+{"current_steps": 1010, "total_steps": 1623, "loss": 1.0486, "lr": 6.863333828905929e-05, "epoch": 0.6223043746149106, "percentage": 62.23, "elapsed_time": "3:14:12", "remaining_time": "1:57:52"}
+{"current_steps": 1015, "total_steps": 1623, "loss": 0.9454, "lr": 6.766709153214542e-05, "epoch": 0.625385089340727, "percentage": 62.54, "elapsed_time": "3:15:16", "remaining_time": "1:56:58"}
+{"current_steps": 1020, "total_steps": 1623, "loss": 0.9561, "lr": 6.670420428175705e-05, "epoch": 0.6284658040665434, "percentage": 62.85, "elapsed_time": "3:16:15", "remaining_time": "1:56:01"}
+{"current_steps": 1025, "total_steps": 1623, "loss": 0.7882, "lr": 6.574477658537375e-05, "epoch": 0.6315465187923598, "percentage": 63.15, "elapsed_time": "3:17:00", "remaining_time": "1:54:56"}
+{"current_steps": 1030, "total_steps": 1623, "loss": 0.8443, "lr": 6.4788908131015e-05, "epoch": 0.6346272335181762, "percentage": 63.46, "elapsed_time": "3:17:58", "remaining_time": "1:53:59"}
+{"current_steps": 1035, "total_steps": 1623, "loss": 0.8491, "lr": 6.38366982368819e-05, "epoch": 0.6377079482439926, "percentage": 63.77, "elapsed_time": "3:19:01", "remaining_time": "1:53:04"}
+{"current_steps": 1040, "total_steps": 1623, "loss": 0.9222, "lr": 6.288824584103816e-05, "epoch": 0.640788662969809, "percentage": 64.08, "elapsed_time": "3:20:01", "remaining_time": "1:52:07"}
+{"current_steps": 1045, "total_steps": 1623, "loss": 0.9085, "lr": 6.194364949112953e-05, "epoch": 0.6438693776956254, "percentage": 64.39, "elapsed_time": "3:21:04", "remaining_time": "1:51:12"}
+{"current_steps": 1050, "total_steps": 1623, "loss": 0.8007, "lr": 6.100300733414474e-05, "epoch": 0.6469500924214417, "percentage": 64.7, "elapsed_time": "3:21:47", "remaining_time": "1:50:07"}
+{"current_steps": 1055, "total_steps": 1623, "loss": 0.8945, "lr": 6.0066417106217455e-05, "epoch": 0.6500308071472581, "percentage": 65.0, "elapsed_time": "3:22:48", "remaining_time": "1:49:11"}
+{"current_steps": 1060, "total_steps": 1623, "loss": 0.9188, "lr": 5.9133976122471214e-05, "epoch": 0.6531115218730745, "percentage": 65.31, "elapsed_time": "3:23:51", "remaining_time": "1:48:16"}
+{"current_steps": 1065, "total_steps": 1623, "loss": 0.9509, "lr": 5.82057812669081e-05, "epoch": 0.6561922365988909, "percentage": 65.62, "elapsed_time": "3:24:49", "remaining_time": "1:47:19"}
+{"current_steps": 1070, "total_steps": 1623, "loss": 0.851, "lr": 5.728192898234195e-05, "epoch": 0.6592729513247073, "percentage": 65.93, "elapsed_time": "3:25:50", "remaining_time": "1:46:23"}
+{"current_steps": 1075, "total_steps": 1623, "loss": 0.7561, "lr": 5.6362515260377835e-05, "epoch": 0.6623536660505237, "percentage": 66.24, "elapsed_time": "3:26:33", "remaining_time": "1:45:17"}
+{"current_steps": 1080, "total_steps": 1623, "loss": 0.9267, "lr": 5.544763563143793e-05, "epoch": 0.6654343807763401, "percentage": 66.54, "elapsed_time": "3:27:36", "remaining_time": "1:44:22"}
+{"current_steps": 1085, "total_steps": 1623, "loss": 0.9299, "lr": 5.4537385154835864e-05, "epoch": 0.6685150955021565, "percentage": 66.85, "elapsed_time": "3:28:41", "remaining_time": "1:43:28"}
+{"current_steps": 1090, "total_steps": 1623, "loss": 0.8646, "lr": 5.363185840889935e-05, "epoch": 0.6715958102279729, "percentage": 67.16, "elapsed_time": "3:29:40", "remaining_time": "1:42:31"}
+{"current_steps": 1095, "total_steps": 1623, "loss": 0.9427, "lr": 5.273114948114346e-05, "epoch": 0.6746765249537893, "percentage": 67.47, "elapsed_time": "3:30:39", "remaining_time": "1:41:34"}
+{"current_steps": 1100, "total_steps": 1623, "loss": 0.7519, "lr": 5.1835351958494515e-05, "epoch": 0.6777572396796057, "percentage": 67.78, "elapsed_time": "3:31:24", "remaining_time": "1:40:30"}
+{"current_steps": 1105, "total_steps": 1623, "loss": 0.9132, "lr": 5.094455891756587e-05, "epoch": 0.680837954405422, "percentage": 68.08, "elapsed_time": "3:32:25", "remaining_time": "1:39:34"}
+{"current_steps": 1110, "total_steps": 1623, "loss": 0.9795, "lr": 5.00588629149872e-05, "epoch": 0.6839186691312384, "percentage": 68.39, "elapsed_time": "3:33:26", "remaining_time": "1:38:38"}
+{"current_steps": 1115, "total_steps": 1623, "loss": 0.905, "lr": 4.91783559777873e-05, "epoch": 0.6869993838570548, "percentage": 68.7, "elapsed_time": "3:34:25", "remaining_time": "1:37:41"}
+{"current_steps": 1120, "total_steps": 1623, "loss": 0.909, "lr": 4.830312959383238e-05, "epoch": 0.6900800985828712, "percentage": 69.01, "elapsed_time": "3:35:27", "remaining_time": "1:36:45"}
+{"current_steps": 1125, "total_steps": 1623, "loss": 0.7293, "lr": 4.7433274702319815e-05, "epoch": 0.6931608133086876, "percentage": 69.32, "elapsed_time": "3:36:11", "remaining_time": "1:35:41"}
+{"current_steps": 1130, "total_steps": 1623, "loss": 0.8847, "lr": 4.656888168432962e-05, "epoch": 0.696241528034504, "percentage": 69.62, "elapsed_time": "3:37:12", "remaining_time": "1:34:45"}
+{"current_steps": 1135, "total_steps": 1623, "loss": 0.9697, "lr": 4.571004035343315e-05, "epoch": 0.6993222427603204, "percentage": 69.93, "elapsed_time": "3:38:16", "remaining_time": "1:33:50"}
+{"current_steps": 1140, "total_steps": 1623, "loss": 0.8963, "lr": 4.485683994636144e-05, "epoch": 0.7024029574861368, "percentage": 70.24, "elapsed_time": "3:39:19", "remaining_time": "1:32:55"}
+{"current_steps": 1145, "total_steps": 1623, "loss": 0.9756, "lr": 4.400936911373308e-05, "epoch": 0.7054836722119532, "percentage": 70.55, "elapsed_time": "3:40:21", "remaining_time": "1:31:59"}
+{"current_steps": 1150, "total_steps": 1623, "loss": 0.7932, "lr": 4.3167715910842966e-05, "epoch": 0.7085643869377696, "percentage": 70.86, "elapsed_time": "3:41:03", "remaining_time": "1:30:55"}
+{"current_steps": 1155, "total_steps": 1623, "loss": 0.9168, "lr": 4.2331967788513295e-05, "epoch": 0.711645101663586, "percentage": 71.16, "elapsed_time": "3:42:03", "remaining_time": "1:29:58"}
+{"current_steps": 1160, "total_steps": 1623, "loss": 0.9272, "lr": 4.1502211584006836e-05, "epoch": 0.7147258163894024, "percentage": 71.47, "elapsed_time": "3:43:03", "remaining_time": "1:29:02"}
+{"current_steps": 1165, "total_steps": 1623, "loss": 0.9724, "lr": 4.067853351200446e-05, "epoch": 0.7178065311152187, "percentage": 71.78, "elapsed_time": "3:44:08", "remaining_time": "1:28:07"}
+{"current_steps": 1170, "total_steps": 1623, "loss": 0.9153, "lr": 3.986101915564695e-05, "epoch": 0.7208872458410351, "percentage": 72.09, "elapsed_time": "3:45:08", "remaining_time": "1:27:10"}
+{"current_steps": 1175, "total_steps": 1623, "loss": 0.7897, "lr": 3.904975345764262e-05, "epoch": 0.7239679605668515, "percentage": 72.4, "elapsed_time": "3:45:49", "remaining_time": "1:26:05"}
+{"current_steps": 1180, "total_steps": 1623, "loss": 0.931, "lr": 3.824482071144163e-05, "epoch": 0.7270486752926679, "percentage": 72.7, "elapsed_time": "3:46:59", "remaining_time": "1:25:12"}
+{"current_steps": 1185, "total_steps": 1623, "loss": 0.905, "lr": 3.744630455247739e-05, "epoch": 0.7301293900184843, "percentage": 73.01, "elapsed_time": "3:48:02", "remaining_time": "1:24:17"}
+{"current_steps": 1190, "total_steps": 1623, "loss": 0.927, "lr": 3.6654287949476626e-05, "epoch": 0.7332101047443007, "percentage": 73.32, "elapsed_time": "3:49:05", "remaining_time": "1:23:21"}
+{"current_steps": 1195, "total_steps": 1623, "loss": 0.9488, "lr": 3.586885319583858e-05, "epoch": 0.7362908194701171, "percentage": 73.63, "elapsed_time": "3:50:08", "remaining_time": "1:22:25"}
+{"current_steps": 1200, "total_steps": 1623, "loss": 0.8075, "lr": 3.5090081901084525e-05, "epoch": 0.7393715341959335, "percentage": 73.94, "elapsed_time": "3:50:49", "remaining_time": "1:21:21"}
+{"current_steps": 1205, "total_steps": 1623, "loss": 0.9658, "lr": 3.431805498237808e-05, "epoch": 0.7424522489217499, "percentage": 74.25, "elapsed_time": "3:51:55", "remaining_time": "1:20:27"}
+{"current_steps": 1210, "total_steps": 1623, "loss": 0.953, "lr": 3.355285265611784e-05, "epoch": 0.7455329636475663, "percentage": 74.55, "elapsed_time": "3:53:02", "remaining_time": "1:19:32"}
+{"current_steps": 1215, "total_steps": 1623, "loss": 0.9542, "lr": 3.279455442960238e-05, "epoch": 0.7486136783733827, "percentage": 74.86, "elapsed_time": "3:54:04", "remaining_time": "1:18:36"}
+{"current_steps": 1220, "total_steps": 1623, "loss": 0.9838, "lr": 3.204323909276924e-05, "epoch": 0.751694393099199, "percentage": 75.17, "elapsed_time": "3:55:09", "remaining_time": "1:17:40"}
+{"current_steps": 1225, "total_steps": 1623, "loss": 0.7694, "lr": 3.1298984710008484e-05, "epoch": 0.7547751078250154, "percentage": 75.48, "elapsed_time": "3:55:53", "remaining_time": "1:16:38"}
+{"current_steps": 1230, "total_steps": 1623, "loss": 0.8751, "lr": 3.056186861205136e-05, "epoch": 0.7578558225508318, "percentage": 75.79, "elapsed_time": "3:56:50", "remaining_time": "1:15:40"}
+{"current_steps": 1235, "total_steps": 1623, "loss": 0.9526, "lr": 2.9831967387935467e-05, "epoch": 0.7609365372766482, "percentage": 76.09, "elapsed_time": "3:57:58", "remaining_time": "1:14:45"}
+{"current_steps": 1240, "total_steps": 1623, "loss": 0.8726, "lr": 2.9109356877046712e-05, "epoch": 0.7640172520024646, "percentage": 76.4, "elapsed_time": "3:59:00", "remaining_time": "1:13:49"}
+{"current_steps": 1245, "total_steps": 1623, "loss": 0.943, "lr": 2.8394112161239605e-05, "epoch": 0.767097966728281, "percentage": 76.71, "elapsed_time": "4:00:01", "remaining_time": "1:12:52"}
+{"current_steps": 1250, "total_steps": 1623, "loss": 0.7294, "lr": 2.7686307557035685e-05, "epoch": 0.7701786814540974, "percentage": 77.02, "elapsed_time": "4:00:44", "remaining_time": "1:11:50"}
+{"current_steps": 1255, "total_steps": 1623, "loss": 0.8862, "lr": 2.6986016607901908e-05, "epoch": 0.7732593961799138, "percentage": 77.33, "elapsed_time": "4:01:47", "remaining_time": "1:10:54"}
+{"current_steps": 1260, "total_steps": 1623, "loss": 0.9054, "lr": 2.629331207660931e-05, "epoch": 0.7763401109057301, "percentage": 77.63, "elapsed_time": "4:02:48", "remaining_time": "1:09:57"}
+{"current_steps": 1265, "total_steps": 1623, "loss": 0.8883, "lr": 2.5608265937672436e-05, "epoch": 0.7794208256315465, "percentage": 77.94, "elapsed_time": "4:03:47", "remaining_time": "1:08:59"}
+{"current_steps": 1270, "total_steps": 1623, "loss": 0.9571, "lr": 2.4930949369871203e-05, "epoch": 0.7825015403573629, "percentage": 78.25, "elapsed_time": "4:04:46", "remaining_time": "1:08:02"}
+{"current_steps": 1275, "total_steps": 1623, "loss": 0.7375, "lr": 2.426143274885493e-05, "epoch": 0.7855822550831792, "percentage": 78.56, "elapsed_time": "4:05:32", "remaining_time": "1:07:01"}
+{"current_steps": 1280, "total_steps": 1623, "loss": 0.8827, "lr": 2.359978563983022e-05, "epoch": 0.7886629698089956, "percentage": 78.87, "elapsed_time": "4:06:37", "remaining_time": "1:06:05"}
+{"current_steps": 1285, "total_steps": 1623, "loss": 0.8892, "lr": 2.2946076790332827e-05, "epoch": 0.791743684534812, "percentage": 79.17, "elapsed_time": "4:07:42", "remaining_time": "1:05:09"}
+{"current_steps": 1290, "total_steps": 1623, "loss": 0.8561, "lr": 2.2300374123084522e-05, "epoch": 0.7948243992606284, "percentage": 79.48, "elapsed_time": "4:08:39", "remaining_time": "1:04:11"}
+{"current_steps": 1295, "total_steps": 1623, "loss": 0.9178, "lr": 2.166274472893567e-05, "epoch": 0.7979051139864448, "percentage": 79.79, "elapsed_time": "4:09:40", "remaining_time": "1:03:14"}
+{"current_steps": 1300, "total_steps": 1623, "loss": 0.7465, "lr": 2.1033254859894226e-05, "epoch": 0.8009858287122612, "percentage": 80.1, "elapsed_time": "4:10:20", "remaining_time": "1:02:11"}
+{"current_steps": 1305, "total_steps": 1623, "loss": 0.8865, "lr": 2.041196992224206e-05, "epoch": 0.8040665434380776, "percentage": 80.41, "elapsed_time": "4:11:16", "remaining_time": "1:01:13"}
+{"current_steps": 1310, "total_steps": 1623, "loss": 0.8778, "lr": 1.9798954469738762e-05, "epoch": 0.807147258163894, "percentage": 80.71, "elapsed_time": "4:12:19", "remaining_time": "1:00:17"}
+{"current_steps": 1315, "total_steps": 1623, "loss": 0.9287, "lr": 1.919427219691453e-05, "epoch": 0.8102279728897104, "percentage": 81.02, "elapsed_time": "4:13:17", "remaining_time": "0:59:19"}
+{"current_steps": 1320, "total_steps": 1623, "loss": 0.8981, "lr": 1.8597985932451856e-05, "epoch": 0.8133086876155268, "percentage": 81.33, "elapsed_time": "4:14:14", "remaining_time": "0:58:21"}
+{"current_steps": 1325, "total_steps": 1623, "loss": 0.7387, "lr": 1.8010157632657543e-05, "epoch": 0.8163894023413432, "percentage": 81.64, "elapsed_time": "4:14:54", "remaining_time": "0:57:19"}
+{"current_steps": 1330, "total_steps": 1623, "loss": 0.9106, "lr": 1.7430848375025176e-05, "epoch": 0.8194701170671596, "percentage": 81.95, "elapsed_time": "4:15:52", "remaining_time": "0:56:22"}
+{"current_steps": 1335, "total_steps": 1623, "loss": 0.9232, "lr": 1.686011835188891e-05, "epoch": 0.822550831792976, "percentage": 82.26, "elapsed_time": "4:16:50", "remaining_time": "0:55:24"}
+{"current_steps": 1340, "total_steps": 1623, "loss": 0.9458, "lr": 1.6298026864169335e-05, "epoch": 0.8256315465187923, "percentage": 82.56, "elapsed_time": "4:17:47", "remaining_time": "0:54:26"}
+{"current_steps": 1345, "total_steps": 1623, "loss": 0.9359, "lr": 1.5744632315211815e-05, "epoch": 0.8287122612446087, "percentage": 82.87, "elapsed_time": "4:18:42", "remaining_time": "0:53:28"}
+{"current_steps": 1350, "total_steps": 1623, "loss": 0.7866, "lr": 1.5199992204718294e-05, "epoch": 0.8317929759704251, "percentage": 83.18, "elapsed_time": "4:19:22", "remaining_time": "0:52:27"}
+{"current_steps": 1355, "total_steps": 1623, "loss": 0.9127, "lr": 1.4664163122772689e-05, "epoch": 0.8348736906962415, "percentage": 83.49, "elapsed_time": "4:20:18", "remaining_time": "0:51:29"}
+{"current_steps": 1360, "total_steps": 1623, "loss": 0.9092, "lr": 1.4137200743961188e-05, "epoch": 0.8379544054220579, "percentage": 83.8, "elapsed_time": "4:21:19", "remaining_time": "0:50:32"}
+{"current_steps": 1365, "total_steps": 1623, "loss": 0.9071, "lr": 1.3619159821587235e-05, "epoch": 0.8410351201478743, "percentage": 84.1, "elapsed_time": "4:22:19", "remaining_time": "0:49:35"}
+{"current_steps": 1370, "total_steps": 1623, "loss": 0.901, "lr": 1.3110094181982657e-05, "epoch": 0.8441158348736907, "percentage": 84.41, "elapsed_time": "4:23:15", "remaining_time": "0:48:36"}
+{"current_steps": 1375, "total_steps": 1623, "loss": 0.7692, "lr": 1.261005671891482e-05, "epoch": 0.8471965495995071, "percentage": 84.72, "elapsed_time": "4:24:02", "remaining_time": "0:47:37"}
+{"current_steps": 1380, "total_steps": 1623, "loss": 0.9479, "lr": 1.2119099388090716e-05, "epoch": 0.8502772643253235, "percentage": 85.03, "elapsed_time": "4:25:02", "remaining_time": "0:46:40"}
+{"current_steps": 1385, "total_steps": 1623, "loss": 0.8972, "lr": 1.1637273201758748e-05, "epoch": 0.8533579790511399, "percentage": 85.34, "elapsed_time": "4:26:01", "remaining_time": "0:45:42"}
+{"current_steps": 1390, "total_steps": 1623, "loss": 0.8494, "lr": 1.1164628223408168e-05, "epoch": 0.8564386937769563, "percentage": 85.64, "elapsed_time": "4:26:56", "remaining_time": "0:44:44"}
+{"current_steps": 1395, "total_steps": 1623, "loss": 0.9043, "lr": 1.0701213562567492e-05, "epoch": 0.8595194085027726, "percentage": 85.95, "elapsed_time": "4:27:57", "remaining_time": "0:43:47"}
+{"current_steps": 1400, "total_steps": 1623, "loss": 0.7521, "lr": 1.0247077369701653e-05, "epoch": 0.862600123228589, "percentage": 86.26, "elapsed_time": "4:28:37", "remaining_time": "0:42:47"}
+{"current_steps": 1405, "total_steps": 1623, "loss": 0.8408, "lr": 9.802266831209206e-06, "epoch": 0.8656808379544054, "percentage": 86.57, "elapsed_time": "4:29:38", "remaining_time": "0:41:50"}
+{"current_steps": 1410, "total_steps": 1623, "loss": 0.8577, "lr": 9.366828164519258e-06, "epoch": 0.8687615526802218, "percentage": 86.88, "elapsed_time": "4:30:37", "remaining_time": "0:40:52"}
+{"current_steps": 1415, "total_steps": 1623, "loss": 0.9402, "lr": 8.940806613289498e-06, "epoch": 0.8718422674060382, "percentage": 87.18, "elapsed_time": "4:31:41", "remaining_time": "0:39:56"}
+{"current_steps": 1420, "total_steps": 1623, "loss": 0.8714, "lr": 8.524246442705153e-06, "epoch": 0.8749229821318546, "percentage": 87.49, "elapsed_time": "4:32:41", "remaining_time": "0:38:58"}
+{"current_steps": 1425, "total_steps": 1623, "loss": 0.7554, "lr": 8.117190934879593e-06, "epoch": 0.878003696857671, "percentage": 87.8, "elapsed_time": "4:33:21", "remaining_time": "0:37:58"}
+{"current_steps": 1430, "total_steps": 1623, "loss": 0.9058, "lr": 7.719682384357308e-06, "epoch": 0.8810844115834874, "percentage": 88.11, "elapsed_time": "4:34:20", "remaining_time": "0:37:01"}
+{"current_steps": 1435, "total_steps": 1623, "loss": 0.9048, "lr": 7.33176209371923e-06, "epoch": 0.8841651263093038, "percentage": 88.42, "elapsed_time": "4:35:18", "remaining_time": "0:36:04"}
+{"current_steps": 1440, "total_steps": 1623, "loss": 0.9097, "lr": 6.953470369291348e-06, "epoch": 0.8872458410351202, "percentage": 88.72, "elapsed_time": "4:36:21", "remaining_time": "0:35:07"}
+{"current_steps": 1445, "total_steps": 1623, "loss": 0.9375, "lr": 6.5848465169566e-06, "epoch": 0.8903265557609366, "percentage": 89.03, "elapsed_time": "4:37:16", "remaining_time": "0:34:09"}
+{"current_steps": 1450, "total_steps": 1623, "loss": 0.7327, "lr": 6.225928838071016e-06, "epoch": 0.893407270486753, "percentage": 89.34, "elapsed_time": "4:37:56", "remaining_time": "0:33:09"}
+{"current_steps": 1455, "total_steps": 1623, "loss": 0.829, "lr": 5.876754625483904e-06, "epoch": 0.8964879852125693, "percentage": 89.65, "elapsed_time": "4:38:52", "remaining_time": "0:32:12"}
+{"current_steps": 1460, "total_steps": 1623, "loss": 0.893, "lr": 5.537360159663108e-06, "epoch": 0.8995686999383857, "percentage": 89.96, "elapsed_time": "4:39:58", "remaining_time": "0:31:15"}
+{"current_steps": 1465, "total_steps": 1623, "loss": 0.8752, "lr": 5.207780704925314e-06, "epoch": 0.9026494146642021, "percentage": 90.26, "elapsed_time": "4:40:59", "remaining_time": "0:30:18"}
+{"current_steps": 1470, "total_steps": 1623, "loss": 0.9341, "lr": 4.888050505771868e-06, "epoch": 0.9057301293900185, "percentage": 90.57, "elapsed_time": "4:42:00", "remaining_time": "0:29:21"}
+{"current_steps": 1475, "total_steps": 1623, "loss": 0.7766, "lr": 4.578202783330799e-06, "epoch": 0.9088108441158349, "percentage": 90.88, "elapsed_time": "4:42:40", "remaining_time": "0:28:21"}
+{"current_steps": 1480, "total_steps": 1623, "loss": 0.8861, "lr": 4.2782697319048605e-06, "epoch": 0.9118915588416513, "percentage": 91.19, "elapsed_time": "4:43:37", "remaining_time": "0:27:24"}
+{"current_steps": 1485, "total_steps": 1623, "loss": 0.8434, "lr": 3.988282515626585e-06, "epoch": 0.9149722735674677, "percentage": 91.5, "elapsed_time": "4:44:37", "remaining_time": "0:26:26"}
+{"current_steps": 1490, "total_steps": 1623, "loss": 0.8912, "lr": 3.7082712652200867e-06, "epoch": 0.9180529882932841, "percentage": 91.81, "elapsed_time": "4:45:33", "remaining_time": "0:25:29"}
+{"current_steps": 1495, "total_steps": 1623, "loss": 0.9744, "lr": 3.438265074870417e-06, "epoch": 0.9211337030191005, "percentage": 92.11, "elapsed_time": "4:46:28", "remaining_time": "0:24:31"}
+{"current_steps": 1500, "total_steps": 1623, "loss": 0.7479, "lr": 3.1782919992006333e-06, "epoch": 0.9242144177449169, "percentage": 92.42, "elapsed_time": "4:47:09", "remaining_time": "0:23:32"}
+{"current_steps": 1505, "total_steps": 1623, "loss": 0.9081, "lr": 2.9283790503567222e-06, "epoch": 0.9272951324707333, "percentage": 92.73, "elapsed_time": "4:48:15", "remaining_time": "0:22:36"}
+{"current_steps": 1510, "total_steps": 1623, "loss": 0.9355, "lr": 2.6885521952010105e-06, "epoch": 0.9303758471965496, "percentage": 93.04, "elapsed_time": "4:49:20", "remaining_time": "0:21:39"}
+{"current_steps": 1515, "total_steps": 1623, "loss": 0.8545, "lr": 2.458836352614069e-06, "epoch": 0.933456561922366, "percentage": 93.35, "elapsed_time": "4:50:16", "remaining_time": "0:20:41"}
+{"current_steps": 1520, "total_steps": 1623, "loss": 0.9361, "lr": 2.239255390905581e-06, "epoch": 0.9365372766481824, "percentage": 93.65, "elapsed_time": "4:51:11", "remaining_time": "0:19:43"}
+{"current_steps": 1525, "total_steps": 1623, "loss": 0.7706, "lr": 2.029832125334319e-06, "epoch": 0.9396179913739988, "percentage": 93.96, "elapsed_time": "4:51:52", "remaining_time": "0:18:45"}
+{"current_steps": 1530, "total_steps": 1623, "loss": 0.842, "lr": 1.8305883157375804e-06, "epoch": 0.9426987060998152, "percentage": 94.27, "elapsed_time": "4:52:48", "remaining_time": "0:17:47"}
+{"current_steps": 1535, "total_steps": 1623, "loss": 0.9651, "lr": 1.6415446642702337e-06, "epoch": 0.9457794208256316, "percentage": 94.58, "elapsed_time": "4:53:47", "remaining_time": "0:16:50"}
+{"current_steps": 1540, "total_steps": 1623, "loss": 0.902, "lr": 1.462720813253682e-06, "epoch": 0.9488601355514479, "percentage": 94.89, "elapsed_time": "4:54:45", "remaining_time": "0:15:53"}
+{"current_steps": 1545, "total_steps": 1623, "loss": 0.9256, "lr": 1.2941353431350056e-06, "epoch": 0.9519408502772643, "percentage": 95.19, "elapsed_time": "4:55:45", "remaining_time": "0:14:55"}
+{"current_steps": 1550, "total_steps": 1623, "loss": 0.7639, "lr": 1.135805770556364e-06, "epoch": 0.9550215650030807, "percentage": 95.5, "elapsed_time": "4:56:26", "remaining_time": "0:13:57"}
+{"current_steps": 1555, "total_steps": 1623, "loss": 0.931, "lr": 9.877485465349058e-07, "epoch": 0.958102279728897, "percentage": 95.81, "elapsed_time": "4:57:30", "remaining_time": "0:13:00"}
+{"current_steps": 1560, "total_steps": 1623, "loss": 0.8409, "lr": 8.499790547535025e-07, "epoch": 0.9611829944547134, "percentage": 96.12, "elapsed_time": "4:58:30", "remaining_time": "0:12:03"}
+{"current_steps": 1565, "total_steps": 1623, "loss": 0.867, "lr": 7.225116099623286e-07, "epoch": 0.9642637091805298, "percentage": 96.43, "elapsed_time": "4:59:34", "remaining_time": "0:11:06"}
+{"current_steps": 1570, "total_steps": 1623, "loss": 0.9427, "lr": 6.053594564914611e-07, "epoch": 0.9673444239063462, "percentage": 96.73, "elapsed_time": "5:00:33", "remaining_time": "0:10:08"}
+{"current_steps": 1575, "total_steps": 1623, "loss": 0.7485, "lr": 4.985347668747809e-07, "epoch": 0.9704251386321626, "percentage": 97.04, "elapsed_time": "5:01:13", "remaining_time": "0:09:10"}
+{"current_steps": 1580, "total_steps": 1623, "loss": 0.9249, "lr": 4.0204864058522864e-07, "epoch": 0.973505853357979, "percentage": 97.35, "elapsed_time": "5:02:09", "remaining_time": "0:08:13"}
+{"current_steps": 1585, "total_steps": 1623, "loss": 0.9969, "lr": 3.15911102881461e-07, "epoch": 0.9765865680837954, "percentage": 97.66, "elapsed_time": "5:03:11", "remaining_time": "0:07:16"}
+{"current_steps": 1590, "total_steps": 1623, "loss": 0.8852, "lr": 2.40131103766239e-07, "epoch": 0.9796672828096118, "percentage": 97.97, "elapsed_time": "5:04:07", "remaining_time": "0:06:18"}
+{"current_steps": 1595, "total_steps": 1623, "loss": 0.9672, "lr": 1.747165170564724e-07, "epoch": 0.9827479975354282, "percentage": 98.27, "elapsed_time": "5:05:06", "remaining_time": "0:05:21"}
+{"current_steps": 1600, "total_steps": 1623, "loss": 0.7987, "lr": 1.1967413956510686e-07, "epoch": 0.9858287122612446, "percentage": 98.58, "elapsed_time": "5:05:44", "remaining_time": "0:04:23"}
+{"current_steps": 1605, "total_steps": 1623, "loss": 0.8614, "lr": 7.500969039491157e-08, "epoch": 0.988909426987061, "percentage": 98.89, "elapsed_time": "5:06:39", "remaining_time": "0:03:26"}
+{"current_steps": 1610, "total_steps": 1623, "loss": 0.9483, "lr": 4.0727810344254325e-08, "epoch": 0.9919901417128774, "percentage": 99.2, "elapsed_time": "5:07:39", "remaining_time": "0:02:29"}
+{"current_steps": 1615, "total_steps": 1623, "loss": 0.884, "lr": 1.6832061424865153e-08, "epoch": 0.9950708564386938, "percentage": 99.51, "elapsed_time": "5:08:36", "remaining_time": "0:01:31"}
+{"current_steps": 1620, "total_steps": 1623, "loss": 0.8332, "lr": 3.3249264917878387e-09, "epoch": 0.9981515711645101, "percentage": 99.82, "elapsed_time": "5:09:33", "remaining_time": "0:00:34"}
+{"current_steps": 1623, "total_steps": 1623, "epoch": 1.0, "percentage": 100.0, "elapsed_time": "5:10:14", "remaining_time": "0:00:00"}
--- a/trainer_state.json
+++ b/trainer_state.json
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7a1cd56e871d61c9b89f86c2576d0195672a4e92fd284eff48c5ba860cc0ec44
+size 7864
--- a/training_loss.png
+++ b/training_loss.png