Initialize project; model provided by the ModelHub XC community
Model: ali-elganzory/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M-long_sft_16k Source: Original Platform
This commit is contained in:
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
61
README.md
Normal file
@@ -0,0 +1,61 @@
---
library_name: transformers
license: other
base_model: open-sci/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M
tags:
- llama-factory
- full
- generated_from_trainer
model-index:
- name: 1.7b-Nemotron-cc-2024-HQ-real-synth-mix-16k
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# 1.7b-Nemotron-cc-2024-HQ-real-synth-mix-16k

This model is a fine-tuned version of [open-sci/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M](https://huggingface.co/open-sci/open-sci-ref-v0.02-1.7b-nemotron-hq-300B-16384-rope_theta-1M) on the long_sft dataset.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- total_eval_batch_size: 64
- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
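
The two `total_*` values in the list above appear to follow from the per-device batch sizes, the device count, and gradient accumulation. A minimal sketch of that arithmetic, using only the numbers listed above (the variable names are illustrative and are not Trainer API):

```python
# Per-device settings copied from the hyperparameter list above.
train_batch_size = 2
eval_batch_size = 8
num_devices = 8
gradient_accumulation_steps = 2

# Effective training batch: every device contributes one micro-batch per
# accumulation step before the optimizer updates the weights.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps

# Evaluation does not accumulate gradients, so only the device count multiplies.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)
```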

### Training results



### Framework versions

- Transformers 4.57.0
- Pytorch 2.6.0+cu124
- Datasets 4.0.0
- Tokenizers 0.22.1
8
all_results.json
Normal file
@@ -0,0 +1,8 @@
{
    "epoch": 1.0,
    "total_flos": 889480391426048.0,
    "train_loss": 0.9303844255645997,
    "train_runtime": 18614.4314,
    "train_samples_per_second": 2.79,
    "train_steps_per_second": 0.087
}
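The throughput figures in all_results.json can be cross-checked against each other. The sketch below is only a rough consistency check: the reported rates are rounded, so the derived step and sample counts are approximations, not values recorded by the run.

```python
# Values copied from all_results.json above.
train_runtime = 18614.4314          # seconds
train_samples_per_second = 2.79
train_steps_per_second = 0.087

# Samples consumed per optimizer step; should sit near the effective batch size.
samples_per_step = train_samples_per_second / train_steps_per_second

# Approximate totals over the whole run (rounded inputs, so only indicative).
approx_steps = train_steps_per_second * train_runtime
approx_samples = train_samples_per_second * train_runtime

print(f"{samples_per_step:.1f} samples/step, ~{approx_steps:.0f} steps, ~{approx_samples:.0f} samples")
```

The ratio lands close to 32, which agrees with the `total_train_batch_size` reported in README.md.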
7
chat_template.jinja
Normal file
@@ -0,0 +1,7 @@
{{ '<|endoftext|>' }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 and system_message is defined %}{% set content = system_message + '

' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<start_of_turn>user
' + content + '<end_of_turn>
<start_of_turn>model
' }}{% elif message['role'] == 'assistant' %}{{ content + '<end_of_turn>
' }}{% endif %}{% endfor %}
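To make the template in chat_template.jinja easier to follow, here is a plain-Python mirror of its logic. This is an illustrative sketch, not how the template is evaluated in practice (that is done by a Jinja engine); the function name `render_chat` is hypothetical, and the guard for an empty message list is an addition the template itself does not have.

```python
def render_chat(messages):
    """Mirror the Jinja chat template step by step (illustrative only)."""
    out = "<|endoftext|>"

    # An optional leading system message is split off and merged into the
    # first remaining message instead of getting its own turn.
    system_message = None
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        loop_messages = messages[1:]
    else:
        loop_messages = messages

    for index, message in enumerate(loop_messages):
        content = message["content"]
        if index == 0 and system_message is not None:
            content = system_message + "\n\n" + content

        if message["role"] == "user":
            # A user turn is closed and a model turn is opened right away.
            out += "<start_of_turn>user\n" + content + "<end_of_turn>\n<start_of_turn>model\n"
        elif message["role"] == "assistant":
            out += content + "<end_of_turn>\n"
        # Any other role is silently skipped, as in the template.
    return out


example = render_chat(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
)
print(example)
```

Note that every user turn ends with an opened `<start_of_turn>model` line, so the rendered prompt is always ready for generation without a separate flag.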
37
config.json
Normal file
@@ -0,0 +1,37 @@
{
  "architectures": [
    "OpensciForCausalLM"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_opensci.OpensciConfig",
    "AutoModel": "modeling_opensci.OpensciModel",
    "AutoModelForCausalLM": "modeling_opensci.OpensciForCausalLM"
  },
  "bos_token_id": 0,
  "dtype": "bfloat16",
  "eos_token_id": 50277,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 16384,
  "mlp_bias": true,
  "model_type": "opensci",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "num_key_value_heads": 32,
  "pad_token_id": 50277,
  "pretraining_tp": 1,
  "qk_layernorm": true,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": true,
  "transformers_version": "4.57.0",
  "use_cache": false,
  "vocab_size": 50304
}
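The values in config.json are enough for a back-of-the-envelope parameter count. The sketch below rests on stated assumptions: every attention and MLP projection carries a bias (`attention_bias` and `mlp_bias` are true), the q/k norms hold one weight vector of size `head_dim` each, each decoder layer has two RMSNorm weight vectors, one further norm follows the last layer, and the LM head shares the embedding matrix (`tie_word_embeddings`). The helper `linear` is hypothetical.

```python
# Values copied from config.json above.
hidden_size = 2048
intermediate_size = 8192
num_hidden_layers = 24
num_attention_heads = 32
num_key_value_heads = 32
head_dim = 64
vocab_size = 50304


def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer with an optional bias vector."""
    return n_in * n_out + (n_out if bias else 0)


attention = (
    linear(hidden_size, num_attention_heads * head_dim)          # q_proj
    + 2 * linear(hidden_size, num_key_value_heads * head_dim)    # k_proj, v_proj
    + linear(num_attention_heads * head_dim, hidden_size)        # o_proj
    + 2 * head_dim                                               # q/k norm weights
)
mlp = 2 * linear(hidden_size, intermediate_size) + linear(intermediate_size, hidden_size)
norms = 2 * hidden_size  # input and post-attention norm weights
per_layer = attention + mlp + norms

embeddings = vocab_size * hidden_size  # shared with the LM head
final_norm = hidden_size               # assumption: one norm after the last layer
total_params = num_hidden_layers * per_layer + embeddings + final_norm
bf16_bytes = 2 * total_params          # bfloat16 stores 2 bytes per parameter

print(f"{total_params:,} parameters, about {bf16_bytes / 1e9:.2f} GB in bfloat16")
```

Under these assumptions the estimate comes out slightly below the 3428804400 bytes listed for model.safetensors in this commit, which would be consistent with a small file header on top of the raw tensors.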
204
configuration_opensci.py
Normal file
@@ -0,0 +1,204 @@
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""OpenSci model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation


class OpensciConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`OpensciModel`]. It is used to instantiate an Opensci
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the Opensci-7B.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the Opensci model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`OpensciModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
            understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
            results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
            and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
            accordingly.
            Expected contents:
                `rope_type` (`str`):
                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
                    'llama3'], with 'default' being the original RoPE implementation.
                `factor` (`float`, *optional*):
                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
                    original maximum pre-trained length.
                `original_max_position_embeddings` (`int`, *optional*):
                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
                    pretraining.
                `attention_factor` (`float`, *optional*):
                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
                    computation. If unspecified, it defaults to value recommended by the implementation, using the
                    `factor` field to infer the suggested value.
                `beta_fast` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
                    ramp function. If unspecified, it defaults to 32.
                `beta_slow` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
                    ramp function. If unspecified, it defaults to 1.
                `short_factor` (`List[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
                    size divided by the number of attention heads divided by 2
                `long_factor` (`List[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
                    size divided by the number of attention heads divided by 2
                `low_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
                `high_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        mlp_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
        head_dim (`int`, *optional*):
            The attention head dimension. If None, it will default to hidden_size // num_attention_heads

    ```python
    >>> from transformers import OpensciModel, OpensciConfig

    >>> # Initializing an Opensci-7b style configuration
    >>> configuration = OpensciConfig()

    >>> # Initializing a model from the Opensci-7b style configuration
    >>> model = OpensciModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "opensci"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        mlp_bias=False,
        head_dim=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
        # Validate the correctness of rotary position embeddings parameters
        # BC: if there is a 'type' field, copy it to 'rope_type'.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
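The `rope_theta` argument documented in configuration_opensci.py is the base of the rotary embedding frequencies. As a rough illustration of what changing it does, the sketch below computes the inverse frequencies of unscaled ("default") RoPE in plain Python, using the `head_dim` and `rope_theta` from config.json. The function name `rope_inv_freq` is hypothetical, and the real model builds these values through the rope init functions in transformers rather than through this code.

```python
import math


def rope_inv_freq(head_dim: int, rope_theta: float) -> list[float]:
    """Inverse frequencies theta^(-2i/d) for i = 0 .. d/2 - 1 (unscaled RoPE)."""
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


# head_dim and rope_theta as listed in config.json of this commit.
inv_freq = rope_inv_freq(head_dim=64, rope_theta=1_000_000.0)

# Each rotary pair completes one full turn every 2*pi / inv_freq positions.
# The slowest pair sets the longest positional wavelength.
longest_wavelength = 2 * math.pi / inv_freq[-1]

# Same quantity with the class default rope_theta of 10000.0, for comparison.
default_longest = 2 * math.pi / rope_inv_freq(64, 10_000.0)[-1]

print(f"{len(inv_freq)} frequencies, longest wavelength ~{longest_wavelength:.3g} positions")
print(f"with the default base: ~{default_longest:.3g} positions")
```

A larger base stretches the slowest rotations over many more positions, which is one common motivation for raising `rope_theta` in long-context models such as this 16384-token configuration.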
11
generation_config.json
Normal file
@@ -0,0 +1,11 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": [
    50277,
    0
  ],
  "pad_token_id": 50277,
  "transformers_version": "4.57.0",
  "use_cache": false
}
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b33ff330599bdffb7fa5ea4a838213ad22edefc28ebdb3248a978fb096334cde
size 3428804400
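The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. A minimal sketch of reading such a pointer in Python follows; the helper `parse_lfs_pointer` is hypothetical and handles only the three keys shown here.

```python
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:b33ff330599bdffb7fa5ea4a838213ad22edefc28ebdb3248a978fb096334cde
size 3428804400
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(pointer_text)
size_gib = pointer["size"] / 2**30
print(pointer["algo"], f"{size_gib:.2f} GiB")
```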
984
modeling_opensci.py
Normal file
@@ -0,0 +1,984 @@
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Callable, List, Optional, Tuple, Union

import torch
import torch.utils.checkpoint
from torch import nn

from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache, StaticCache
from transformers.generation import GenerationMixin
from transformers.modeling_attn_mask_utils import AttentionMaskConverter
from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
from transformers.modeling_outputs import (
    BaseModelOutputWithPast,
    CausalLMOutputWithPast,
    QuestionAnsweringModelOutput,
    SequenceClassifierOutputWithPast,
    TokenClassifierOutput,
)
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from transformers.processing_utils import Unpack
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    logging,
    replace_return_docstrings,
)
from transformers.utils.deprecation import deprecate_kwarg
from .configuration_opensci import OpensciConfig


logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "OpensciConfig"


class OpensciRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        OpensciRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

    def extra_repr(self):
        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"


ALL_LAYERNORM_LAYERS.append(OpensciRMSNorm)


class OpensciRotaryEmbedding(nn.Module):
    def __init__(self, config: OpensciConfig, device=None):
        super().__init__()
        # BC: "rope_type" was originally "type"
        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
        else:
            self.rope_type = "default"
        self.max_seq_len_cached = config.max_position_embeddings
        self.original_max_seq_len = config.max_position_embeddings

        self.config = config
        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = self.inv_freq

    def _dynamic_frequency_update(self, position_ids, device):
        """
        dynamic RoPE layers should recompute `inv_freq` in the following situations:
        1 - growing beyond the cached sequence length (allow scaling)
        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
        """
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
            self.max_seq_len_cached = seq_len

        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
            # This .to() is needed if the model has been moved to a device after being initialized (because
            # the buffer is automatically moved, but not the original copy)
            self.original_inv_freq = self.original_inv_freq.to(device)
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
            self.max_seq_len_cached = self.original_max_seq_len

    @torch.no_grad()
    def forward(self, x, position_ids):
        if "dynamic" in self.rope_type:
            self._dynamic_frequency_update(position_ids, device=x.device)

        # Core RoPE block
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos()
            sin = emb.sin()

        # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
        cos = cos * self.attention_scaling
        sin = sin * self.attention_scaling

        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


class OpensciMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


def eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    **kwargs,
):
    key_states = repeat_kv(key, module.num_key_value_groups)
    value_states = repeat_kv(value, module.num_key_value_groups)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
        attn_weights = attn_weights + causal_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()

    return attn_output, attn_weights


class OpensciAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: OpensciConfig, layer_idx: int):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
        self.scaling = self.head_dim**-0.5
        self.attention_dropout = config.attention_dropout
        self.is_causal = True

        self.q_proj = nn.Linear(
            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
        )
        self.k_proj = nn.Linear(
            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
        )
        self.v_proj = nn.Linear(
            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
        )
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
        )
        self.qk_layernorm = config.qk_layernorm
        if self.qk_layernorm:
            self.q_layernorm = OpensciRMSNorm(config.head_dim, eps=config.rms_norm_eps)
            self.k_layernorm = OpensciRMSNorm(config.head_dim, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
        attention_mask: Optional[torch.Tensor],
        past_key_value: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        input_shape = hidden_states.shape[:-1]
        hidden_shape = (*input_shape, -1, self.head_dim)

        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

        if self.qk_layernorm:
            query_states = self.q_layernorm(query_states)
            key_states = self.k_layernorm(key_states)
        cos, sin = position_embeddings
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        if past_key_value is not None:
            # sin and cos are specific to RoPE models; cache_position needed for the static cache
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        attention_interface: Callable = eager_attention_forward
        if self.config._attn_implementation != "eager":
            if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
                logger.warning_once(
                    "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
                    'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
                )
            else:
                attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            attention_mask,
            dropout=0.0 if not self.training else self.attention_dropout,
            scaling=self.scaling,
            **kwargs,
        )

        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
        attn_output = self.o_proj(attn_output)
        return attn_output, attn_weights
|
||||
|
||||
|
||||
class OpensciDecoderLayer(nn.Module):
    def __init__(self, config: OpensciConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = OpensciAttention(config=config, layer_idx=layer_idx)

        self.mlp = OpensciMLP(config)
        self.input_layernorm = OpensciRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = OpensciRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            cache_position=cache_position,
            position_embeddings=position_embeddings,
            **kwargs,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)
        if output_attentions:
            outputs += (self_attn_weights,)

        return outputs
|
||||
|
||||
|
||||
Opensci_START_DOCSTRING = r"""
|
||||
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
|
||||
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
|
||||
etc.)
|
||||
|
||||
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
|
||||
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
|
||||
and behavior.
|
||||
|
||||
Parameters:
|
||||
config ([`OpensciConfig`]):
|
||||
Model configuration class with all the parameters of the model. Initializing with a config file does not
|
||||
load the weights associated with the model, only the configuration. Check out the
|
||||
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
|
||||
"""
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"The bare Opensci Model outputting raw hidden-states without any specific head on top.",
|
||||
Opensci_START_DOCSTRING,
|
||||
)
|
||||
class OpensciPreTrainedModel(PreTrainedModel):
    config_class = OpensciConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["OpensciDecoderLayer"]
    _skip_keys_device_placement = ["past_key_values"]
    _supports_flash_attn_2 = True
    _supports_sdpa = True
    _supports_flex_attn = True
    _supports_cache_class = True
    _supports_quantized_cache = True
    _supports_static_cache = True
    _supports_attention_backend = True

    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
|
||||
|
||||
|
||||
Opensci_INPUTS_DOCSTRING = r"""
|
||||
Args:
|
||||
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
|
||||
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
|
||||
it.
|
||||
|
||||
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
|
||||
[`PreTrainedTokenizer.__call__`] for details.
|
||||
|
||||
[What are input IDs?](../glossary#input-ids)
|
||||
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
|
||||
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
|
||||
|
||||
- 1 for tokens that are **not masked**,
|
||||
- 0 for tokens that are **masked**.
|
||||
|
||||
[What are attention masks?](../glossary#attention-mask)
|
||||
|
||||
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).

If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
|
||||
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
||||
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
|
||||
config.n_positions - 1]`.
|
||||
|
||||
[What are position IDs?](../glossary#position-ids)
|
||||
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
|
||||
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
|
||||
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
|
||||
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
|
||||
|
||||
Two formats are allowed:
|
||||
- a [`~cache_utils.Cache`] instance, see our
|
||||
[kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
|
||||
- Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
|
||||
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
|
||||
cache format.
|
||||
|
||||
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
|
||||
legacy cache format will be returned.
|
||||
|
||||
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
|
||||
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
|
||||
of shape `(batch_size, sequence_length)`.
|
||||
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
|
||||
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
|
||||
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
|
||||
model's internal embedding lookup matrix.
|
||||
use_cache (`bool`, *optional*):
|
||||
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
|
||||
`past_key_values`).
|
||||
output_attentions (`bool`, *optional*):
|
||||
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
|
||||
tensors for more detail.
|
||||
output_hidden_states (`bool`, *optional*):
|
||||
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
|
||||
more detail.
|
||||
return_dict (`bool`, *optional*):
|
||||
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
||||
cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
|
||||
Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
|
||||
this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
|
||||
the complete sequence length.
|
||||
"""
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"The bare Opensci Model outputting raw hidden-states without any specific head on top.",
|
||||
Opensci_START_DOCSTRING,
|
||||
)
|
||||
class OpensciModel(OpensciPreTrainedModel):
|
||||
"""
|
||||
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OpensciDecoderLayer`]
|
||||
|
||||
Args:
|
||||
config: OpensciConfig
|
||||
"""
|
||||
|
||||
def __init__(self, config: OpensciConfig):
|
||||
super().__init__(config)
|
||||
self.padding_idx = config.pad_token_id
|
||||
self.vocab_size = config.vocab_size
|
||||
|
||||
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
|
||||
self.layers = nn.ModuleList(
|
||||
[OpensciDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
|
||||
)
|
||||
self.norm = OpensciRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
|
||||
self.rotary_emb = OpensciRotaryEmbedding(config=config)
|
||||
self.gradient_checkpointing = False
|
||||
|
||||
# Initialize weights and apply final processing
|
||||
self.post_init()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.embed_tokens
|
||||
|
||||
def set_input_embeddings(self, value):
|
||||
self.embed_tokens = value
|
||||
|
||||
@add_start_docstrings_to_model_forward(Opensci_INPUTS_DOCSTRING)
|
||||
def forward(
|
||||
self,
|
||||
input_ids: torch.LongTensor = None,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_ids: Optional[torch.LongTensor] = None,
|
||||
past_key_values: Optional[Cache] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
use_cache: Optional[bool] = None,
|
||||
output_attentions: Optional[bool] = None,
|
||||
output_hidden_states: Optional[bool] = None,
|
||||
return_dict: Optional[bool] = None,
|
||||
cache_position: Optional[torch.LongTensor] = None,
|
||||
**flash_attn_kwargs: Unpack[FlashAttentionKwargs],
|
||||
) -> Union[Tuple, BaseModelOutputWithPast]:
|
||||
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
||||
output_hidden_states = (
|
||||
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
||||
)
|
||||
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||
|
||||
if (input_ids is None) ^ (inputs_embeds is not None):
|
||||
raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
|
||||
|
||||
if self.gradient_checkpointing and self.training and use_cache:
|
||||
logger.warning_once(
|
||||
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
|
||||
)
|
||||
use_cache = False
|
||||
|
||||
if inputs_embeds is None:
|
||||
inputs_embeds = self.embed_tokens(input_ids)
|
||||
|
||||
if use_cache and past_key_values is None:
|
||||
past_key_values = DynamicCache()
|
||||
|
||||
if cache_position is None:
|
||||
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
|
||||
cache_position = torch.arange(
|
||||
past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
|
||||
)
|
||||
|
||||
if position_ids is None:
|
||||
position_ids = cache_position.unsqueeze(0)
|
||||
|
||||
causal_mask = self._update_causal_mask(
|
||||
attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
|
||||
)
|
||||
|
||||
hidden_states = inputs_embeds
|
||||
|
||||
# create position embeddings to be shared across the decoder layers
|
||||
position_embeddings = self.rotary_emb(hidden_states, position_ids)
|
||||
|
||||
# decoder layers
|
||||
all_hidden_states = () if output_hidden_states else None
|
||||
all_self_attns = () if output_attentions else None
|
||||
|
||||
for decoder_layer in self.layers[: self.config.num_hidden_layers]:
|
||||
if output_hidden_states:
|
||||
all_hidden_states += (hidden_states,)
|
||||
|
||||
if self.gradient_checkpointing and self.training:
|
||||
layer_outputs = self._gradient_checkpointing_func(
|
||||
decoder_layer.__call__,
|
||||
hidden_states,
|
||||
causal_mask,
|
||||
position_ids,
|
||||
past_key_values,
|
||||
output_attentions,
|
||||
use_cache,
|
||||
cache_position,
|
||||
position_embeddings,
|
||||
)
|
||||
else:
|
||||
layer_outputs = decoder_layer(
|
||||
hidden_states,
|
||||
attention_mask=causal_mask,
|
||||
position_ids=position_ids,
|
||||
past_key_value=past_key_values,
|
||||
output_attentions=output_attentions,
|
||||
use_cache=use_cache,
|
||||
cache_position=cache_position,
|
||||
position_embeddings=position_embeddings,
|
||||
**flash_attn_kwargs,
|
||||
)
|
||||
|
||||
hidden_states = layer_outputs[0]
|
||||
|
||||
if output_attentions:
|
||||
all_self_attns += (layer_outputs[1],)
|
||||
|
||||
hidden_states = self.norm(hidden_states)
|
||||
|
||||
# add hidden states from the last decoder layer
|
||||
if output_hidden_states:
|
||||
all_hidden_states += (hidden_states,)
|
||||
|
||||
output = BaseModelOutputWithPast(
|
||||
last_hidden_state=hidden_states,
|
||||
past_key_values=past_key_values if use_cache else None,
|
||||
hidden_states=all_hidden_states,
|
||||
attentions=all_self_attns,
|
||||
)
|
||||
return output if return_dict else output.to_tuple()
|
||||
|
||||
def _update_causal_mask(
|
||||
self,
|
||||
attention_mask: torch.Tensor,
|
||||
input_tensor: torch.Tensor,
|
||||
cache_position: torch.Tensor,
|
||||
past_key_values: Cache,
|
||||
output_attentions: bool,
|
||||
):
|
||||
if self.config._attn_implementation == "flash_attention_2":
|
||||
if attention_mask is not None and (attention_mask == 0.0).any():
|
||||
return attention_mask
|
||||
return None
|
||||
|
||||
# For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
|
||||
# order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
|
||||
# to infer the attention mask.
|
||||
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
|
||||
using_static_cache = isinstance(past_key_values, StaticCache)
|
||||
|
||||
# When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
|
||||
if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
|
||||
if AttentionMaskConverter._ignore_causal_mask_sdpa(
|
||||
attention_mask,
|
||||
inputs_embeds=input_tensor,
|
||||
past_key_values_length=past_seen_tokens,
|
||||
is_training=self.training,
|
||||
):
|
||||
return None
|
||||
|
||||
dtype, device = input_tensor.dtype, input_tensor.device
|
||||
sequence_length = input_tensor.shape[1]
|
||||
if using_static_cache:
|
||||
target_length = past_key_values.get_max_cache_shape()
|
||||
else:
|
||||
target_length = (
|
||||
attention_mask.shape[-1]
|
||||
if isinstance(attention_mask, torch.Tensor)
|
||||
else past_seen_tokens + sequence_length + 1
|
||||
)
|
||||
|
||||
# In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
|
||||
causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
|
||||
attention_mask,
|
||||
sequence_length=sequence_length,
|
||||
target_length=target_length,
|
||||
dtype=dtype,
|
||||
device=device,
|
||||
cache_position=cache_position,
|
||||
batch_size=input_tensor.shape[0],
|
||||
)
|
||||
|
||||
if (
|
||||
self.config._attn_implementation == "sdpa"
|
||||
and attention_mask is not None
|
||||
and attention_mask.device.type in ["cuda", "xpu"]
|
||||
and not output_attentions
|
||||
):
|
||||
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
|
||||
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
|
||||
# Details: https://github.com/pytorch/pytorch/issues/110213
|
||||
min_dtype = torch.finfo(dtype).min
|
||||
causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
|
||||
|
||||
return causal_mask
|
||||
|
||||
    @staticmethod
    def _prepare_4d_causal_attention_mask_with_cache_position(
        attention_mask: torch.Tensor,
        sequence_length: int,
        target_length: int,
        dtype: torch.dtype,
        device: torch.device,
        cache_position: torch.Tensor,
        batch_size: int,
        **kwargs,
    ):
        """
        Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
        `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.

        Args:
            attention_mask (`torch.Tensor`):
                A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
                `(batch_size, 1, query_length, key_value_length)`.
            sequence_length (`int`):
                The sequence length being processed.
            target_length (`int`):
                The target length: when generating with static cache, the mask should be as long as the static cache,
                to account for the 0 padding, the part of the cache that is not filled yet.
            dtype (`torch.dtype`):
                The dtype to use for the 4D attention mask.
            device (`torch.device`):
                The device to place the 4D attention mask on.
            cache_position (`torch.Tensor`):
                Indices depicting the position of the input sequence tokens in the sequence.
            batch_size (`int`):
                Batch size.
        """
        if attention_mask is not None and attention_mask.dim() == 4:
            # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
            causal_mask = attention_mask
        else:
            min_dtype = torch.finfo(dtype).min
            causal_mask = torch.full(
                (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
            )
            if sequence_length != 1:
                causal_mask = torch.triu(causal_mask, diagonal=1)
            causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
            causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
            if attention_mask is not None:
                causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
                mask_length = attention_mask.shape[-1]
                padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
                padding_mask = padding_mask == 0
                causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
                    padding_mask, min_dtype
                )

        return causal_mask
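The masking rule applied by `_prepare_4d_causal_attention_mask_with_cache_position` can be illustrated without tensors. The following is a minimal pure-Python sketch, not the code shipped in this commit: the helper name `causal_mask_rows` is hypothetical, and `float("-inf")` stands in for `torch.finfo(dtype).min`. A key position is blocked when it lies after the query's cache position, or when the 2D padding mask marks it with 0.

```python
NEG_INF = float("-inf")  # assumption: stand-in for torch.finfo(dtype).min


def causal_mask_rows(cache_position, target_length, padding=None):
    """Build one mask row per query token (hypothetical helper).

    0.0 means the key position may be attended to, NEG_INF means it is blocked.
    `padding` mirrors a single row of the 2D attention mask (1 = keep, 0 = pad).
    """
    rows = []
    for q_pos in cache_position:
        row = []
        for k_pos in range(target_length):
            # Causal rule: keys located after the query position are blocked.
            blocked = k_pos > q_pos
            # Padding rule: keys flagged 0 in the 2D mask are blocked as well.
            if padding is not None and k_pos < len(padding) and padding[k_pos] == 0:
                blocked = True
            row.append(NEG_INF if blocked else 0.0)
        rows.append(row)
    return rows


# Two new tokens at positions 2 and 3, attending over 5 key slots.
rows = causal_mask_rows([2, 3], target_length=5)
padded = causal_mask_rows([2, 3], target_length=5, padding=[0, 1, 1, 1, 1])
```

The real method additionally broadcasts the result to shape `(batch_size, 1, query_length, key_value_length)` and returns a 4D input mask unchanged.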
|
||||
|
||||
|
||||
class KwargsForCausalLM(FlashAttentionKwargs): ...
|
||||
|
||||
|
||||
class OpensciForCausalLM(OpensciPreTrainedModel, GenerationMixin):
|
||||
_tied_weights_keys = ["lm_head.weight"]
|
||||
_tp_plan = {"lm_head": "colwise_rep"}
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
self.model = OpensciModel(config)
|
||||
self.vocab_size = config.vocab_size
|
||||
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
|
||||
|
||||
# Initialize weights and apply final processing
|
||||
self.post_init()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.model.embed_tokens
|
||||
|
||||
def set_input_embeddings(self, value):
|
||||
self.model.embed_tokens = value
|
||||
|
||||
def get_output_embeddings(self):
|
||||
return self.lm_head
|
||||
|
||||
def set_output_embeddings(self, new_embeddings):
|
||||
self.lm_head = new_embeddings
|
||||
|
||||
def set_decoder(self, decoder):
|
||||
self.model = decoder
|
||||
|
||||
def get_decoder(self):
|
||||
return self.model
|
||||
|
||||
@deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
|
||||
@add_start_docstrings_to_model_forward(Opensci_INPUTS_DOCSTRING)
|
||||
@replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
|
||||
def forward(
|
||||
self,
|
||||
input_ids: torch.LongTensor = None,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_ids: Optional[torch.LongTensor] = None,
|
||||
past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
labels: Optional[torch.LongTensor] = None,
|
||||
use_cache: Optional[bool] = None,
|
||||
output_attentions: Optional[bool] = None,
|
||||
output_hidden_states: Optional[bool] = None,
|
||||
return_dict: Optional[bool] = None,
|
||||
cache_position: Optional[torch.LongTensor] = None,
|
||||
logits_to_keep: Union[int, torch.Tensor] = 0,
|
||||
**kwargs: Unpack[KwargsForCausalLM],
|
||||
) -> Union[Tuple, CausalLMOutputWithPast]:
|
||||
r"""
|
||||
Args:
|
||||
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
||||
Labels for computing the language modeling loss. Indices should either be in `[0, ...,
config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
|
||||
|
||||
logits_to_keep (`int` or `torch.Tensor`, *optional*):
|
||||
If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
|
||||
`input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
|
||||
token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
|
||||
If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
|
||||
This is useful when using packed tensor format (single dimension for batch and sequence length).
|
||||
|
||||
Returns:
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoTokenizer, OpensciForCausalLM
|
||||
|
||||
>>> model = OpensciForCausalLM.from_pretrained("meta-Opensci/Opensci-2-7b-hf")
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("meta-Opensci/Opensci-2-7b-hf")
|
||||
|
||||
>>> prompt = "Hey, are you conscious? Can you talk to me?"
|
||||
>>> inputs = tokenizer(prompt, return_tensors="pt")
|
||||
|
||||
>>> # Generate
|
||||
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
|
||||
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
||||
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
|
||||
```"""
|
||||
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
||||
output_hidden_states = (
|
||||
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
||||
)
|
||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||
|
||||
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
|
||||
outputs = self.model(
|
||||
input_ids=input_ids,
|
||||
attention_mask=attention_mask,
|
||||
position_ids=position_ids,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
cache_position=cache_position,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
hidden_states = outputs[0]
|
||||
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
|
||||
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
|
||||
logits = self.lm_head(hidden_states[:, slice_indices, :])
|
||||
|
||||
loss = None
|
||||
if labels is not None:
|
||||
loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
|
||||
|
||||
if not return_dict:
|
||||
output = (logits,) + outputs[1:]
|
||||
return (loss,) + output if loss is not None else output
|
||||
|
||||
return CausalLMOutputWithPast(
|
||||
loss=loss,
|
||||
logits=logits,
|
||||
past_key_values=outputs.past_key_values,
|
||||
hidden_states=outputs.hidden_states,
|
||||
attentions=outputs.attentions,
|
||||
)
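The `logits_to_keep` handling in `OpensciForCausalLM.forward` relies on a small slicing detail: `slice(-0, None)` is the same as `slice(0, None)`, so the value `0` selects every position. The sketch below shows this on a plain list rather than a hidden-state tensor. The function name `select_positions` is hypothetical and only mirrors the indexing step; it is not part of the model code.

```python
def select_positions(hidden, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed.

    `hidden` is a plain list standing in for the sequence dimension
    (an assumption made to keep the sketch dependency-free).
    """
    if isinstance(logits_to_keep, int):
        # -0 == 0, so logits_to_keep=0 yields slice(0, None): keep everything.
        return hidden[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices to keep.
    return [hidden[i] for i in logits_to_keep]


tokens = ["a", "b", "c", "d", "e"]
all_positions = select_positions(tokens)          # every position
last_position = select_positions(tokens, 1)       # only the final position
picked = select_positions(tokens, [0, 2])         # explicit indices
```

During generation only the last position is needed, which is why keeping a single position can save memory on long sequences.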
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
The Opensci Model transformer with a sequence classification head on top (linear layer).
|
||||
|
||||
[`OpensciForSequenceClassification`] uses the last token in order to do the classification, as other causal models
|
||||
(e.g. GPT-2) do.
|
||||
|
||||
Since it does classification on the last token, it needs to know the position of the last token. If a
|
||||
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
|
||||
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
|
||||
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
|
||||
each row of the batch).
|
||||
""",
|
||||
Opensci_START_DOCSTRING,
|
||||
)
|
||||
class OpensciForSequenceClassification(OpensciPreTrainedModel):
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
self.num_labels = config.num_labels
|
||||
self.model = OpensciModel(config)
|
||||
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
|
||||
|
||||
# Initialize weights and apply final processing
|
||||
self.post_init()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.model.embed_tokens
|
||||
|
||||
def set_input_embeddings(self, value):
|
||||
self.model.embed_tokens = value
|
||||
|
||||
@add_start_docstrings_to_model_forward(Opensci_INPUTS_DOCSTRING)
|
||||
def forward(
|
||||
self,
|
||||
input_ids: Optional[torch.LongTensor] = None,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_ids: Optional[torch.LongTensor] = None,
|
||||
past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
labels: Optional[torch.LongTensor] = None,
|
||||
use_cache: Optional[bool] = None,
|
||||
output_attentions: Optional[bool] = None,
|
||||
output_hidden_states: Optional[bool] = None,
|
||||
return_dict: Optional[bool] = None,
|
||||
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
|
||||
r"""
|
||||
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
||||
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
|
||||
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
|
||||
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
|
||||
"""
|
||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||
|
||||
transformer_outputs = self.model(
|
||||
input_ids,
|
||||
attention_mask=attention_mask,
|
||||
position_ids=position_ids,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
)
|
||||
hidden_states = transformer_outputs[0]
|
||||
logits = self.score(hidden_states)
|
||||
|
||||
if input_ids is not None:
|
||||
batch_size = input_ids.shape[0]
|
||||
else:
|
||||
batch_size = inputs_embeds.shape[0]
|
||||
|
||||
if self.config.pad_token_id is None and batch_size != 1:
|
||||
raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
|
||||
if self.config.pad_token_id is None:
|
||||
last_non_pad_token = -1
|
||||
elif input_ids is not None:
|
||||
# To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
|
||||
non_pad_mask = (input_ids != self.config.pad_token_id).to(logits.device, torch.int32)
|
||||
token_indices = torch.arange(input_ids.shape[-1], device=logits.device)
|
||||
last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
|
||||
else:
|
||||
last_non_pad_token = -1
|
||||
logger.warning_once(
|
||||
f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
|
||||
"unexpected if using padding tokens in conjunction with `inputs_embeds.`"
|
||||
)
|
||||
|
||||
pooled_logits = logits[torch.arange(batch_size, device=logits.device), last_non_pad_token]
|
||||
|
||||
loss = None
|
||||
if labels is not None:
|
||||
loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
|
||||
|
||||
if not return_dict:
|
||||
output = (pooled_logits,) + transformer_outputs[1:]
|
||||
return ((loss,) + output) if loss is not None else output
|
||||
|
||||
return SequenceClassifierOutputWithPast(
|
||||
loss=loss,
|
||||
logits=pooled_logits,
|
||||
past_key_values=transformer_outputs.past_key_values,
|
||||
hidden_states=transformer_outputs.hidden_states,
|
||||
attentions=transformer_outputs.attentions,
|
||||
)
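The last-token lookup in `OpensciForSequenceClassification.forward` multiplies each position index by a 0/1 non-padding flag and takes the argmax, which yields the rightmost non-padding position for both left- and right-padded rows. The following pure-Python sketch reproduces that arithmetic for a single row; `last_non_pad_index` is a hypothetical name and lists stand in for tensors.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return the index of the rightmost token that is not padding.

    Mirrors `(token_indices * non_pad_mask).argmax(-1)` for one row:
    padded positions score 0, real tokens score their own index.
    """
    scores = [i * int(tok != pad_token_id) for i, tok in enumerate(input_ids)]
    # max() returns the first maximal index on ties; this sketch assumes the
    # tensor argmax resolves ties the same way.
    return max(range(len(scores)), key=scores.__getitem__)


right_padded = last_non_pad_index([5, 6, 7, 0, 0], pad_token_id=0)
left_padded = last_non_pad_index([0, 0, 5, 6], pad_token_id=0)
single_token = last_non_pad_index([5, 0, 0], pad_token_id=0)
```

When only position 0 holds a real token, every score is 0 and the result is index 0, which is still the correct position.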
|
||||
24
special_tokens_map.json
Normal file
@@ -0,0 +1,24 @@
|
||||
{
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<end_of_turn>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<end_of_turn>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
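Entries in `special_tokens_map.json` come in two shapes: an object with a `content` field, or a bare string (as with `pad_token`). A minimal sketch of normalizing both shapes with the standard `json` module follows. The helper `token_text` and the inlined JSON literal are illustrative assumptions; the values are copied from the file above.

```python
import json

# Inlined copy of the file contents shown above, so the sketch is self-contained.
SPECIAL_TOKENS_JSON = """
{
  "bos_token": {"content": "<|endoftext|>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "eos_token": {"content": "<end_of_turn>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "pad_token": "<end_of_turn>",
  "unk_token": {"content": "<|endoftext|>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false}
}
"""


def token_text(entry):
    """Return the token string whether the entry is a dict or a bare string."""
    return entry["content"] if isinstance(entry, dict) else entry


special_tokens = {
    name: token_text(entry) for name, entry in json.loads(SPECIAL_TOKENS_JSON).items()
}
```

Normalized this way, the map shows that the padding token and the end-of-sequence token share the same string, `<end_of_turn>`.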
|
||||
250586
tokenizer.json
Normal file
File diff suppressed because it is too large
225
tokenizer_config.json
Normal file
@@ -0,0 +1,225 @@
{
  "add_bos_token": false,
  "add_eos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|padding|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "50254": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50255": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50256": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50257": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50258": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50259": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50260": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50261": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50262": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50263": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50264": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50265": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50266": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50267": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50268": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50269": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50270": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50271": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50272": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50273": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50274": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50275": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50276": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50277": {
      "content": "<end_of_turn>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<end_of_turn>",
  "extra_special_tokens": {},
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<end_of_turn>",
  "padding_side": "right",
  "split_special_tokens": false,
  "tokenizer_class": "GPTNeoXTokenizer",
  "unk_token": "<|endoftext|>"
}
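
Note on the tokenizer_config.json values above: the very large "model_max_length" equals int(1e30) in Python, which (to the best of our knowledge) is the placeholder the transformers library writes when no explicit length limit is configured. The sketch below is only an illustration, not part of the commit. It parses a hand-copied excerpt of the fields shown above and derives three facts: padding reuses the end-of-turn token, no length limit is set, and the end-of-turn token has id 50277. The names NO_LIMIT_SENTINEL and summarize are hypothetical helpers introduced for this example.

```python
import json

# Hand-copied excerpt of tokenizer_config.json (not the full file).
config = json.loads("""
{
  "eos_token": "<end_of_turn>",
  "pad_token": "<end_of_turn>",
  "model_max_length": 1000000000000000019884624838656,
  "added_tokens_decoder": {
    "0": {"content": "<|endoftext|>", "special": true},
    "50277": {"content": "<end_of_turn>", "special": true}
  }
}
""")

# Assumption: int(1e30) is the "no limit configured" placeholder value.
NO_LIMIT_SENTINEL = int(1e30)


def summarize(cfg):
    """Return a few derived facts about the tokenizer configuration."""
    eos_ids = [
        int(token_id)
        for token_id, token in cfg["added_tokens_decoder"].items()
        if token["content"] == cfg["eos_token"]
    ]
    return {
        "pad_is_eos": cfg["pad_token"] == cfg["eos_token"],
        "has_length_limit": cfg["model_max_length"] != NO_LIMIT_SENTINEL,
        "eos_id": eos_ids[0] if eos_ids else None,
    }


summary = summarize(config)
print(summary)
```

Because the pad token and the end-of-turn token are the same here, downstream code that masks padding by token id would also mask genuine end-of-turn tokens unless it relies on the attention mask instead.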
8
train_results.json
Normal file
@@ -0,0 +1,8 @@
{
  "epoch": 1.0,
  "total_flos": 889480391426048.0,
  "train_loss": 0.9303844255645997,
  "train_runtime": 18614.4314,
  "train_samples_per_second": 2.79,
  "train_steps_per_second": 0.087
}
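
The figures in train_results.json can be cross-checked against each other. The sketch below is a rough illustration, not part of the commit. It uses the values copied from the file plus the "total_steps" value of 1623 reported in trainer_log.jsonl, and derives the wall-clock time in hours, the approximate number of samples consumed per optimizer step, and the step rate. The function name derive is a hypothetical helper, and reading "samples per step" as the effective batch size is an assumption, since the inputs are rounded.

```python
# Values copied from train_results.json.
results = {
    "train_runtime": 18614.4314,        # seconds
    "train_samples_per_second": 2.79,
    "train_steps_per_second": 0.087,
}
# "total_steps" as reported in trainer_log.jsonl.
total_steps = 1623


def derive(res, steps):
    """Derive runtime in hours, samples per step, and steps per second."""
    hours = res["train_runtime"] / 3600.0
    # Ratio of two rounded rates, so this is only an estimate.
    samples_per_step = res["train_samples_per_second"] / res["train_steps_per_second"]
    # Recomputed from the step count to compare with the reported rate.
    steps_per_second = steps / res["train_runtime"]
    return hours, samples_per_step, steps_per_second


hours, samples_per_step, steps_per_second = derive(results, total_steps)
print(f"{hours:.2f} h, ~{samples_per_step:.0f} samples/step, {steps_per_second:.3f} steps/s")
```

The recomputed step rate agrees with the reported 0.087 steps per second, and the ratio of the two reported rates suggests roughly 32 samples per optimizer step.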
325
trainer_log.jsonl
Normal file
@@ -0,0 +1,325 @@
{"current_steps": 5, "total_steps": 1623, "loss": 1.1518, "lr": 9.756097560975611e-06, "epoch": 0.0030807147258163892, "percentage": 0.31, "elapsed_time": "0:01:05", "remaining_time": "5:53:08"}
{"current_steps": 10, "total_steps": 1623, "loss": 1.2475, "lr": 2.1951219512195124e-05, "epoch": 0.0061614294516327784, "percentage": 0.62, "elapsed_time": "0:02:03", "remaining_time": "5:31:49"}
{"current_steps": 15, "total_steps": 1623, "loss": 1.2543, "lr": 3.414634146341464e-05, "epoch": 0.009242144177449169, "percentage": 0.92, "elapsed_time": "0:03:03", "remaining_time": "5:26:59"}
{"current_steps": 20, "total_steps": 1623, "loss": 1.2007, "lr": 4.634146341463415e-05, "epoch": 0.012322858903265557, "percentage": 1.23, "elapsed_time": "0:04:00", "remaining_time": "5:20:42"}
{"current_steps": 25, "total_steps": 1623, "loss": 1.0335, "lr": 5.853658536585366e-05, "epoch": 0.015403573629081947, "percentage": 1.54, "elapsed_time": "0:04:38", "remaining_time": "4:56:40"}
{"current_steps": 30, "total_steps": 1623, "loss": 1.0887, "lr": 7.073170731707317e-05, "epoch": 0.018484288354898338, "percentage": 1.85, "elapsed_time": "0:05:34", "remaining_time": "4:55:56"}
{"current_steps": 35, "total_steps": 1623, "loss": 1.1421, "lr": 8.292682926829268e-05, "epoch": 0.021565003080714726, "percentage": 2.16, "elapsed_time": "0:06:33", "remaining_time": "4:57:35"}
{"current_steps": 40, "total_steps": 1623, "loss": 1.1532, "lr": 9.51219512195122e-05, "epoch": 0.024645717806531114, "percentage": 2.46, "elapsed_time": "0:07:31", "remaining_time": "4:57:46"}
{"current_steps": 45, "total_steps": 1623, "loss": 1.1098, "lr": 0.00010731707317073172, "epoch": 0.027726432532347505, "percentage": 2.77, "elapsed_time": "0:08:27", "remaining_time": "4:56:23"}
{"current_steps": 50, "total_steps": 1623, "loss": 0.9678, "lr": 0.00011951219512195122, "epoch": 0.030807147258163893, "percentage": 3.08, "elapsed_time": "0:09:07", "remaining_time": "4:47:16"}
{"current_steps": 55, "total_steps": 1623, "loss": 1.0811, "lr": 0.00013170731707317076, "epoch": 0.033887861983980284, "percentage": 3.39, "elapsed_time": "0:10:10", "remaining_time": "4:50:08"}
{"current_steps": 60, "total_steps": 1623, "loss": 1.1093, "lr": 0.00014390243902439025, "epoch": 0.036968576709796676, "percentage": 3.7, "elapsed_time": "0:11:13", "remaining_time": "4:52:14"}
{"current_steps": 65, "total_steps": 1623, "loss": 1.0733, "lr": 0.00015609756097560978, "epoch": 0.04004929143561306, "percentage": 4.0, "elapsed_time": "0:12:13", "remaining_time": "4:52:56"}
{"current_steps": 70, "total_steps": 1623, "loss": 1.1062, "lr": 0.00016829268292682927, "epoch": 0.04313000616142945, "percentage": 4.31, "elapsed_time": "0:13:09", "remaining_time": "4:51:50"}
{"current_steps": 75, "total_steps": 1623, "loss": 0.9404, "lr": 0.0001804878048780488, "epoch": 0.04621072088724584, "percentage": 4.62, "elapsed_time": "0:13:47", "remaining_time": "4:44:46"}
{"current_steps": 80, "total_steps": 1623, "loss": 0.9997, "lr": 0.0001926829268292683, "epoch": 0.04929143561306223, "percentage": 4.93, "elapsed_time": "0:14:41", "remaining_time": "4:43:21"}
{"current_steps": 85, "total_steps": 1623, "loss": 1.038, "lr": 0.0001999991687649223, "epoch": 0.05237215033887862, "percentage": 5.24, "elapsed_time": "0:15:37", "remaining_time": "4:42:50"}
{"current_steps": 90, "total_steps": 1623, "loss": 1.1127, "lr": 0.00019998981752900036, "epoch": 0.05545286506469501, "percentage": 5.55, "elapsed_time": "0:16:36", "remaining_time": "4:42:50"}
{"current_steps": 95, "total_steps": 1623, "loss": 1.1212, "lr": 0.00019997007698817557, "epoch": 0.0585335797905114, "percentage": 5.85, "elapsed_time": "0:17:34", "remaining_time": "4:42:42"}
{"current_steps": 100, "total_steps": 1623, "loss": 0.8807, "lr": 0.00019993994919356167, "epoch": 0.061614294516327786, "percentage": 6.16, "elapsed_time": "0:18:16", "remaining_time": "4:38:12"}
{"current_steps": 105, "total_steps": 1623, "loss": 0.9791, "lr": 0.00019989943727554598, "epoch": 0.06469500924214418, "percentage": 6.47, "elapsed_time": "0:19:16", "remaining_time": "4:38:38"}
{"current_steps": 110, "total_steps": 1623, "loss": 1.0313, "lr": 0.00019984854544346367, "epoch": 0.06777572396796057, "percentage": 6.78, "elapsed_time": "0:20:22", "remaining_time": "4:40:18"}
{"current_steps": 115, "total_steps": 1623, "loss": 1.073, "lr": 0.00019978727898516086, "epoch": 0.07085643869377696, "percentage": 7.09, "elapsed_time": "0:21:18", "remaining_time": "4:39:20"}
{"current_steps": 120, "total_steps": 1623, "loss": 0.9766, "lr": 0.0001997156442664449, "epoch": 0.07393715341959335, "percentage": 7.39, "elapsed_time": "0:22:18", "remaining_time": "4:39:30"}
{"current_steps": 125, "total_steps": 1623, "loss": 0.8606, "lr": 0.00019963364873042298, "epoch": 0.07701786814540973, "percentage": 7.7, "elapsed_time": "0:23:00", "remaining_time": "4:35:48"}
{"current_steps": 130, "total_steps": 1623, "loss": 1.0665, "lr": 0.0001995413008967289, "epoch": 0.08009858287122612, "percentage": 8.01, "elapsed_time": "0:24:02", "remaining_time": "4:36:02"}
{"current_steps": 135, "total_steps": 1623, "loss": 1.0262, "lr": 0.00019943861036063768, "epoch": 0.08317929759704251, "percentage": 8.32, "elapsed_time": "0:25:03", "remaining_time": "4:36:09"}
{"current_steps": 140, "total_steps": 1623, "loss": 1.0675, "lr": 0.00019932558779206874, "epoch": 0.0862600123228589, "percentage": 8.63, "elapsed_time": "0:26:02", "remaining_time": "4:35:55"}
{"current_steps": 145, "total_steps": 1623, "loss": 1.069, "lr": 0.00019920224493447702, "epoch": 0.0893407270486753, "percentage": 8.93, "elapsed_time": "0:27:02", "remaining_time": "4:35:41"}
{"current_steps": 150, "total_steps": 1623, "loss": 0.8611, "lr": 0.00019906859460363307, "epoch": 0.09242144177449169, "percentage": 9.24, "elapsed_time": "0:27:43", "remaining_time": "4:32:16"}
{"current_steps": 155, "total_steps": 1623, "loss": 0.9906, "lr": 0.00019892465068629131, "epoch": 0.09550215650030808, "percentage": 9.55, "elapsed_time": "0:28:39", "remaining_time": "4:31:20"}
{"current_steps": 160, "total_steps": 1623, "loss": 1.1415, "lr": 0.0001987704281387471, "epoch": 0.09858287122612445, "percentage": 9.86, "elapsed_time": "0:29:39", "remaining_time": "4:31:12"}
{"current_steps": 165, "total_steps": 1623, "loss": 1.1192, "lr": 0.00019860594298528282, "epoch": 0.10166358595194085, "percentage": 10.17, "elapsed_time": "0:30:40", "remaining_time": "4:31:07"}
{"current_steps": 170, "total_steps": 1623, "loss": 1.1348, "lr": 0.0001984312123165028, "epoch": 0.10474430067775724, "percentage": 10.47, "elapsed_time": "0:31:37", "remaining_time": "4:30:14"}
{"current_steps": 175, "total_steps": 1623, "loss": 0.8411, "lr": 0.0001982462542875576, "epoch": 0.10782501540357363, "percentage": 10.78, "elapsed_time": "0:32:18", "remaining_time": "4:27:23"}
{"current_steps": 180, "total_steps": 1623, "loss": 1.0013, "lr": 0.00019805108811625773, "epoch": 0.11090573012939002, "percentage": 11.09, "elapsed_time": "0:33:22", "remaining_time": "4:27:35"}
{"current_steps": 185, "total_steps": 1623, "loss": 0.9905, "lr": 0.00019784573408107657, "epoch": 0.11398644485520641, "percentage": 11.4, "elapsed_time": "0:34:29", "remaining_time": "4:28:07"}
{"current_steps": 190, "total_steps": 1623, "loss": 0.9899, "lr": 0.00019763021351904358, "epoch": 0.1170671595810228, "percentage": 11.71, "elapsed_time": "0:35:27", "remaining_time": "4:27:23"}
{"current_steps": 195, "total_steps": 1623, "loss": 1.0253, "lr": 0.00019740454882352732, "epoch": 0.12014787430683918, "percentage": 12.01, "elapsed_time": "0:36:29", "remaining_time": "4:27:13"}
{"current_steps": 200, "total_steps": 1623, "loss": 0.8803, "lr": 0.0001971687634419086, "epoch": 0.12322858903265557, "percentage": 12.32, "elapsed_time": "0:37:10", "remaining_time": "4:24:33"}
{"current_steps": 205, "total_steps": 1623, "loss": 0.9944, "lr": 0.0001969228818731442, "epoch": 0.12630930375847196, "percentage": 12.63, "elapsed_time": "0:38:12", "remaining_time": "4:24:16"}
{"current_steps": 210, "total_steps": 1623, "loss": 1.0408, "lr": 0.00019666692966522145, "epoch": 0.12939001848428835, "percentage": 12.94, "elapsed_time": "0:39:18", "remaining_time": "4:24:29"}
{"current_steps": 215, "total_steps": 1623, "loss": 0.9837, "lr": 0.00019640093341250357, "epoch": 0.13247073321010475, "percentage": 13.25, "elapsed_time": "0:40:13", "remaining_time": "4:23:24"}
{"current_steps": 220, "total_steps": 1623, "loss": 1.0473, "lr": 0.0001961249207529665, "epoch": 0.13555144793592114, "percentage": 13.56, "elapsed_time": "0:41:09", "remaining_time": "4:22:29"}
{"current_steps": 225, "total_steps": 1623, "loss": 0.872, "lr": 0.00019583892036532726, "epoch": 0.13863216266173753, "percentage": 13.86, "elapsed_time": "0:41:55", "remaining_time": "4:20:29"}
{"current_steps": 230, "total_steps": 1623, "loss": 0.9982, "lr": 0.00019554296196606395, "epoch": 0.14171287738755392, "percentage": 14.17, "elapsed_time": "0:42:55", "remaining_time": "4:19:59"}
{"current_steps": 235, "total_steps": 1623, "loss": 0.9808, "lr": 0.00019523707630632835, "epoch": 0.1447935921133703, "percentage": 14.48, "elapsed_time": "0:44:01", "remaining_time": "4:19:59"}
{"current_steps": 240, "total_steps": 1623, "loss": 1.045, "lr": 0.00019492129516875055, "epoch": 0.1478743068391867, "percentage": 14.79, "elapsed_time": "0:45:03", "remaining_time": "4:19:40"}
{"current_steps": 245, "total_steps": 1623, "loss": 1.0869, "lr": 0.00019459565136413666, "epoch": 0.15095502156500307, "percentage": 15.1, "elapsed_time": "0:46:02", "remaining_time": "4:18:58"}
{"current_steps": 250, "total_steps": 1623, "loss": 0.8503, "lr": 0.0001942601787280598, "epoch": 0.15403573629081946, "percentage": 15.4, "elapsed_time": "0:46:45", "remaining_time": "4:16:49"}
{"current_steps": 255, "total_steps": 1623, "loss": 0.9952, "lr": 0.00019391491211734425, "epoch": 0.15711645101663585, "percentage": 15.71, "elapsed_time": "0:47:49", "remaining_time": "4:16:35"}
{"current_steps": 260, "total_steps": 1623, "loss": 0.9793, "lr": 0.0001935598874064438, "epoch": 0.16019716574245224, "percentage": 16.02, "elapsed_time": "0:48:47", "remaining_time": "4:15:48"}
{"current_steps": 265, "total_steps": 1623, "loss": 0.9427, "lr": 0.00019319514148371435, "epoch": 0.16327788046826863, "percentage": 16.33, "elapsed_time": "0:49:42", "remaining_time": "4:14:44"}
{"current_steps": 270, "total_steps": 1623, "loss": 1.0259, "lr": 0.00019282071224758091, "epoch": 0.16635859519408502, "percentage": 16.64, "elapsed_time": "0:50:44", "remaining_time": "4:14:13"}
{"current_steps": 275, "total_steps": 1623, "loss": 0.8559, "lr": 0.00019243663860259993, "epoch": 0.16943930991990142, "percentage": 16.94, "elapsed_time": "0:51:26", "remaining_time": "4:12:10"}
{"current_steps": 280, "total_steps": 1623, "loss": 0.9995, "lr": 0.00019204296045541685, "epoch": 0.1725200246457178, "percentage": 17.25, "elapsed_time": "0:52:30", "remaining_time": "4:11:49"}
{"current_steps": 285, "total_steps": 1623, "loss": 0.9409, "lr": 0.0001916397187106199, "epoch": 0.1756007393715342, "percentage": 17.56, "elapsed_time": "0:53:29", "remaining_time": "4:11:09"}
{"current_steps": 290, "total_steps": 1623, "loss": 0.9847, "lr": 0.00019122695526648968, "epoch": 0.1786814540973506, "percentage": 17.87, "elapsed_time": "0:54:31", "remaining_time": "4:10:38"}
{"current_steps": 295, "total_steps": 1623, "loss": 1.0982, "lr": 0.00019080471301064598, "epoch": 0.18176216882316698, "percentage": 18.18, "elapsed_time": "0:55:37", "remaining_time": "4:10:24"}
{"current_steps": 300, "total_steps": 1623, "loss": 0.8299, "lr": 0.00019037303581559143, "epoch": 0.18484288354898337, "percentage": 18.48, "elapsed_time": "0:56:19", "remaining_time": "4:08:22"}
{"current_steps": 305, "total_steps": 1623, "loss": 0.9737, "lr": 0.00018993196853415317, "epoch": 0.18792359827479976, "percentage": 18.79, "elapsed_time": "0:57:17", "remaining_time": "4:07:34"}
{"current_steps": 310, "total_steps": 1623, "loss": 0.9551, "lr": 0.00018948155699482244, "epoch": 0.19100431300061615, "percentage": 19.1, "elapsed_time": "0:58:16", "remaining_time": "4:06:51"}
{"current_steps": 315, "total_steps": 1623, "loss": 1.057, "lr": 0.00018902184799699263, "epoch": 0.19408502772643252, "percentage": 19.41, "elapsed_time": "0:59:13", "remaining_time": "4:05:55"}
{"current_steps": 320, "total_steps": 1623, "loss": 1.0065, "lr": 0.00018855288930609692, "epoch": 0.1971657424522489, "percentage": 19.72, "elapsed_time": "1:00:12", "remaining_time": "4:05:09"}
{"current_steps": 325, "total_steps": 1623, "loss": 0.8492, "lr": 0.00018807472964864515, "epoch": 0.2002464571780653, "percentage": 20.02, "elapsed_time": "1:00:52", "remaining_time": "4:03:05"}
{"current_steps": 330, "total_steps": 1623, "loss": 1.0248, "lr": 0.00018758741870716092, "epoch": 0.2033271719038817, "percentage": 20.33, "elapsed_time": "1:01:50", "remaining_time": "4:02:17"}
{"current_steps": 335, "total_steps": 1623, "loss": 1.0095, "lr": 0.00018709100711501955, "epoch": 0.20640788662969808, "percentage": 20.64, "elapsed_time": "1:02:57", "remaining_time": "4:02:03"}
{"current_steps": 340, "total_steps": 1623, "loss": 0.9469, "lr": 0.0001865855464511869, "epoch": 0.20948860135551448, "percentage": 20.95, "elapsed_time": "1:03:56", "remaining_time": "4:01:15"}
{"current_steps": 345, "total_steps": 1623, "loss": 0.8772, "lr": 0.00018607108923486025, "epoch": 0.21256931608133087, "percentage": 21.26, "elapsed_time": "1:04:57", "remaining_time": "4:00:39"}
{"current_steps": 350, "total_steps": 1623, "loss": 0.8309, "lr": 0.00018554768892001136, "epoch": 0.21565003080714726, "percentage": 21.57, "elapsed_time": "1:05:38", "remaining_time": "3:58:43"}
{"current_steps": 355, "total_steps": 1623, "loss": 0.8526, "lr": 0.00018501539988983234, "epoch": 0.21873074553296365, "percentage": 21.87, "elapsed_time": "1:06:36", "remaining_time": "3:57:56"}
{"current_steps": 360, "total_steps": 1623, "loss": 0.9808, "lr": 0.0001844742774510851, "epoch": 0.22181146025878004, "percentage": 22.18, "elapsed_time": "1:07:44", "remaining_time": "3:57:39"}
{"current_steps": 365, "total_steps": 1623, "loss": 0.9952, "lr": 0.00018392437782835475, "epoch": 0.22489217498459643, "percentage": 22.49, "elapsed_time": "1:08:44", "remaining_time": "3:56:54"}
{"current_steps": 370, "total_steps": 1623, "loss": 1.011, "lr": 0.00018336575815820766, "epoch": 0.22797288971041282, "percentage": 22.8, "elapsed_time": "1:09:42", "remaining_time": "3:56:03"}
{"current_steps": 375, "total_steps": 1623, "loss": 0.8487, "lr": 0.00018279847648325478, "epoch": 0.23105360443622922, "percentage": 23.11, "elapsed_time": "1:10:25", "remaining_time": "3:54:23"}
{"current_steps": 380, "total_steps": 1623, "loss": 0.9038, "lr": 0.0001822225917461208, "epoch": 0.2341343191620456, "percentage": 23.41, "elapsed_time": "1:11:25", "remaining_time": "3:53:37"}
{"current_steps": 385, "total_steps": 1623, "loss": 0.9601, "lr": 0.0001816381637833198, "epoch": 0.23721503388786197, "percentage": 23.72, "elapsed_time": "1:12:29", "remaining_time": "3:53:07"}
{"current_steps": 390, "total_steps": 1623, "loss": 1.0495, "lr": 0.00018104525331903799, "epoch": 0.24029574861367836, "percentage": 24.03, "elapsed_time": "1:13:34", "remaining_time": "3:52:36"}
{"current_steps": 395, "total_steps": 1623, "loss": 1.0792, "lr": 0.00018044392195882427, "epoch": 0.24337646333949475, "percentage": 24.34, "elapsed_time": "1:14:33", "remaining_time": "3:51:46"}
{"current_steps": 400, "total_steps": 1623, "loss": 0.8639, "lr": 0.00017983423218318918, "epoch": 0.24645717806531114, "percentage": 24.65, "elapsed_time": "1:15:15", "remaining_time": "3:50:06"}
{"current_steps": 405, "total_steps": 1623, "loss": 0.9426, "lr": 0.00017921624734111292, "epoch": 0.24953789279112754, "percentage": 24.95, "elapsed_time": "1:16:14", "remaining_time": "3:49:18"}
{"current_steps": 410, "total_steps": 1623, "loss": 0.9744, "lr": 0.00017859003164346336, "epoch": 0.2526186075169439, "percentage": 25.26, "elapsed_time": "1:17:21", "remaining_time": "3:48:52"}
{"current_steps": 415, "total_steps": 1623, "loss": 0.9612, "lr": 0.0001779556501563239, "epoch": 0.2556993222427603, "percentage": 25.57, "elapsed_time": "1:18:24", "remaining_time": "3:48:14"}
{"current_steps": 420, "total_steps": 1623, "loss": 1.034, "lr": 0.00017731316879423327, "epoch": 0.2587800369685767, "percentage": 25.88, "elapsed_time": "1:19:22", "remaining_time": "3:47:20"}
{"current_steps": 425, "total_steps": 1623, "loss": 0.8632, "lr": 0.00017666265431333654, "epoch": 0.2618607516943931, "percentage": 26.19, "elapsed_time": "1:20:05", "remaining_time": "3:45:45"}
{"current_steps": 430, "total_steps": 1623, "loss": 0.9842, "lr": 0.000176004174304449, "epoch": 0.2649414664202095, "percentage": 26.49, "elapsed_time": "1:21:05", "remaining_time": "3:44:57"}
{"current_steps": 435, "total_steps": 1623, "loss": 0.9874, "lr": 0.00017533779718603313, "epoch": 0.2680221811460259, "percentage": 26.8, "elapsed_time": "1:22:13", "remaining_time": "3:44:32"}
{"current_steps": 440, "total_steps": 1623, "loss": 0.9457, "lr": 0.00017466359219708985, "epoch": 0.2711028958718423, "percentage": 27.11, "elapsed_time": "1:23:09", "remaining_time": "3:43:35"}
{"current_steps": 445, "total_steps": 1623, "loss": 0.9501, "lr": 0.00017398162938996422, "epoch": 0.27418361059765867, "percentage": 27.42, "elapsed_time": "1:24:09", "remaining_time": "3:42:46"}
{"current_steps": 450, "total_steps": 1623, "loss": 0.8123, "lr": 0.00017329197962306664, "epoch": 0.27726432532347506, "percentage": 27.73, "elapsed_time": "1:24:48", "remaining_time": "3:41:03"}
{"current_steps": 455, "total_steps": 1623, "loss": 0.9048, "lr": 0.00017259471455351072, "epoch": 0.28034504004929145, "percentage": 28.03, "elapsed_time": "1:25:47", "remaining_time": "3:40:12"}
{"current_steps": 460, "total_steps": 1623, "loss": 0.9711, "lr": 0.0001718899066296675, "epoch": 0.28342575477510784, "percentage": 28.34, "elapsed_time": "1:26:50", "remaining_time": "3:39:32"}
{"current_steps": 465, "total_steps": 1623, "loss": 0.9762, "lr": 0.000171177629083638, "epoch": 0.28650646950092423, "percentage": 28.65, "elapsed_time": "1:27:54", "remaining_time": "3:38:54"}
{"current_steps": 470, "total_steps": 1623, "loss": 1.0148, "lr": 0.0001704579559236441, "epoch": 0.2895871842267406, "percentage": 28.96, "elapsed_time": "1:28:57", "remaining_time": "3:38:12"}
{"current_steps": 475, "total_steps": 1623, "loss": 0.786, "lr": 0.00016973096192633884, "epoch": 0.292667898952557, "percentage": 29.27, "elapsed_time": "1:29:38", "remaining_time": "3:36:39"}
{"current_steps": 480, "total_steps": 1623, "loss": 0.9034, "lr": 0.00016899672262903677, "epoch": 0.2957486136783734, "percentage": 29.57, "elapsed_time": "1:30:38", "remaining_time": "3:35:50"}
{"current_steps": 485, "total_steps": 1623, "loss": 0.9694, "lr": 0.00016825531432186543, "epoch": 0.2988293284041898, "percentage": 29.88, "elapsed_time": "1:31:36", "remaining_time": "3:34:57"}
{"current_steps": 490, "total_steps": 1623, "loss": 1.0684, "lr": 0.00016750681403983846, "epoch": 0.30191004313000613, "percentage": 30.19, "elapsed_time": "1:32:34", "remaining_time": "3:34:03"}
{"current_steps": 495, "total_steps": 1623, "loss": 0.9935, "lr": 0.00016675129955485152, "epoch": 0.3049907578558225, "percentage": 30.5, "elapsed_time": "1:33:32", "remaining_time": "3:33:08"}
{"current_steps": 500, "total_steps": 1623, "loss": 0.8232, "lr": 0.00016598884936760131, "epoch": 0.3080714725816389, "percentage": 30.81, "elapsed_time": "1:34:19", "remaining_time": "3:31:50"}
{"current_steps": 505, "total_steps": 1623, "loss": 0.989, "lr": 0.00016521954269942918, "epoch": 0.3111521873074553, "percentage": 31.12, "elapsed_time": "1:35:30", "remaining_time": "3:31:27"}
{"current_steps": 510, "total_steps": 1623, "loss": 0.9521, "lr": 0.00016444345948408984, "epoch": 0.3142329020332717, "percentage": 31.42, "elapsed_time": "1:36:36", "remaining_time": "3:30:49"}
{"current_steps": 515, "total_steps": 1623, "loss": 1.0013, "lr": 0.0001636606803594457, "epoch": 0.3173136167590881, "percentage": 31.73, "elapsed_time": "1:37:34", "remaining_time": "3:29:56"}
{"current_steps": 520, "total_steps": 1623, "loss": 0.9773, "lr": 0.0001628712866590885, "epoch": 0.3203943314849045, "percentage": 32.04, "elapsed_time": "1:38:32", "remaining_time": "3:29:00"}
{"current_steps": 525, "total_steps": 1623, "loss": 0.8414, "lr": 0.00016207536040388845, "epoch": 0.3234750462107209, "percentage": 32.35, "elapsed_time": "1:39:19", "remaining_time": "3:27:42"}
{"current_steps": 530, "total_steps": 1623, "loss": 0.9793, "lr": 0.0001612729842934718, "epoch": 0.32655576093653726, "percentage": 32.66, "elapsed_time": "1:40:19", "remaining_time": "3:26:52"}
{"current_steps": 535, "total_steps": 1623, "loss": 1.0042, "lr": 0.00016046424169762827, "epoch": 0.32963647566235366, "percentage": 32.96, "elapsed_time": "1:41:19", "remaining_time": "3:26:03"}
{"current_steps": 540, "total_steps": 1623, "loss": 1.0067, "lr": 0.0001596492166476485, "epoch": 0.33271719038817005, "percentage": 33.27, "elapsed_time": "1:42:28", "remaining_time": "3:25:31"}
{"current_steps": 545, "total_steps": 1623, "loss": 0.9971, "lr": 0.0001588279938275929, "epoch": 0.33579790511398644, "percentage": 33.58, "elapsed_time": "1:43:28", "remaining_time": "3:24:41"}
{"current_steps": 550, "total_steps": 1623, "loss": 0.7794, "lr": 0.00015800065856549269, "epoch": 0.33887861983980283, "percentage": 33.89, "elapsed_time": "1:44:12", "remaining_time": "3:23:18"}
{"current_steps": 555, "total_steps": 1623, "loss": 0.9553, "lr": 0.00015716729682448393, "epoch": 0.3419593345656192, "percentage": 34.2, "elapsed_time": "1:45:14", "remaining_time": "3:22:31"}
{"current_steps": 560, "total_steps": 1623, "loss": 0.9601, "lr": 0.0001563279951938758, "epoch": 0.3450400492914356, "percentage": 34.5, "elapsed_time": "1:46:16", "remaining_time": "3:21:43"}
{"current_steps": 565, "total_steps": 1623, "loss": 1.0177, "lr": 0.00015548284088015354, "epoch": 0.348120764017252, "percentage": 34.81, "elapsed_time": "1:47:17", "remaining_time": "3:20:54"}
{"current_steps": 570, "total_steps": 1623, "loss": 0.9958, "lr": 0.00015463192169791741, "epoch": 0.3512014787430684, "percentage": 35.12, "elapsed_time": "1:48:17", "remaining_time": "3:20:02"}
{"current_steps": 575, "total_steps": 1623, "loss": 0.8352, "lr": 0.0001537753260607584, "epoch": 0.3542821934688848, "percentage": 35.43, "elapsed_time": "1:48:57", "remaining_time": "3:18:35"}
{"current_steps": 580, "total_steps": 1623, "loss": 0.9472, "lr": 0.00015291314297207175, "epoch": 0.3573629081947012, "percentage": 35.74, "elapsed_time": "1:50:02", "remaining_time": "3:17:53"}
{"current_steps": 585, "total_steps": 1623, "loss": 0.9853, "lr": 0.0001520454620158093, "epoch": 0.36044362292051757, "percentage": 36.04, "elapsed_time": "1:51:04", "remaining_time": "3:17:05"}
{"current_steps": 590, "total_steps": 1623, "loss": 0.9141, "lr": 0.00015117237334717117, "epoch": 0.36352433764633396, "percentage": 36.35, "elapsed_time": "1:52:02", "remaining_time": "3:16:10"}
{"current_steps": 595, "total_steps": 1623, "loss": 1.0516, "lr": 0.00015029396768323846, "epoch": 0.36660505237215035, "percentage": 36.66, "elapsed_time": "1:53:03", "remaining_time": "3:15:20"}
{"current_steps": 600, "total_steps": 1623, "loss": 0.8681, "lr": 0.00014941033629354734, "epoch": 0.36968576709796674, "percentage": 36.97, "elapsed_time": "1:53:46", "remaining_time": "3:13:59"}
{"current_steps": 605, "total_steps": 1623, "loss": 0.9942, "lr": 0.00014852157099060596, "epoch": 0.37276648182378314, "percentage": 37.28, "elapsed_time": "1:54:52", "remaining_time": "3:13:17"}
{"current_steps": 610, "total_steps": 1623, "loss": 1.0202, "lr": 0.00014762776412035456, "epoch": 0.3758471965495995, "percentage": 37.58, "elapsed_time": "1:55:56", "remaining_time": "3:12:31"}
{"current_steps": 615, "total_steps": 1623, "loss": 0.9941, "lr": 0.00014672900855257056, "epoch": 0.3789279112754159, "percentage": 37.89, "elapsed_time": "1:56:54", "remaining_time": "3:11:37"}
{"current_steps": 620, "total_steps": 1623, "loss": 0.9866, "lr": 0.00014582539767121904, "epoch": 0.3820086260012323, "percentage": 38.2, "elapsed_time": "1:57:53", "remaining_time": "3:10:43"}
{"current_steps": 625, "total_steps": 1623, "loss": 0.741, "lr": 0.0001449170253647498, "epoch": 0.3850893407270487, "percentage": 38.51, "elapsed_time": "1:58:35", "remaining_time": "3:09:22"}
{"current_steps": 630, "total_steps": 1623, "loss": 0.9465, "lr": 0.0001440039860163419, "epoch": 0.38817005545286504, "percentage": 38.82, "elapsed_time": "1:59:38", "remaining_time": "3:08:34"}
{"current_steps": 635, "total_steps": 1623, "loss": 0.9403, "lr": 0.00014308637449409706, "epoch": 0.39125077017868143, "percentage": 39.13, "elapsed_time": "2:00:39", "remaining_time": "3:07:44"}
{"current_steps": 640, "total_steps": 1623, "loss": 1.0146, "lr": 0.00014216428614118243, "epoch": 0.3943314849044978, "percentage": 39.43, "elapsed_time": "2:01:45", "remaining_time": "3:07:00"}
{"current_steps": 645, "total_steps": 1623, "loss": 0.9778, "lr": 0.00014123781676592418, "epoch": 0.3974121996303142, "percentage": 39.74, "elapsed_time": "2:02:50", "remaining_time": "3:06:15"}
{"current_steps": 650, "total_steps": 1623, "loss": 0.8311, "lr": 0.00014030706263185247, "epoch": 0.4004929143561306, "percentage": 40.05, "elapsed_time": "2:03:34", "remaining_time": "3:04:58"}
{"current_steps": 655, "total_steps": 1623, "loss": 0.9141, "lr": 0.00013937212044769955, "epoch": 0.403573629081947, "percentage": 40.36, "elapsed_time": "2:04:41", "remaining_time": "3:04:16"}
{"current_steps": 660, "total_steps": 1623, "loss": 0.9867, "lr": 0.0001384330873573513, "epoch": 0.4066543438077634, "percentage": 40.67, "elapsed_time": "2:05:46", "remaining_time": "3:03:31"}
{"current_steps": 665, "total_steps": 1623, "loss": 1.0004, "lr": 0.00013749006092975347, "epoch": 0.4097350585335798, "percentage": 40.97, "elapsed_time": "2:06:44", "remaining_time": "3:02:34"}
{"current_steps": 670, "total_steps": 1623, "loss": 0.9771, "lr": 0.00013654313914877414, "epoch": 0.41281577325939617, "percentage": 41.28, "elapsed_time": "2:07:42", "remaining_time": "3:01:39"}
|
||||
{"current_steps": 675, "total_steps": 1623, "loss": 0.7806, "lr": 0.00013559242040302272, "epoch": 0.41589648798521256, "percentage": 41.59, "elapsed_time": "2:08:27", "remaining_time": "3:00:24"}
|
||||
{"current_steps": 680, "total_steps": 1623, "loss": 0.9531, "lr": 0.00013463800347562706, "epoch": 0.41897720271102895, "percentage": 41.9, "elapsed_time": "2:09:33", "remaining_time": "2:59:40"}
|
||||
{"current_steps": 685, "total_steps": 1623, "loss": 0.8862, "lr": 0.00013367998753396944, "epoch": 0.42205791743684534, "percentage": 42.21, "elapsed_time": "2:10:37", "remaining_time": "2:58:52"}
|
||||
{"current_steps": 690, "total_steps": 1623, "loss": 0.978, "lr": 0.00013271847211938285, "epoch": 0.42513863216266173, "percentage": 42.51, "elapsed_time": "2:11:45", "remaining_time": "2:58:08"}
|
||||
{"current_steps": 695, "total_steps": 1623, "loss": 1.0035, "lr": 0.0001317535571368082, "epoch": 0.4282193468884781, "percentage": 42.82, "elapsed_time": "2:12:44", "remaining_time": "2:57:14"}
|
||||
{"current_steps": 700, "total_steps": 1623, "loss": 0.8737, "lr": 0.00013078534284441382, "epoch": 0.4313000616142945, "percentage": 43.13, "elapsed_time": "2:13:29", "remaining_time": "2:56:01"}
|
||||
{"current_steps": 705, "total_steps": 1623, "loss": 0.9117, "lr": 0.00012981392984317834, "epoch": 0.4343807763401109, "percentage": 43.44, "elapsed_time": "2:14:28", "remaining_time": "2:55:06"}
|
||||
{"current_steps": 710, "total_steps": 1623, "loss": 0.9657, "lr": 0.00012883941906643786, "epoch": 0.4374614910659273, "percentage": 43.75, "elapsed_time": "2:15:31", "remaining_time": "2:54:16"}
|
||||
{"current_steps": 715, "total_steps": 1623, "loss": 0.9081, "lr": 0.00012786191176939848, "epoch": 0.4405422057917437, "percentage": 44.05, "elapsed_time": "2:16:28", "remaining_time": "2:53:19"}
|
||||
{"current_steps": 720, "total_steps": 1623, "loss": 0.9299, "lr": 0.00012688150951861582, "epoch": 0.4436229205175601, "percentage": 44.36, "elapsed_time": "2:17:30", "remaining_time": "2:52:28"}
|
||||
{"current_steps": 725, "total_steps": 1623, "loss": 0.8259, "lr": 0.00012589831418144154, "epoch": 0.4467036352433765, "percentage": 44.67, "elapsed_time": "2:18:13", "remaining_time": "2:51:13"}
|
||||
{"current_steps": 730, "total_steps": 1623, "loss": 0.9407, "lr": 0.00012491242791543922, "epoch": 0.44978434996919286, "percentage": 44.98, "elapsed_time": "2:19:18", "remaining_time": "2:50:24"}
|
||||
{"current_steps": 735, "total_steps": 1623, "loss": 0.9092, "lr": 0.00012392395315776963, "epoch": 0.45286506469500926, "percentage": 45.29, "elapsed_time": "2:20:18", "remaining_time": "2:49:30"}
|
||||
{"current_steps": 740, "total_steps": 1623, "loss": 0.9285, "lr": 0.00012293299261454725, "epoch": 0.45594577942082565, "percentage": 45.59, "elapsed_time": "2:21:18", "remaining_time": "2:48:36"}
|
||||
{"current_steps": 745, "total_steps": 1623, "loss": 0.9379, "lr": 0.00012193964925016872, "epoch": 0.45902649414664204, "percentage": 45.9, "elapsed_time": "2:22:13", "remaining_time": "2:47:37"}
|
||||
{"current_steps": 750, "total_steps": 1623, "loss": 0.7754, "lr": 0.00012094402627661447, "epoch": 0.46210720887245843, "percentage": 46.21, "elapsed_time": "2:22:56", "remaining_time": "2:46:23"}
|
||||
{"current_steps": 755, "total_steps": 1623, "loss": 0.9358, "lr": 0.00011994622714272448, "epoch": 0.4651879235982748, "percentage": 46.52, "elapsed_time": "2:24:02", "remaining_time": "2:45:36"}
|
||||
{"current_steps": 760, "total_steps": 1623, "loss": 0.9574, "lr": 0.00011894635552344975, "epoch": 0.4682686383240912, "percentage": 46.83, "elapsed_time": "2:25:04", "remaining_time": "2:44:44"}
|
||||
{"current_steps": 765, "total_steps": 1623, "loss": 0.9345, "lr": 0.00011794451530908011, "epoch": 0.4713493530499076, "percentage": 47.13, "elapsed_time": "2:26:07", "remaining_time": "2:43:53"}
|
||||
{"current_steps": 770, "total_steps": 1623, "loss": 0.9837, "lr": 0.00011694081059444946, "epoch": 0.47443006777572394, "percentage": 47.44, "elapsed_time": "2:27:11", "remaining_time": "2:43:03"}
|
||||
{"current_steps": 775, "total_steps": 1623, "loss": 0.816, "lr": 0.0001159353456681201, "epoch": 0.47751078250154033, "percentage": 47.75, "elapsed_time": "2:27:54", "remaining_time": "2:41:50"}
|
||||
{"current_steps": 780, "total_steps": 1623, "loss": 0.9001, "lr": 0.00011492822500154667, "epoch": 0.4805914972273567, "percentage": 48.06, "elapsed_time": "2:28:55", "remaining_time": "2:40:57"}
|
||||
{"current_steps": 785, "total_steps": 1623, "loss": 0.8926, "lr": 0.00011391955323822126, "epoch": 0.4836722119531731, "percentage": 48.37, "elapsed_time": "2:29:54", "remaining_time": "2:40:01"}
|
||||
{"current_steps": 790, "total_steps": 1623, "loss": 1.0207, "lr": 0.00011290943518280057, "epoch": 0.4867529266789895, "percentage": 48.68, "elapsed_time": "2:30:54", "remaining_time": "2:39:07"}
|
||||
{"current_steps": 795, "total_steps": 1623, "loss": 0.9285, "lr": 0.0001118979757902162, "epoch": 0.4898336414048059, "percentage": 48.98, "elapsed_time": "2:31:58", "remaining_time": "2:38:16"}
|
||||
{"current_steps": 800, "total_steps": 1623, "loss": 0.8541, "lr": 0.00011088528015476964, "epoch": 0.4929143561306223, "percentage": 49.29, "elapsed_time": "2:32:41", "remaining_time": "2:37:04"}
|
||||
{"current_steps": 805, "total_steps": 1623, "loss": 0.9033, "lr": 0.00010987145349921251, "epoch": 0.4959950708564387, "percentage": 49.6, "elapsed_time": "2:33:41", "remaining_time": "2:36:10"}
|
||||
{"current_steps": 810, "total_steps": 1623, "loss": 0.9413, "lr": 0.0001088566011638134, "epoch": 0.49907578558225507, "percentage": 49.91, "elapsed_time": "2:34:43", "remaining_time": "2:35:17"}
|
||||
{"current_steps": 815, "total_steps": 1623, "loss": 0.9315, "lr": 0.00010784082859541292, "epoch": 0.5021565003080715, "percentage": 50.22, "elapsed_time": "2:35:43", "remaining_time": "2:34:23"}
|
||||
{"current_steps": 820, "total_steps": 1623, "loss": 0.9527, "lr": 0.0001068242413364671, "epoch": 0.5052372150338879, "percentage": 50.52, "elapsed_time": "2:36:48", "remaining_time": "2:33:33"}
|
||||
{"current_steps": 825, "total_steps": 1623, "loss": 0.8284, "lr": 0.00010580694501408138, "epoch": 0.5083179297597042, "percentage": 50.83, "elapsed_time": "2:37:32", "remaining_time": "2:32:23"}
|
||||
{"current_steps": 830, "total_steps": 1623, "loss": 0.8648, "lr": 0.00010478904532903535, "epoch": 0.5113986444855206, "percentage": 51.14, "elapsed_time": "2:38:31", "remaining_time": "2:31:27"}
|
||||
{"current_steps": 835, "total_steps": 1623, "loss": 1.0178, "lr": 0.00010377064804480025, "epoch": 0.514479359211337, "percentage": 51.45, "elapsed_time": "2:39:36", "remaining_time": "2:30:37"}
|
||||
{"current_steps": 840, "total_steps": 1623, "loss": 0.8944, "lr": 0.00010275185897654971, "epoch": 0.5175600739371534, "percentage": 51.76, "elapsed_time": "2:40:36", "remaining_time": "2:29:42"}
|
||||
{"current_steps": 845, "total_steps": 1623, "loss": 0.922, "lr": 0.00010173278398016501, "epoch": 0.5206407886629698, "percentage": 52.06, "elapsed_time": "2:41:40", "remaining_time": "2:28:51"}
|
||||
{"current_steps": 850, "total_steps": 1623, "loss": 0.7921, "lr": 0.00010071352894123654, "epoch": 0.5237215033887862, "percentage": 52.37, "elapsed_time": "2:42:24", "remaining_time": "2:27:41"}
|
||||
{"current_steps": 855, "total_steps": 1623, "loss": 0.9301, "lr": 9.969419976406165e-05, "epoch": 0.5268022181146026, "percentage": 52.68, "elapsed_time": "2:43:29", "remaining_time": "2:26:51"}
|
||||
{"current_steps": 860, "total_steps": 1623, "loss": 0.9367, "lr": 9.867490236064108e-05, "epoch": 0.529882932840419, "percentage": 52.99, "elapsed_time": "2:44:34", "remaining_time": "2:26:00"}
|
||||
{"current_steps": 865, "total_steps": 1623, "loss": 1.0116, "lr": 9.765574263967396e-05, "epoch": 0.5329636475662354, "percentage": 53.3, "elapsed_time": "2:45:47", "remaining_time": "2:25:16"}
|
||||
{"current_steps": 870, "total_steps": 1623, "loss": 0.915, "lr": 9.66368264955539e-05, "epoch": 0.5360443622920518, "percentage": 53.6, "elapsed_time": "2:46:49", "remaining_time": "2:24:23"}
|
||||
{"current_steps": 875, "total_steps": 1623, "loss": 0.8123, "lr": 9.56182597973658e-05, "epoch": 0.5391250770178682, "percentage": 53.91, "elapsed_time": "2:47:32", "remaining_time": "2:23:13"}
|
||||
{"current_steps": 880, "total_steps": 1623, "loss": 0.9215, "lr": 9.460014837788605e-05, "epoch": 0.5422057917436846, "percentage": 54.22, "elapsed_time": "2:48:39", "remaining_time": "2:22:24"}
|
||||
{"current_steps": 885, "total_steps": 1623, "loss": 0.9195, "lr": 9.358259802258581e-05, "epoch": 0.5452865064695009, "percentage": 54.53, "elapsed_time": "2:49:48", "remaining_time": "2:21:35"}
|
||||
{"current_steps": 890, "total_steps": 1623, "loss": 0.9105, "lr": 9.256571445863972e-05, "epoch": 0.5483672211953173, "percentage": 54.84, "elapsed_time": "2:50:47", "remaining_time": "2:20:39"}
|
||||
{"current_steps": 895, "total_steps": 1623, "loss": 0.965, "lr": 9.154960334394027e-05, "epoch": 0.5514479359211337, "percentage": 55.14, "elapsed_time": "2:51:46", "remaining_time": "2:19:42"}
|
||||
{"current_steps": 900, "total_steps": 1623, "loss": 0.7986, "lr": 9.053437025611973e-05, "epoch": 0.5545286506469501, "percentage": 55.45, "elapsed_time": "2:52:31", "remaining_time": "2:18:36"}
|
||||
{"current_steps": 905, "total_steps": 1623, "loss": 0.9545, "lr": 8.952012068158027e-05, "epoch": 0.5576093653727665, "percentage": 55.76, "elapsed_time": "2:53:32", "remaining_time": "2:17:40"}
|
||||
{"current_steps": 910, "total_steps": 1623, "loss": 0.9846, "lr": 8.850696000453326e-05, "epoch": 0.5606900800985829, "percentage": 56.07, "elapsed_time": "2:54:36", "remaining_time": "2:16:48"}
|
||||
{"current_steps": 915, "total_steps": 1623, "loss": 0.9375, "lr": 8.749499349604993e-05, "epoch": 0.5637707948243993, "percentage": 56.38, "elapsed_time": "2:55:42", "remaining_time": "2:15:57"}
|
||||
{"current_steps": 920, "total_steps": 1623, "loss": 0.8851, "lr": 8.64843263031228e-05, "epoch": 0.5668515095502157, "percentage": 56.69, "elapsed_time": "2:56:42", "remaining_time": "2:15:01"}
|
||||
{"current_steps": 925, "total_steps": 1623, "loss": 0.7475, "lr": 8.547506343774097e-05, "epoch": 0.5699322242760321, "percentage": 56.99, "elapsed_time": "2:57:23", "remaining_time": "2:13:51"}
|
||||
{"current_steps": 930, "total_steps": 1623, "loss": 1.0023, "lr": 8.446730976597878e-05, "epoch": 0.5730129390018485, "percentage": 57.3, "elapsed_time": "2:58:28", "remaining_time": "2:12:59"}
|
||||
{"current_steps": 935, "total_steps": 1623, "loss": 0.9047, "lr": 8.346116999709975e-05, "epoch": 0.5760936537276649, "percentage": 57.61, "elapsed_time": "2:59:31", "remaining_time": "2:12:06"}
|
||||
{"current_steps": 940, "total_steps": 1623, "loss": 0.9262, "lr": 8.245674867267724e-05, "epoch": 0.5791743684534812, "percentage": 57.92, "elapsed_time": "3:00:36", "remaining_time": "2:11:14"}
|
||||
{"current_steps": 945, "total_steps": 1623, "loss": 0.9537, "lr": 8.145415015573183e-05, "epoch": 0.5822550831792976, "percentage": 58.23, "elapsed_time": "3:01:35", "remaining_time": "2:10:17"}
|
||||
{"current_steps": 950, "total_steps": 1623, "loss": 0.7926, "lr": 8.045347861988789e-05, "epoch": 0.585335797905114, "percentage": 58.53, "elapsed_time": "3:02:20", "remaining_time": "2:09:10"}
|
||||
{"current_steps": 955, "total_steps": 1623, "loss": 0.9144, "lr": 7.945483803854936e-05, "epoch": 0.5884165126309304, "percentage": 58.84, "elapsed_time": "3:03:17", "remaining_time": "2:08:12"}
|
||||
{"current_steps": 960, "total_steps": 1623, "loss": 1.0055, "lr": 7.845833217409675e-05, "epoch": 0.5914972273567468, "percentage": 59.15, "elapsed_time": "3:04:25", "remaining_time": "2:07:22"}
|
||||
{"current_steps": 965, "total_steps": 1623, "loss": 0.9012, "lr": 7.746406456710564e-05, "epoch": 0.5945779420825632, "percentage": 59.46, "elapsed_time": "3:05:24", "remaining_time": "2:06:25"}
|
||||
{"current_steps": 970, "total_steps": 1623, "loss": 0.9128, "lr": 7.64721385255886e-05, "epoch": 0.5976586568083796, "percentage": 59.77, "elapsed_time": "3:06:23", "remaining_time": "2:05:28"}
|
||||
{"current_steps": 975, "total_steps": 1623, "loss": 0.7712, "lr": 7.548265711426104e-05, "epoch": 0.600739371534196, "percentage": 60.07, "elapsed_time": "3:07:07", "remaining_time": "2:04:21"}
|
||||
{"current_steps": 980, "total_steps": 1623, "loss": 0.9942, "lr": 7.449572314383237e-05, "epoch": 0.6038200862600123, "percentage": 60.38, "elapsed_time": "3:08:10", "remaining_time": "2:03:28"}
|
||||
{"current_steps": 985, "total_steps": 1623, "loss": 0.9889, "lr": 7.351143916032374e-05, "epoch": 0.6069008009858287, "percentage": 60.69, "elapsed_time": "3:09:13", "remaining_time": "2:02:34"}
|
||||
{"current_steps": 990, "total_steps": 1623, "loss": 0.9398, "lr": 7.252990743441293e-05, "epoch": 0.609981515711645, "percentage": 61.0, "elapsed_time": "3:10:14", "remaining_time": "2:01:38"}
|
||||
{"current_steps": 995, "total_steps": 1623, "loss": 1.0196, "lr": 7.155122995080827e-05, "epoch": 0.6130622304374614, "percentage": 61.31, "elapsed_time": "3:11:15", "remaining_time": "2:00:42"}
|
||||
{"current_steps": 1000, "total_steps": 1623, "loss": 0.803, "lr": 7.057550839765188e-05, "epoch": 0.6161429451632778, "percentage": 61.61, "elapsed_time": "3:11:58", "remaining_time": "1:59:36"}
|
||||
{"current_steps": 1005, "total_steps": 1623, "loss": 0.9066, "lr": 6.960284415595407e-05, "epoch": 0.6192236598890942, "percentage": 61.92, "elapsed_time": "3:13:06", "remaining_time": "1:58:44"}
|
||||
{"current_steps": 1010, "total_steps": 1623, "loss": 1.0486, "lr": 6.863333828905929e-05, "epoch": 0.6223043746149106, "percentage": 62.23, "elapsed_time": "3:14:12", "remaining_time": "1:57:52"}
|
||||
{"current_steps": 1015, "total_steps": 1623, "loss": 0.9454, "lr": 6.766709153214542e-05, "epoch": 0.625385089340727, "percentage": 62.54, "elapsed_time": "3:15:16", "remaining_time": "1:56:58"}
|
||||
{"current_steps": 1020, "total_steps": 1623, "loss": 0.9561, "lr": 6.670420428175705e-05, "epoch": 0.6284658040665434, "percentage": 62.85, "elapsed_time": "3:16:15", "remaining_time": "1:56:01"}
|
||||
{"current_steps": 1025, "total_steps": 1623, "loss": 0.7882, "lr": 6.574477658537375e-05, "epoch": 0.6315465187923598, "percentage": 63.15, "elapsed_time": "3:17:00", "remaining_time": "1:54:56"}
|
||||
{"current_steps": 1030, "total_steps": 1623, "loss": 0.8443, "lr": 6.4788908131015e-05, "epoch": 0.6346272335181762, "percentage": 63.46, "elapsed_time": "3:17:58", "remaining_time": "1:53:59"}
|
||||
{"current_steps": 1035, "total_steps": 1623, "loss": 0.8491, "lr": 6.38366982368819e-05, "epoch": 0.6377079482439926, "percentage": 63.77, "elapsed_time": "3:19:01", "remaining_time": "1:53:04"}
|
||||
{"current_steps": 1040, "total_steps": 1623, "loss": 0.9222, "lr": 6.288824584103816e-05, "epoch": 0.640788662969809, "percentage": 64.08, "elapsed_time": "3:20:01", "remaining_time": "1:52:07"}
|
||||
{"current_steps": 1045, "total_steps": 1623, "loss": 0.9085, "lr": 6.194364949112953e-05, "epoch": 0.6438693776956254, "percentage": 64.39, "elapsed_time": "3:21:04", "remaining_time": "1:51:12"}
|
||||
{"current_steps": 1050, "total_steps": 1623, "loss": 0.8007, "lr": 6.100300733414474e-05, "epoch": 0.6469500924214417, "percentage": 64.7, "elapsed_time": "3:21:47", "remaining_time": "1:50:07"}
|
||||
{"current_steps": 1055, "total_steps": 1623, "loss": 0.8945, "lr": 6.0066417106217455e-05, "epoch": 0.6500308071472581, "percentage": 65.0, "elapsed_time": "3:22:48", "remaining_time": "1:49:11"}
|
||||
{"current_steps": 1060, "total_steps": 1623, "loss": 0.9188, "lr": 5.9133976122471214e-05, "epoch": 0.6531115218730745, "percentage": 65.31, "elapsed_time": "3:23:51", "remaining_time": "1:48:16"}
|
||||
{"current_steps": 1065, "total_steps": 1623, "loss": 0.9509, "lr": 5.82057812669081e-05, "epoch": 0.6561922365988909, "percentage": 65.62, "elapsed_time": "3:24:49", "remaining_time": "1:47:19"}
|
||||
{"current_steps": 1070, "total_steps": 1623, "loss": 0.851, "lr": 5.728192898234195e-05, "epoch": 0.6592729513247073, "percentage": 65.93, "elapsed_time": "3:25:50", "remaining_time": "1:46:23"}
|
||||
{"current_steps": 1075, "total_steps": 1623, "loss": 0.7561, "lr": 5.6362515260377835e-05, "epoch": 0.6623536660505237, "percentage": 66.24, "elapsed_time": "3:26:33", "remaining_time": "1:45:17"}
|
||||
{"current_steps": 1080, "total_steps": 1623, "loss": 0.9267, "lr": 5.544763563143793e-05, "epoch": 0.6654343807763401, "percentage": 66.54, "elapsed_time": "3:27:36", "remaining_time": "1:44:22"}
|
||||
{"current_steps": 1085, "total_steps": 1623, "loss": 0.9299, "lr": 5.4537385154835864e-05, "epoch": 0.6685150955021565, "percentage": 66.85, "elapsed_time": "3:28:41", "remaining_time": "1:43:28"}
|
||||
{"current_steps": 1090, "total_steps": 1623, "loss": 0.8646, "lr": 5.363185840889935e-05, "epoch": 0.6715958102279729, "percentage": 67.16, "elapsed_time": "3:29:40", "remaining_time": "1:42:31"}
|
||||
{"current_steps": 1095, "total_steps": 1623, "loss": 0.9427, "lr": 5.273114948114346e-05, "epoch": 0.6746765249537893, "percentage": 67.47, "elapsed_time": "3:30:39", "remaining_time": "1:41:34"}
|
||||
{"current_steps": 1100, "total_steps": 1623, "loss": 0.7519, "lr": 5.1835351958494515e-05, "epoch": 0.6777572396796057, "percentage": 67.78, "elapsed_time": "3:31:24", "remaining_time": "1:40:30"}
|
||||
{"current_steps": 1105, "total_steps": 1623, "loss": 0.9132, "lr": 5.094455891756587e-05, "epoch": 0.680837954405422, "percentage": 68.08, "elapsed_time": "3:32:25", "remaining_time": "1:39:34"}
|
||||
{"current_steps": 1110, "total_steps": 1623, "loss": 0.9795, "lr": 5.00588629149872e-05, "epoch": 0.6839186691312384, "percentage": 68.39, "elapsed_time": "3:33:26", "remaining_time": "1:38:38"}
|
||||
{"current_steps": 1115, "total_steps": 1623, "loss": 0.905, "lr": 4.91783559777873e-05, "epoch": 0.6869993838570548, "percentage": 68.7, "elapsed_time": "3:34:25", "remaining_time": "1:37:41"}
|
||||
{"current_steps": 1120, "total_steps": 1623, "loss": 0.909, "lr": 4.830312959383238e-05, "epoch": 0.6900800985828712, "percentage": 69.01, "elapsed_time": "3:35:27", "remaining_time": "1:36:45"}
|
||||
{"current_steps": 1125, "total_steps": 1623, "loss": 0.7293, "lr": 4.7433274702319815e-05, "epoch": 0.6931608133086876, "percentage": 69.32, "elapsed_time": "3:36:11", "remaining_time": "1:35:41"}
|
||||
{"current_steps": 1130, "total_steps": 1623, "loss": 0.8847, "lr": 4.656888168432962e-05, "epoch": 0.696241528034504, "percentage": 69.62, "elapsed_time": "3:37:12", "remaining_time": "1:34:45"}
|
||||
{"current_steps": 1135, "total_steps": 1623, "loss": 0.9697, "lr": 4.571004035343315e-05, "epoch": 0.6993222427603204, "percentage": 69.93, "elapsed_time": "3:38:16", "remaining_time": "1:33:50"}
|
||||
{"current_steps": 1140, "total_steps": 1623, "loss": 0.8963, "lr": 4.485683994636144e-05, "epoch": 0.7024029574861368, "percentage": 70.24, "elapsed_time": "3:39:19", "remaining_time": "1:32:55"}
|
||||
{"current_steps": 1145, "total_steps": 1623, "loss": 0.9756, "lr": 4.400936911373308e-05, "epoch": 0.7054836722119532, "percentage": 70.55, "elapsed_time": "3:40:21", "remaining_time": "1:31:59"}
|
||||
{"current_steps": 1150, "total_steps": 1623, "loss": 0.7932, "lr": 4.3167715910842966e-05, "epoch": 0.7085643869377696, "percentage": 70.86, "elapsed_time": "3:41:03", "remaining_time": "1:30:55"}
|
||||
{"current_steps": 1155, "total_steps": 1623, "loss": 0.9168, "lr": 4.2331967788513295e-05, "epoch": 0.711645101663586, "percentage": 71.16, "elapsed_time": "3:42:03", "remaining_time": "1:29:58"}
|
||||
{"current_steps": 1160, "total_steps": 1623, "loss": 0.9272, "lr": 4.1502211584006836e-05, "epoch": 0.7147258163894024, "percentage": 71.47, "elapsed_time": "3:43:03", "remaining_time": "1:29:02"}
|
||||
{"current_steps": 1165, "total_steps": 1623, "loss": 0.9724, "lr": 4.067853351200446e-05, "epoch": 0.7178065311152187, "percentage": 71.78, "elapsed_time": "3:44:08", "remaining_time": "1:28:07"}
|
||||
{"current_steps": 1170, "total_steps": 1623, "loss": 0.9153, "lr": 3.986101915564695e-05, "epoch": 0.7208872458410351, "percentage": 72.09, "elapsed_time": "3:45:08", "remaining_time": "1:27:10"}
|
||||
{"current_steps": 1175, "total_steps": 1623, "loss": 0.7897, "lr": 3.904975345764262e-05, "epoch": 0.7239679605668515, "percentage": 72.4, "elapsed_time": "3:45:49", "remaining_time": "1:26:05"}
|
||||
{"current_steps": 1180, "total_steps": 1623, "loss": 0.931, "lr": 3.824482071144163e-05, "epoch": 0.7270486752926679, "percentage": 72.7, "elapsed_time": "3:46:59", "remaining_time": "1:25:12"}
|
||||
{"current_steps": 1185, "total_steps": 1623, "loss": 0.905, "lr": 3.744630455247739e-05, "epoch": 0.7301293900184843, "percentage": 73.01, "elapsed_time": "3:48:02", "remaining_time": "1:24:17"}
|
||||
{"current_steps": 1190, "total_steps": 1623, "loss": 0.927, "lr": 3.6654287949476626e-05, "epoch": 0.7332101047443007, "percentage": 73.32, "elapsed_time": "3:49:05", "remaining_time": "1:23:21"}
|
||||
{"current_steps": 1195, "total_steps": 1623, "loss": 0.9488, "lr": 3.586885319583858e-05, "epoch": 0.7362908194701171, "percentage": 73.63, "elapsed_time": "3:50:08", "remaining_time": "1:22:25"}
|
||||
{"current_steps": 1200, "total_steps": 1623, "loss": 0.8075, "lr": 3.5090081901084525e-05, "epoch": 0.7393715341959335, "percentage": 73.94, "elapsed_time": "3:50:49", "remaining_time": "1:21:21"}
|
||||
{"current_steps": 1205, "total_steps": 1623, "loss": 0.9658, "lr": 3.431805498237808e-05, "epoch": 0.7424522489217499, "percentage": 74.25, "elapsed_time": "3:51:55", "remaining_time": "1:20:27"}
|
||||
{"current_steps": 1210, "total_steps": 1623, "loss": 0.953, "lr": 3.355285265611784e-05, "epoch": 0.7455329636475663, "percentage": 74.55, "elapsed_time": "3:53:02", "remaining_time": "1:19:32"}
|
||||
{"current_steps": 1215, "total_steps": 1623, "loss": 0.9542, "lr": 3.279455442960238e-05, "epoch": 0.7486136783733827, "percentage": 74.86, "elapsed_time": "3:54:04", "remaining_time": "1:18:36"}
|
||||
{"current_steps": 1220, "total_steps": 1623, "loss": 0.9838, "lr": 3.204323909276924e-05, "epoch": 0.751694393099199, "percentage": 75.17, "elapsed_time": "3:55:09", "remaining_time": "1:17:40"}
|
||||
{"current_steps": 1225, "total_steps": 1623, "loss": 0.7694, "lr": 3.1298984710008484e-05, "epoch": 0.7547751078250154, "percentage": 75.48, "elapsed_time": "3:55:53", "remaining_time": "1:16:38"}
|
||||
{"current_steps": 1230, "total_steps": 1623, "loss": 0.8751, "lr": 3.056186861205136e-05, "epoch": 0.7578558225508318, "percentage": 75.79, "elapsed_time": "3:56:50", "remaining_time": "1:15:40"}
|
||||
{"current_steps": 1235, "total_steps": 1623, "loss": 0.9526, "lr": 2.9831967387935467e-05, "epoch": 0.7609365372766482, "percentage": 76.09, "elapsed_time": "3:57:58", "remaining_time": "1:14:45"}
|
||||
{"current_steps": 1240, "total_steps": 1623, "loss": 0.8726, "lr": 2.9109356877046712e-05, "epoch": 0.7640172520024646, "percentage": 76.4, "elapsed_time": "3:59:00", "remaining_time": "1:13:49"}
|
||||
{"current_steps": 1245, "total_steps": 1623, "loss": 0.943, "lr": 2.8394112161239605e-05, "epoch": 0.767097966728281, "percentage": 76.71, "elapsed_time": "4:00:01", "remaining_time": "1:12:52"}
|
||||
{"current_steps": 1250, "total_steps": 1623, "loss": 0.7294, "lr": 2.7686307557035685e-05, "epoch": 0.7701786814540974, "percentage": 77.02, "elapsed_time": "4:00:44", "remaining_time": "1:11:50"}
|
||||
{"current_steps": 1255, "total_steps": 1623, "loss": 0.8862, "lr": 2.6986016607901908e-05, "epoch": 0.7732593961799138, "percentage": 77.33, "elapsed_time": "4:01:47", "remaining_time": "1:10:54"}
|
||||
{"current_steps": 1260, "total_steps": 1623, "loss": 0.9054, "lr": 2.629331207660931e-05, "epoch": 0.7763401109057301, "percentage": 77.63, "elapsed_time": "4:02:48", "remaining_time": "1:09:57"}
|
||||
{"current_steps": 1265, "total_steps": 1623, "loss": 0.8883, "lr": 2.5608265937672436e-05, "epoch": 0.7794208256315465, "percentage": 77.94, "elapsed_time": "4:03:47", "remaining_time": "1:08:59"}
|
||||
{"current_steps": 1270, "total_steps": 1623, "loss": 0.9571, "lr": 2.4930949369871203e-05, "epoch": 0.7825015403573629, "percentage": 78.25, "elapsed_time": "4:04:46", "remaining_time": "1:08:02"}
|
||||
{"current_steps": 1275, "total_steps": 1623, "loss": 0.7375, "lr": 2.426143274885493e-05, "epoch": 0.7855822550831792, "percentage": 78.56, "elapsed_time": "4:05:32", "remaining_time": "1:07:01"}
|
||||
{"current_steps": 1280, "total_steps": 1623, "loss": 0.8827, "lr": 2.359978563983022e-05, "epoch": 0.7886629698089956, "percentage": 78.87, "elapsed_time": "4:06:37", "remaining_time": "1:06:05"}
|
||||
{"current_steps": 1285, "total_steps": 1623, "loss": 0.8892, "lr": 2.2946076790332827e-05, "epoch": 0.791743684534812, "percentage": 79.17, "elapsed_time": "4:07:42", "remaining_time": "1:05:09"}
|
||||
{"current_steps": 1290, "total_steps": 1623, "loss": 0.8561, "lr": 2.2300374123084522e-05, "epoch": 0.7948243992606284, "percentage": 79.48, "elapsed_time": "4:08:39", "remaining_time": "1:04:11"}
|
||||
{"current_steps": 1295, "total_steps": 1623, "loss": 0.9178, "lr": 2.166274472893567e-05, "epoch": 0.7979051139864448, "percentage": 79.79, "elapsed_time": "4:09:40", "remaining_time": "1:03:14"}
|
||||
{"current_steps": 1300, "total_steps": 1623, "loss": 0.7465, "lr": 2.1033254859894226e-05, "epoch": 0.8009858287122612, "percentage": 80.1, "elapsed_time": "4:10:20", "remaining_time": "1:02:11"}
|
||||
{"current_steps": 1305, "total_steps": 1623, "loss": 0.8865, "lr": 2.041196992224206e-05, "epoch": 0.8040665434380776, "percentage": 80.41, "elapsed_time": "4:11:16", "remaining_time": "1:01:13"}
|
||||
{"current_steps": 1310, "total_steps": 1623, "loss": 0.8778, "lr": 1.9798954469738762e-05, "epoch": 0.807147258163894, "percentage": 80.71, "elapsed_time": "4:12:19", "remaining_time": "1:00:17"}
|
||||
{"current_steps": 1315, "total_steps": 1623, "loss": 0.9287, "lr": 1.919427219691453e-05, "epoch": 0.8102279728897104, "percentage": 81.02, "elapsed_time": "4:13:17", "remaining_time": "0:59:19"}
|
||||
{"current_steps": 1320, "total_steps": 1623, "loss": 0.8981, "lr": 1.8597985932451856e-05, "epoch": 0.8133086876155268, "percentage": 81.33, "elapsed_time": "4:14:14", "remaining_time": "0:58:21"}
|
||||
{"current_steps": 1325, "total_steps": 1623, "loss": 0.7387, "lr": 1.8010157632657543e-05, "epoch": 0.8163894023413432, "percentage": 81.64, "elapsed_time": "4:14:54", "remaining_time": "0:57:19"}
|
||||
{"current_steps": 1330, "total_steps": 1623, "loss": 0.9106, "lr": 1.7430848375025176e-05, "epoch": 0.8194701170671596, "percentage": 81.95, "elapsed_time": "4:15:52", "remaining_time": "0:56:22"}
|
||||
{"current_steps": 1335, "total_steps": 1623, "loss": 0.9232, "lr": 1.686011835188891e-05, "epoch": 0.822550831792976, "percentage": 82.26, "elapsed_time": "4:16:50", "remaining_time": "0:55:24"}
|
||||
{"current_steps": 1340, "total_steps": 1623, "loss": 0.9458, "lr": 1.6298026864169335e-05, "epoch": 0.8256315465187923, "percentage": 82.56, "elapsed_time": "4:17:47", "remaining_time": "0:54:26"}
|
||||
{"current_steps": 1345, "total_steps": 1623, "loss": 0.9359, "lr": 1.5744632315211815e-05, "epoch": 0.8287122612446087, "percentage": 82.87, "elapsed_time": "4:18:42", "remaining_time": "0:53:28"}
|
||||
{"current_steps": 1350, "total_steps": 1623, "loss": 0.7866, "lr": 1.5199992204718294e-05, "epoch": 0.8317929759704251, "percentage": 83.18, "elapsed_time": "4:19:22", "remaining_time": "0:52:27"}
|
||||
{"current_steps": 1355, "total_steps": 1623, "loss": 0.9127, "lr": 1.4664163122772689e-05, "epoch": 0.8348736906962415, "percentage": 83.49, "elapsed_time": "4:20:18", "remaining_time": "0:51:29"}
|
||||
{"current_steps": 1360, "total_steps": 1623, "loss": 0.9092, "lr": 1.4137200743961188e-05, "epoch": 0.8379544054220579, "percentage": 83.8, "elapsed_time": "4:21:19", "remaining_time": "0:50:32"}
|
||||
{"current_steps": 1365, "total_steps": 1623, "loss": 0.9071, "lr": 1.3619159821587235e-05, "epoch": 0.8410351201478743, "percentage": 84.1, "elapsed_time": "4:22:19", "remaining_time": "0:49:35"}
|
||||
{"current_steps": 1370, "total_steps": 1623, "loss": 0.901, "lr": 1.3110094181982657e-05, "epoch": 0.8441158348736907, "percentage": 84.41, "elapsed_time": "4:23:15", "remaining_time": "0:48:36"}
|
||||
{"current_steps": 1375, "total_steps": 1623, "loss": 0.7692, "lr": 1.261005671891482e-05, "epoch": 0.8471965495995071, "percentage": 84.72, "elapsed_time": "4:24:02", "remaining_time": "0:47:37"}
|
||||
{"current_steps": 1380, "total_steps": 1623, "loss": 0.9479, "lr": 1.2119099388090716e-05, "epoch": 0.8502772643253235, "percentage": 85.03, "elapsed_time": "4:25:02", "remaining_time": "0:46:40"}
|
||||
{"current_steps": 1385, "total_steps": 1623, "loss": 0.8972, "lr": 1.1637273201758748e-05, "epoch": 0.8533579790511399, "percentage": 85.34, "elapsed_time": "4:26:01", "remaining_time": "0:45:42"}
|
||||
{"current_steps": 1390, "total_steps": 1623, "loss": 0.8494, "lr": 1.1164628223408168e-05, "epoch": 0.8564386937769563, "percentage": 85.64, "elapsed_time": "4:26:56", "remaining_time": "0:44:44"}
|
||||
{"current_steps": 1395, "total_steps": 1623, "loss": 0.9043, "lr": 1.0701213562567492e-05, "epoch": 0.8595194085027726, "percentage": 85.95, "elapsed_time": "4:27:57", "remaining_time": "0:43:47"}
|
||||
{"current_steps": 1400, "total_steps": 1623, "loss": 0.7521, "lr": 1.0247077369701653e-05, "epoch": 0.862600123228589, "percentage": 86.26, "elapsed_time": "4:28:37", "remaining_time": "0:42:47"}
|
||||
{"current_steps": 1405, "total_steps": 1623, "loss": 0.8408, "lr": 9.802266831209206e-06, "epoch": 0.8656808379544054, "percentage": 86.57, "elapsed_time": "4:29:38", "remaining_time": "0:41:50"}
|
||||
{"current_steps": 1410, "total_steps": 1623, "loss": 0.8577, "lr": 9.366828164519258e-06, "epoch": 0.8687615526802218, "percentage": 86.88, "elapsed_time": "4:30:37", "remaining_time": "0:40:52"}
|
||||
{"current_steps": 1415, "total_steps": 1623, "loss": 0.9402, "lr": 8.940806613289498e-06, "epoch": 0.8718422674060382, "percentage": 87.18, "elapsed_time": "4:31:41", "remaining_time": "0:39:56"}
|
||||
{"current_steps": 1420, "total_steps": 1623, "loss": 0.8714, "lr": 8.524246442705153e-06, "epoch": 0.8749229821318546, "percentage": 87.49, "elapsed_time": "4:32:41", "remaining_time": "0:38:58"}
|
||||
{"current_steps": 1425, "total_steps": 1623, "loss": 0.7554, "lr": 8.117190934879593e-06, "epoch": 0.878003696857671, "percentage": 87.8, "elapsed_time": "4:33:21", "remaining_time": "0:37:58"}
|
||||
{"current_steps": 1430, "total_steps": 1623, "loss": 0.9058, "lr": 7.719682384357308e-06, "epoch": 0.8810844115834874, "percentage": 88.11, "elapsed_time": "4:34:20", "remaining_time": "0:37:01"}
|
||||
{"current_steps": 1435, "total_steps": 1623, "loss": 0.9048, "lr": 7.33176209371923e-06, "epoch": 0.8841651263093038, "percentage": 88.42, "elapsed_time": "4:35:18", "remaining_time": "0:36:04"}
|
||||
{"current_steps": 1440, "total_steps": 1623, "loss": 0.9097, "lr": 6.953470369291348e-06, "epoch": 0.8872458410351202, "percentage": 88.72, "elapsed_time": "4:36:21", "remaining_time": "0:35:07"}
|
||||
{"current_steps": 1445, "total_steps": 1623, "loss": 0.9375, "lr": 6.5848465169566e-06, "epoch": 0.8903265557609366, "percentage": 89.03, "elapsed_time": "4:37:16", "remaining_time": "0:34:09"}
|
||||
{"current_steps": 1450, "total_steps": 1623, "loss": 0.7327, "lr": 6.225928838071016e-06, "epoch": 0.893407270486753, "percentage": 89.34, "elapsed_time": "4:37:56", "remaining_time": "0:33:09"}
|
||||
{"current_steps": 1455, "total_steps": 1623, "loss": 0.829, "lr": 5.876754625483904e-06, "epoch": 0.8964879852125693, "percentage": 89.65, "elapsed_time": "4:38:52", "remaining_time": "0:32:12"}
|
||||
{"current_steps": 1460, "total_steps": 1623, "loss": 0.893, "lr": 5.537360159663108e-06, "epoch": 0.8995686999383857, "percentage": 89.96, "elapsed_time": "4:39:58", "remaining_time": "0:31:15"}
{"current_steps": 1465, "total_steps": 1623, "loss": 0.8752, "lr": 5.207780704925314e-06, "epoch": 0.9026494146642021, "percentage": 90.26, "elapsed_time": "4:40:59", "remaining_time": "0:30:18"}
{"current_steps": 1470, "total_steps": 1623, "loss": 0.9341, "lr": 4.888050505771868e-06, "epoch": 0.9057301293900185, "percentage": 90.57, "elapsed_time": "4:42:00", "remaining_time": "0:29:21"}
{"current_steps": 1475, "total_steps": 1623, "loss": 0.7766, "lr": 4.578202783330799e-06, "epoch": 0.9088108441158349, "percentage": 90.88, "elapsed_time": "4:42:40", "remaining_time": "0:28:21"}
{"current_steps": 1480, "total_steps": 1623, "loss": 0.8861, "lr": 4.2782697319048605e-06, "epoch": 0.9118915588416513, "percentage": 91.19, "elapsed_time": "4:43:37", "remaining_time": "0:27:24"}
{"current_steps": 1485, "total_steps": 1623, "loss": 0.8434, "lr": 3.988282515626585e-06, "epoch": 0.9149722735674677, "percentage": 91.5, "elapsed_time": "4:44:37", "remaining_time": "0:26:26"}
{"current_steps": 1490, "total_steps": 1623, "loss": 0.8912, "lr": 3.7082712652200867e-06, "epoch": 0.9180529882932841, "percentage": 91.81, "elapsed_time": "4:45:33", "remaining_time": "0:25:29"}
{"current_steps": 1495, "total_steps": 1623, "loss": 0.9744, "lr": 3.438265074870417e-06, "epoch": 0.9211337030191005, "percentage": 92.11, "elapsed_time": "4:46:28", "remaining_time": "0:24:31"}
{"current_steps": 1500, "total_steps": 1623, "loss": 0.7479, "lr": 3.1782919992006333e-06, "epoch": 0.9242144177449169, "percentage": 92.42, "elapsed_time": "4:47:09", "remaining_time": "0:23:32"}
{"current_steps": 1505, "total_steps": 1623, "loss": 0.9081, "lr": 2.9283790503567222e-06, "epoch": 0.9272951324707333, "percentage": 92.73, "elapsed_time": "4:48:15", "remaining_time": "0:22:36"}
{"current_steps": 1510, "total_steps": 1623, "loss": 0.9355, "lr": 2.6885521952010105e-06, "epoch": 0.9303758471965496, "percentage": 93.04, "elapsed_time": "4:49:20", "remaining_time": "0:21:39"}
{"current_steps": 1515, "total_steps": 1623, "loss": 0.8545, "lr": 2.458836352614069e-06, "epoch": 0.933456561922366, "percentage": 93.35, "elapsed_time": "4:50:16", "remaining_time": "0:20:41"}
{"current_steps": 1520, "total_steps": 1623, "loss": 0.9361, "lr": 2.239255390905581e-06, "epoch": 0.9365372766481824, "percentage": 93.65, "elapsed_time": "4:51:11", "remaining_time": "0:19:43"}
{"current_steps": 1525, "total_steps": 1623, "loss": 0.7706, "lr": 2.029832125334319e-06, "epoch": 0.9396179913739988, "percentage": 93.96, "elapsed_time": "4:51:52", "remaining_time": "0:18:45"}
{"current_steps": 1530, "total_steps": 1623, "loss": 0.842, "lr": 1.8305883157375804e-06, "epoch": 0.9426987060998152, "percentage": 94.27, "elapsed_time": "4:52:48", "remaining_time": "0:17:47"}
{"current_steps": 1535, "total_steps": 1623, "loss": 0.9651, "lr": 1.6415446642702337e-06, "epoch": 0.9457794208256316, "percentage": 94.58, "elapsed_time": "4:53:47", "remaining_time": "0:16:50"}
{"current_steps": 1540, "total_steps": 1623, "loss": 0.902, "lr": 1.462720813253682e-06, "epoch": 0.9488601355514479, "percentage": 94.89, "elapsed_time": "4:54:45", "remaining_time": "0:15:53"}
{"current_steps": 1545, "total_steps": 1623, "loss": 0.9256, "lr": 1.2941353431350056e-06, "epoch": 0.9519408502772643, "percentage": 95.19, "elapsed_time": "4:55:45", "remaining_time": "0:14:55"}
{"current_steps": 1550, "total_steps": 1623, "loss": 0.7639, "lr": 1.135805770556364e-06, "epoch": 0.9550215650030807, "percentage": 95.5, "elapsed_time": "4:56:26", "remaining_time": "0:13:57"}
{"current_steps": 1555, "total_steps": 1623, "loss": 0.931, "lr": 9.877485465349058e-07, "epoch": 0.958102279728897, "percentage": 95.81, "elapsed_time": "4:57:30", "remaining_time": "0:13:00"}
{"current_steps": 1560, "total_steps": 1623, "loss": 0.8409, "lr": 8.499790547535025e-07, "epoch": 0.9611829944547134, "percentage": 96.12, "elapsed_time": "4:58:30", "remaining_time": "0:12:03"}
{"current_steps": 1565, "total_steps": 1623, "loss": 0.867, "lr": 7.225116099623286e-07, "epoch": 0.9642637091805298, "percentage": 96.43, "elapsed_time": "4:59:34", "remaining_time": "0:11:06"}
{"current_steps": 1570, "total_steps": 1623, "loss": 0.9427, "lr": 6.053594564914611e-07, "epoch": 0.9673444239063462, "percentage": 96.73, "elapsed_time": "5:00:33", "remaining_time": "0:10:08"}
{"current_steps": 1575, "total_steps": 1623, "loss": 0.7485, "lr": 4.985347668747809e-07, "epoch": 0.9704251386321626, "percentage": 97.04, "elapsed_time": "5:01:13", "remaining_time": "0:09:10"}
{"current_steps": 1580, "total_steps": 1623, "loss": 0.9249, "lr": 4.0204864058522864e-07, "epoch": 0.973505853357979, "percentage": 97.35, "elapsed_time": "5:02:09", "remaining_time": "0:08:13"}
{"current_steps": 1585, "total_steps": 1623, "loss": 0.9969, "lr": 3.15911102881461e-07, "epoch": 0.9765865680837954, "percentage": 97.66, "elapsed_time": "5:03:11", "remaining_time": "0:07:16"}
{"current_steps": 1590, "total_steps": 1623, "loss": 0.8852, "lr": 2.40131103766239e-07, "epoch": 0.9796672828096118, "percentage": 97.97, "elapsed_time": "5:04:07", "remaining_time": "0:06:18"}
{"current_steps": 1595, "total_steps": 1623, "loss": 0.9672, "lr": 1.747165170564724e-07, "epoch": 0.9827479975354282, "percentage": 98.27, "elapsed_time": "5:05:06", "remaining_time": "0:05:21"}
{"current_steps": 1600, "total_steps": 1623, "loss": 0.7987, "lr": 1.1967413956510686e-07, "epoch": 0.9858287122612446, "percentage": 98.58, "elapsed_time": "5:05:44", "remaining_time": "0:04:23"}
{"current_steps": 1605, "total_steps": 1623, "loss": 0.8614, "lr": 7.500969039491157e-08, "epoch": 0.988909426987061, "percentage": 98.89, "elapsed_time": "5:06:39", "remaining_time": "0:03:26"}
{"current_steps": 1610, "total_steps": 1623, "loss": 0.9483, "lr": 4.0727810344254325e-08, "epoch": 0.9919901417128774, "percentage": 99.2, "elapsed_time": "5:07:39", "remaining_time": "0:02:29"}
{"current_steps": 1615, "total_steps": 1623, "loss": 0.884, "lr": 1.6832061424865153e-08, "epoch": 0.9950708564386938, "percentage": 99.51, "elapsed_time": "5:08:36", "remaining_time": "0:01:31"}
{"current_steps": 1620, "total_steps": 1623, "loss": 0.8332, "lr": 3.3249264917878387e-09, "epoch": 0.9981515711645101, "percentage": 99.82, "elapsed_time": "5:09:33", "remaining_time": "0:00:34"}
{"current_steps": 1623, "total_steps": 1623, "epoch": 1.0, "percentage": 100.0, "elapsed_time": "5:10:14", "remaining_time": "0:00:00"}
2311
trainer_state.json
Normal file
File diff suppressed because it is too large
3
training_args.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a1cd56e871d61c9b89f86c2576d0195672a4e92fd284eff48c5ba860cc0ec44
size 7864
BIN
training_loss.png
Normal file
Binary file not shown.
Size: 69 KiB