Initialize project; model provided by the ModelHub XC community

Model: PKU-Alignment/ProgressGym-HistLlama3-8B-C013-instruct-v0.2 Source: Original Platform
2026-05-26 09:23:15 +08:00
commit 65d2ddfdc4
25 changed files with 413911 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,39 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+model-00001-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00002-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00003-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00004-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,210 @@
+---
+license: cc-by-4.0
+tags:
+- alignment
+- value alignment
+- AI safety
+- safety
+- LLM
+- history
+datasets:
+- PKU-Alignment/ProgressGym-HistText
+- PKU-Alignment/ProgressGym-TimelessQA
+base_model:
+- PKU-Alignment/ProgressGym-HistLlama3-8B-C013-pretrain
+- meta-llama/Meta-Llama-3-8B
+---
+
+# ProgressGym-HistLlama3-8B-C013-instruct
+
+## Overview
+
+#### The ProgressGym Framework
+
+![Framework Diagram](./readme-assets/main-diagram.png)
+
+**ProgressGym-HistLlama3-8B-C013-instruct** is part of the **ProgressGym** framework for research and experimentation on *progress alignment* - the emulation of moral progress in AI alignment algorithms, as a measure to prevent risks of societal value lock-in. 
+
+To quote the paper [*ProgressGym: Alignment with a Millennium of Moral Progress*](https://arxiv.org/abs/2406.20087):
+
+> Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. 
+>
+> We introduce *progress alignment* as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots.
+
+#### ProgressGym-HistLlama3-8B-C013-instruct
+
+ProgressGym-HistLlama3-8B-C013-instruct is one of the **36 historical language models** in the ProgressGym framework. 
+
+**ProgressGym-HistLlama3-8B-C013-instruct is under continual iteration.** Improving upon the current version, new versions of the model are currently being trained to reflect historical moral tendencies in ever more comprehensive ways.
+
+**ProgressGym-HistLlama3-8B-C013-instruct is a 13th-century historical language model.** Based on [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), it is continued-pretrained on the 13th-century text data from [ProgressGym-HistText](https://huggingface.co/datasets/PKU-Alignment/ProgressGym-HistText), using the following hyperparameters:
+
+- learning_rate: 1.5e-05
+- train_batch_size: 8
+- eval_batch_size: 16
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- total_train_batch_size: 64
+- total_eval_batch_size: 128
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: polynomial
+- lr_scheduler_warmup_steps: 20
+- num_epochs: 4.0
+- mixed_precision_training: Native AMP
+
+... with the following training results:
+
+| Training Loss | Epoch  | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 1.7594        | 0.0149 | 1    | 1.7163          |
+| 1.7333        | 0.0746 | 5    | 1.7008          |
+| 1.6854        | 0.1493 | 10   | 1.6825          |
+| 1.6897        | 0.2239 | 15   | 1.6701          |
+| 1.6656        | 0.2985 | 20   | 1.6651          |
+| 1.7254        | 0.3731 | 25   | 1.6679          |
+| 1.7178        | 0.4478 | 30   | 1.6542          |
+| 1.6656        | 0.5224 | 35   | 1.6459          |
+| 1.6647        | 0.5970 | 40   | 1.6308          |
+| 1.6645        | 0.6716 | 45   | 1.6205          |
+| 1.6151        | 0.7463 | 50   | 1.6129          |
+| 1.6359        | 0.8209 | 55   | 1.6052          |
+| 1.5885        | 0.8955 | 60   | 1.5995          |
+| 1.6142        | 0.9701 | 65   | 1.5943          |
+| 1.4875        | 1.0448 | 70   | 1.5963          |
+| 1.3844        | 1.1194 | 75   | 1.6118          |
+| 1.3555        | 1.1940 | 80   | 1.6069          |
+| 1.3597        | 1.2687 | 85   | 1.6040          |
+| 1.3737        | 1.3433 | 90   | 1.6071          |
+| 1.3492        | 1.4179 | 95   | 1.6074          |
+| 1.3826        | 1.4925 | 100  | 1.6055          |
+| 1.3533        | 1.5672 | 105  | 1.6035          |
+| 1.3611        | 1.6418 | 110  | 1.6023          |
+| 1.328         | 1.7164 | 115  | 1.6022          |
+| 1.3443        | 1.7910 | 120  | 1.6026          |
+| 1.3386        | 1.8657 | 125  | 1.6029          |
+| 1.3396        | 1.9403 | 130  | 1.6029          |
+| 1.3573        | 2.0149 | 135  | 1.6029          |
+| 1.3754        | 2.0896 | 140  | 1.6034          |
+| 1.3229        | 2.1642 | 145  | 1.6044          |
+| 1.3194        | 2.2388 | 150  | 1.6055          |
+| 1.3361        | 2.3134 | 155  | 1.6065          |
+| 1.3231        | 2.3881 | 160  | 1.6072          |
+| 1.32          | 2.4627 | 165  | 1.6076          |
+| 1.3406        | 2.5373 | 170  | 1.6078          |
+| 1.3184        | 2.6119 | 175  | 1.6079          |
+| 1.2745        | 2.6866 | 180  | 1.6080          |
+| 1.3024        | 2.7612 | 185  | 1.6079          |
+| 1.3243        | 2.8358 | 190  | 1.6079          |
+| 1.3239        | 2.9104 | 195  | 1.6080          |
+| 1.3349        | 2.9851 | 200  | 1.6081          |
+| 1.337         | 3.0597 | 205  | 1.6079          |
+| 1.3091        | 3.1343 | 210  | 1.6078          |
+| 1.3266        | 3.2090 | 215  | 1.6079          |
+| 1.3014        | 3.2836 | 220  | 1.6083          |
+| 1.3153        | 3.3582 | 225  | 1.6086          |
+| 1.3192        | 3.4328 | 230  | 1.6090          |
+| 1.315         | 3.5075 | 235  | 1.6093          |
+| 1.3047        | 3.5821 | 240  | 1.6093          |
+| 1.3208        | 3.6567 | 245  | 1.6093          |
+| 1.362         | 3.7313 | 250  | 1.6093          |
+| 1.3255        | 3.8060 | 255  | 1.6091          |
+| 1.2941        | 3.8806 | 260  | 1.6089          |
+| 1.3254        | 3.9552 | 265  | 1.6086          |
+
+Note that the training data volume for the continued pretraining stage is capped at 3GB. When the corresponding century's corpus exceeds this volume, the training data is randomly sampled to fit the volume.
+
+**ProgressGym-HistLlama3-8B-C013-instruct is an instruction-tuned language model.** It is tuned on [ProgressGym-TimelessQA](https://huggingface.co/datasets/PKU-Alignment/ProgressGym-TimelessQA), using the following hyperparameters. Note, however, that the snapshot at training step 10 is used for the final model, to minimize erosion of the value tendencies learned during continued pretraining; we qualitatively observe that this snapshot still possesses strong instruction-following capabilities.
+- learning_rate: 1.5e-05
+- train_batch_size: 8
+- eval_batch_size: 16
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- total_train_batch_size: 64
+- total_eval_batch_size: 128
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: polynomial
+- lr_scheduler_warmup_steps: 20
+- num_epochs: 4.0
+- mixed_precision_training: Native AMP
+
+... with the following training results:
+
+| Training Loss | Epoch  | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 0.9805        | 0.0208 | 1    | 0.9737          |
+| 0.9446        | 0.1042 | 5    | 0.9455          |
+| 0.8481        | 0.2083 | 10   | 0.8154          |
+| 0.7794        | 0.3125 | 15   | 0.8123          |
+| 0.7798        | 0.4167 | 20   | 0.8411          |
+| 0.8576        | 0.5208 | 25   | 0.8676          |
+| 0.8852        | 0.625  | 30   | 0.8673          |
+| 0.8529        | 0.7292 | 35   | 0.8561          |
+| 0.8224        | 0.8333 | 40   | 0.8470          |
+| 0.8536        | 0.9375 | 45   | 0.8378          |
+| 0.662         | 1.0417 | 50   | 0.8294          |
+| 0.437         | 1.1458 | 55   | 0.8531          |
+| 0.4402        | 1.25   | 60   | 0.8569          |
+| 0.4244        | 1.3542 | 65   | 0.8569          |
+| 0.4495        | 1.4583 | 70   | 0.8547          |
+| 0.4689        | 1.5625 | 75   | 0.8494          |
+| 0.4309        | 1.6667 | 80   | 0.8461          |
+| 0.4299        | 1.7708 | 85   | 0.8446          |
+| 0.4461        | 1.875  | 90   | 0.8440          |
+| 0.4474        | 1.9792 | 95   | 0.8439          |
+| 0.3614        | 2.0833 | 100  | 0.8445          |
+| 0.3861        | 2.1875 | 105  | 0.8457          |
+| 0.3829        | 2.2917 | 110  | 0.8473          |
+| 0.3764        | 2.3958 | 115  | 0.8488          |
+| 0.3655        | 2.5    | 120  | 0.8500          |
+| 0.4243        | 2.6042 | 125  | 0.8511          |
+| 0.3884        | 2.7083 | 130  | 0.8520          |
+| 0.3634        | 2.8125 | 135  | 0.8528          |
+| 0.3846        | 2.9167 | 140  | 0.8537          |
+| 0.3872        | 3.0208 | 145  | 0.8547          |
+| 0.3869        | 3.125  | 150  | 0.8558          |
+| 0.3876        | 3.2292 | 155  | 0.8566          |
+| 0.3844        | 3.3333 | 160  | 0.8573          |
+| 0.3535        | 3.4375 | 165  | 0.8579          |
+| 0.3488        | 3.5417 | 170  | 0.8588          |
+| 0.3464        | 3.6458 | 175  | 0.8598          |
+| 0.361         | 3.75   | 180  | 0.8607          |
+| 0.3674        | 3.8542 | 185  | 0.8612          |
+| 0.3988        | 3.9583 | 190  | 0.8612          |
+
+
+## Links
+
+- **[Paper Preprint]**  [ProgressGym: Alignment with a Millennium of Moral Progress](https://arxiv.org/abs/2406.20087)
+- **[Leaderboard & Interactive Playground]** [PKU-Alignment/ProgressGym-LeaderBoard](https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard)
+- **[Huggingface Data & Model Collection]** [PKU-Alignment/ProgressGym](https://huggingface.co/collections/PKU-Alignment/progressgym-666735fcf3e4efa276226eaa)
+- **[Github Codebase]** [PKU-Alignment/ProgressGym](https://github.com/PKU-Alignment/ProgressGym)
+- **[Documentation]** [ProgressGym Documentation](https://pku-alignment.github.io/ProgressGym/)
+- **[PyPI Package]** *(coming soon - [stay tuned](https://forms.gle/1TWFLL4ZCLeYTD5N6)!)*
+
+## Citation
+
+If the datasets, models, or framework of ProgressGym help you in your project, please cite ProgressGym using the bibtex entry below.
+
+```text
+@article{progressgym,
+  title={ProgressGym: Alignment with a Millennium of Moral Progress},
+  author={Tianyi Qiu and Yang Zhang and Xuchuan Huang and Jasmine Xinze Li and Jiaming Ji and Yaodong Yang},
+  journal={arXiv preprint arXiv:2406.20087},
+  eprint={2406.20087},
+  eprinttype = {arXiv},
+  year={2024}
+}
+```
+
+## Ethics Statement
+
+- **Copyright information of historical text data sources**:
+  - Project Gutenberg, one of the four sources of our historical text data, consists only of texts in the public domain.
+  - For the text that we draw from Internet Archive, we only include texts uploaded by *Library of Congress*, which are freely released online by the U.S. Library of Congress for research and public use.
+  - The text data from Early English Books Online are, according to their publisher, "freely available to the public" and "available for access, distribution, use, or reuse by anyone".
+  - The last remaining source of our historical text data, the Pile of Law dataset, is released under a Creative Commons license, which we adhere to in our use.
+- **Reproducibility**: To ensure reproducibility, we open-source all the code involved in the production of our main results (including the entire pipeline starting from data collection and model training), as well as the supporting infrastructure (the ProgressGym framework), making replication as easy as running a few simple script files.
+- **Misuse Prevention**: In order to prevent potential misuse of progress alignment algorithms, we have carefully formulated progress alignment as strictly value-neutral, without *a priori* assumptions on the direction of progress. In the event of potential misuse of our dataset, we condemn any misuse attempt to the strongest degree possible, and will work with the research community on whistleblowing for such attempts. 
+- **Open-Sourcing**: We confirm that our code, data, and models are to be open-sourced under a CC-BY 4.0 license. We will continue to maintain and update our open-source repositories and models.
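The batch-size figures in the README's two hyperparameter lists follow from simple arithmetic. The sketch below reproduces them and estimates the number of optimizer steps per epoch from the first row of each results table. It assumes no gradient accumulation (consistent with 8 × 8 = 64) and that the epoch value logged at step 1 equals 1 / steps-per-epoch; both are inferences from the README, not statements by the authors, and the helper names are hypothetical.

```python
# Values copied from the README's hyperparameter lists (identical for both stages).
PER_DEVICE_TRAIN_BATCH = 8
PER_DEVICE_EVAL_BATCH = 16
NUM_DEVICES = 8


def total_batch_size(per_device: int, num_devices: int, grad_accum: int = 1) -> int:
    """Effective batch size per step; grad_accum=1 is an assumption."""
    return per_device * num_devices * grad_accum


def steps_per_epoch(epoch_at_step_1: float) -> int:
    """Estimate steps per epoch from the epoch fraction logged at step 1."""
    return round(1 / epoch_at_step_1)


train_total = total_batch_size(PER_DEVICE_TRAIN_BATCH, NUM_DEVICES)
eval_total = total_batch_size(PER_DEVICE_EVAL_BATCH, NUM_DEVICES)
pretrain_steps = steps_per_epoch(0.0149)  # first row, continued-pretraining table
instruct_steps = steps_per_epoch(0.0208)  # first row, instruction-tuning table

print(train_total, eval_total, pretrain_steps, instruct_steps)
```

The estimates agree with later rows of the tables: 65 / 67 ≈ 0.9701 for continued pretraining and 45 / 48 = 0.9375 for instruction tuning.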
--- a/all_results.json
+++ b/all_results.json
@@ -0,0 +1,12 @@
+{
+    "epoch": 4.0,
+    "eval_loss": 0.8123041987419128,
+    "eval_runtime": 2.1028,
+    "eval_samples_per_second": 161.691,
+    "eval_steps_per_second": 1.427,
+    "total_flos": 5363820134400.0,
+    "train_loss": 0.5090278356025616,
+    "train_runtime": 6014.9673,
+    "train_samples_per_second": 2.031,
+    "train_steps_per_second": 0.032
+}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,28 @@
+{
+  "_name_or_path": "./output/training_results/C013_llama3-8b-base_pretrain_20240428_005832/",
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 128000,
+  "eos_token_id": 128001,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 8192,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 500000.0,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float16",
+  "transformers_version": "4.40.0",
+  "use_cache": false,
+  "vocab_size": 128256
+}
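As a sanity check on `config.json`, the parameter count can be derived from the architecture fields alone. The sketch below assumes the standard Llama layout (bias-free projections, two RMSNorm weights per layer, one final norm, untied input and output embeddings); the final norm is not visible in the truncated weight map shown in this commit, so treat that term as an assumption. The function name is hypothetical.

```python
# Architecture values copied from config.json.
HIDDEN = 4096
INTERMEDIATE = 14336
NUM_LAYERS = 32
NUM_HEADS = 32
NUM_KV_HEADS = 8
VOCAB = 128256
BYTES_PER_PARAM = 2  # torch_dtype is float16


def llama_param_count() -> int:
    """Count parameters for a Llama-style decoder with grouped-query attention."""
    head_dim = HIDDEN // NUM_HEADS      # 128
    kv_dim = NUM_KV_HEADS * head_dim    # 1024, shared by k_proj and v_proj
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q_proj, o_proj, k_proj, v_proj
    mlp = 3 * HIDDEN * INTERMEDIATE     # gate_proj, up_proj, down_proj
    norms = 2 * HIDDEN                  # input and post-attention layernorm
    per_layer = attn + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN     # embed_tokens and lm_head (untied)
    final_norm = HIDDEN                 # assumed; not shown in the truncated index
    return NUM_LAYERS * per_layer + embeddings + final_norm


params = llama_param_count()
size_bytes = params * BYTES_PER_PARAM
print(params, size_bytes)
```

Under these assumptions the byte count equals the `total_size` recorded in `model.safetensors.index.json` (16060522496).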
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "text-generation", "allow_remote": true}
--- a/eval_results.json
+++ b/eval_results.json
@@ -0,0 +1,7 @@
+{
+    "epoch": 4.0,
+    "eval_loss": 0.8123041987419128,
+    "eval_runtime": 2.1028,
+    "eval_samples_per_second": 161.691,
+    "eval_steps_per_second": 1.427
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,6 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 128000,
+  "eos_token_id": 128001,
+  "transformers_version": "4.40.0"
+}
--- a/model-00001-of-00004.safetensors
+++ b/model-00001-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:04e95a34b63c8a3b1e9d555837cb3e8c5eb62eef55da32ef1ea4eeb663e353cc
+size 4976698592
--- a/model-00002-of-00004.safetensors
+++ b/model-00002-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e1d1ce3df0cb483d4209d598838fb2b2611ac9b9441b6ee991f0d074bc608571
+size 4999802616
--- a/model-00003-of-00004.safetensors
+++ b/model-00003-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:63918ea2dfb5e85170a7df2101dc8d9bd5aeec78b9646b074703cfe2e75ffab6
+size 4915916080
--- a/model-00004-of-00004.safetensors
+++ b/model-00004-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d0c8544b338326968f6e3864335cf01f7fd7e6b6b725f20c34b4fc695522507d
+size 1168138808
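The four `.safetensors` entries above are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object ID, and the byte size of the real file. The sketch below parses one such pointer and compares the summed shard sizes with the `total_size` reported in `model.safetensors.index.json`. The helper name is hypothetical, and attributing the small difference to per-shard safetensors headers is an assumption.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # tolerate diff-style '+' prefixes
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


# Pointer for the first shard, copied from this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:04e95a34b63c8a3b1e9d555837cb3e8c5eb62eef55da32ef1ea4eeb663e353cc\n"
    "size 4976698592\n"
)

# Sizes of all four shards, and the tensor byte total from the index file.
SHARD_SIZES = [4976698592, 4999802616, 4915916080, 1168138808]
INDEX_TOTAL_SIZE = 16060522496

overhead = sum(SHARD_SIZES) - INDEX_TOTAL_SIZE
print(pointer["size"], sum(SHARD_SIZES), overhead)
```

The shards total 33,600 bytes more than `total_size`, which is presumably file-level metadata rather than tensor data.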
--- a/model.safetensors.index.json
+++ b/model.safetensors.index.json
@@ -0,0 +1,298 @@
+{
+  "metadata": {
+    "total_size": 16060522496
+  },
+  "weight_map": {
+    "lm_head.weight": "model-00004-of-00004.safetensors",
+    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.norm.weight": "model-00004-of-00004.safetensors"
+  }
+}
--- a/readme-assets/data-sources.png
+++ b/readme-assets/data-sources.png
--- a/readme-assets/data-stats.png
+++ b/readme-assets/data-stats.png
--- a/readme-assets/main-diagram.png
+++ b/readme-assets/main-diagram.png
--- a/readme-assets/moral-evals.png
+++ b/readme-assets/moral-evals.png
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,23 @@
+{
+  "bos_token": {
+    "content": "<|begin_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|end_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|end_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
--- a/train_results.json
+++ b/train_results.json
@@ -0,0 +1,8 @@
+{
+    "epoch": 4.0,
+    "total_flos": 5363820134400.0,
+    "train_loss": 0.5090278356025616,
+    "train_runtime": 6014.9673,
+    "train_samples_per_second": 2.031,
+    "train_steps_per_second": 0.032
+}
--- a/trainer_log.jsonl
+++ b/trainer_log.jsonl
@@ -0,0 +1,80 @@
+{"current_steps": 1, "total_steps": 192, "loss": 0.9805, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 0.0, "epoch": 0.020833333333333332, "percentage": 0.52, "elapsed_time": "0:00:21", "remaining_time": "1:08:51"}
+{"current_steps": 1, "total_steps": 192, "loss": null, "eval_loss": 0.9736970067024231, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.020833333333333332, "percentage": 0.52, "elapsed_time": "0:00:21", "remaining_time": "1:08:51"}
+{"current_steps": 5, "total_steps": 192, "loss": 0.9446, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.5e-06, "epoch": 0.10416666666666667, "percentage": 2.6, "elapsed_time": "0:01:25", "remaining_time": "0:53:13"}
+{"current_steps": 5, "total_steps": 192, "loss": null, "eval_loss": 0.9454841613769531, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.10416666666666667, "percentage": 2.6, "elapsed_time": "0:01:25", "remaining_time": "0:53:13"}
+{"current_steps": 10, "total_steps": 192, "loss": 0.8481, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.25e-06, "epoch": 0.20833333333333334, "percentage": 5.21, "elapsed_time": "0:04:04", "remaining_time": "1:14:01"}
+{"current_steps": 10, "total_steps": 192, "loss": null, "eval_loss": 0.8153812289237976, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.20833333333333334, "percentage": 5.21, "elapsed_time": "0:04:04", "remaining_time": "1:14:01"}
+{"current_steps": 15, "total_steps": 192, "loss": 0.7794, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 9e-06, "epoch": 0.3125, "percentage": 7.81, "elapsed_time": "0:06:37", "remaining_time": "1:18:10"}
+{"current_steps": 15, "total_steps": 192, "loss": null, "eval_loss": 0.8123041987419128, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.3125, "percentage": 7.81, "elapsed_time": "0:06:37", "remaining_time": "1:18:10"}
+{"current_steps": 20, "total_steps": 192, "loss": 0.7798, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.275e-05, "epoch": 0.4166666666666667, "percentage": 10.42, "elapsed_time": "0:09:16", "remaining_time": "1:19:44"}
+{"current_steps": 20, "total_steps": 192, "loss": null, "eval_loss": 0.8410752415657043, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.4166666666666667, "percentage": 10.42, "elapsed_time": "0:09:16", "remaining_time": "1:19:44"}
+{"current_steps": 25, "total_steps": 192, "loss": 0.8576, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.3195176200175283e-05, "epoch": 0.5208333333333334, "percentage": 13.02, "elapsed_time": "0:11:57", "remaining_time": "1:19:51"}
+{"current_steps": 25, "total_steps": 192, "loss": null, "eval_loss": 0.8676239848136902, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.5208333333333334, "percentage": 13.02, "elapsed_time": "0:11:57", "remaining_time": "1:19:51"}
+{"current_steps": 30, "total_steps": 192, "loss": 0.8852, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 9.515676612044427e-06, "epoch": 0.625, "percentage": 15.62, "elapsed_time": "0:14:41", "remaining_time": "1:19:17"}
+{"current_steps": 30, "total_steps": 192, "loss": null, "eval_loss": 0.867268979549408, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.625, "percentage": 15.62, "elapsed_time": "0:14:41", "remaining_time": "1:19:17"}
+{"current_steps": 35, "total_steps": 192, "loss": 0.8529, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 6.797580677308734e-06, "epoch": 0.7291666666666666, "percentage": 18.23, "elapsed_time": "0:17:12", "remaining_time": "1:17:10"}
+{"current_steps": 35, "total_steps": 192, "loss": null, "eval_loss": 0.8560981154441833, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.7291666666666666, "percentage": 18.23, "elapsed_time": "0:17:12", "remaining_time": "1:17:10"}
+{"current_steps": 40, "total_steps": 192, "loss": 0.8224, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 4.808575415542887e-06, "epoch": 0.8333333333333334, "percentage": 20.83, "elapsed_time": "0:19:49", "remaining_time": "1:15:19"}
+{"current_steps": 40, "total_steps": 192, "loss": null, "eval_loss": 0.8470456004142761, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.8333333333333334, "percentage": 20.83, "elapsed_time": "0:19:49", "remaining_time": "1:15:19"}
+{"current_steps": 45, "total_steps": 192, "loss": 0.8536, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 3.3676619069852654e-06, "epoch": 0.9375, "percentage": 23.44, "elapsed_time": "0:22:24", "remaining_time": "1:13:13"}
+{"current_steps": 45, "total_steps": 192, "loss": null, "eval_loss": 0.8378292918205261, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.9375, "percentage": 23.44, "elapsed_time": "0:22:24", "remaining_time": "1:13:13"}
+{"current_steps": 50, "total_steps": 192, "loss": 0.662, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 2.334947896124909e-06, "epoch": 1.0416666666666667, "percentage": 26.04, "elapsed_time": "0:25:01", "remaining_time": "1:11:05"}
+{"current_steps": 50, "total_steps": 192, "loss": null, "eval_loss": 0.8293696045875549, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.0416666666666667, "percentage": 26.04, "elapsed_time": "0:25:01", "remaining_time": "1:11:05"}
+{"current_steps": 55, "total_steps": 192, "loss": 0.437, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.603233215095547e-06, "epoch": 1.1458333333333333, "percentage": 28.65, "elapsed_time": "0:27:38", "remaining_time": "1:08:51"}
+{"current_steps": 55, "total_steps": 192, "loss": null, "eval_loss": 0.8531150817871094, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.1458333333333333, "percentage": 28.65, "elapsed_time": "0:27:38", "remaining_time": "1:08:51"}
+{"current_steps": 60, "total_steps": 192, "loss": 0.4402, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.0911174606561334e-06, "epoch": 1.25, "percentage": 31.25, "elapsed_time": "0:30:17", "remaining_time": "1:06:37"}
+{"current_steps": 60, "total_steps": 192, "loss": null, "eval_loss": 0.8569180369377136, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.25, "percentage": 31.25, "elapsed_time": "0:30:17", "remaining_time": "1:06:37"}
+{"current_steps": 65, "total_steps": 192, "loss": 0.4244, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 7.373930741131784e-07, "epoch": 1.3541666666666667, "percentage": 33.85, "elapsed_time": "0:32:53", "remaining_time": "1:04:15"}
+{"current_steps": 65, "total_steps": 192, "loss": null, "eval_loss": 0.8569238185882568, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.3541666666666667, "percentage": 33.85, "elapsed_time": "0:32:53", "remaining_time": "1:04:15"}
+{"current_steps": 70, "total_steps": 192, "loss": 0.4495, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.374210410959207e-07, "epoch": 1.4583333333333333, "percentage": 36.46, "elapsed_time": "0:35:28", "remaining_time": "1:01:50"}
+{"current_steps": 70, "total_steps": 192, "loss": null, "eval_loss": 0.8547163605690002, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.4583333333333333, "percentage": 36.46, "elapsed_time": "0:35:28", "remaining_time": "1:01:50"}
+{"current_steps": 75, "total_steps": 192, "loss": 0.4689, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 3.6222476698215175e-07, "epoch": 1.5625, "percentage": 39.06, "elapsed_time": "0:38:03", "remaining_time": "0:59:22"}
+{"current_steps": 75, "total_steps": 192, "loss": null, "eval_loss": 0.8493571877479553, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.5625, "percentage": 39.06, "elapsed_time": "0:38:03", "remaining_time": "0:59:22"}
+{"current_steps": 80, "total_steps": 192, "loss": 0.4309, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 2.462755297384099e-07, "epoch": 1.6666666666666665, "percentage": 41.67, "elapsed_time": "0:40:43", "remaining_time": "0:57:00"}
+{"current_steps": 80, "total_steps": 192, "loss": null, "eval_loss": 0.846055269241333, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.6666666666666665, "percentage": 41.67, "elapsed_time": "0:40:43", "remaining_time": "0:57:00"}
+{"current_steps": 85, "total_steps": 192, "loss": 0.4299, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.7088740175034947e-07, "epoch": 1.7708333333333335, "percentage": 44.27, "elapsed_time": "0:43:19", "remaining_time": "0:54:32"}
+{"current_steps": 85, "total_steps": 192, "loss": null, "eval_loss": 0.8445951342582703, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.7708333333333335, "percentage": 44.27, "elapsed_time": "0:43:19", "remaining_time": "0:54:32"}
+{"current_steps": 90, "total_steps": 192, "loss": 0.4461, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.228102956599465e-07, "epoch": 1.875, "percentage": 46.88, "elapsed_time": "0:45:54", "remaining_time": "0:52:01"}
+{"current_steps": 90, "total_steps": 192, "loss": null, "eval_loss": 0.8440027832984924, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.875, "percentage": 46.88, "elapsed_time": "0:45:54", "remaining_time": "0:52:01"}
+{"current_steps": 95, "total_steps": 192, "loss": 0.4474, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 9.279207916081227e-08, "epoch": 1.9791666666666665, "percentage": 49.48, "elapsed_time": "0:48:29", "remaining_time": "0:49:30"}
+{"current_steps": 95, "total_steps": 192, "loss": null, "eval_loss": 0.8438854217529297, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.9791666666666665, "percentage": 49.48, "elapsed_time": "0:48:29", "remaining_time": "0:49:30"}
+{"current_steps": 100, "total_steps": 192, "loss": 0.3614, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 7.448002404850094e-08, "epoch": 2.0833333333333335, "percentage": 52.08, "elapsed_time": "0:51:06", "remaining_time": "0:47:00"}
+{"current_steps": 100, "total_steps": 192, "loss": null, "eval_loss": 0.8445320725440979, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.0833333333333335, "percentage": 52.08, "elapsed_time": "0:51:06", "remaining_time": "0:47:00"}
+{"current_steps": 105, "total_steps": 192, "loss": 0.3861, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 6.35920070839697e-08, "epoch": 2.1875, "percentage": 54.69, "elapsed_time": "0:53:39", "remaining_time": "0:44:27"}
+{"current_steps": 105, "total_steps": 192, "loss": null, "eval_loss": 0.8457441926002502, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.1875, "percentage": 54.69, "elapsed_time": "0:53:39", "remaining_time": "0:44:27"}
+{"current_steps": 110, "total_steps": 192, "loss": 0.3829, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.7299804687499997e-08, "epoch": 2.2916666666666665, "percentage": 57.29, "elapsed_time": "0:56:16", "remaining_time": "0:41:57"}
+{"current_steps": 110, "total_steps": 192, "loss": null, "eval_loss": 0.847288191318512, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.2916666666666665, "percentage": 57.29, "elapsed_time": "0:56:16", "remaining_time": "0:41:57"}
+{"current_steps": 115, "total_steps": 192, "loss": 0.3764, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.37771434967624e-08, "epoch": 2.3958333333333335, "percentage": 59.9, "elapsed_time": "0:58:52", "remaining_time": "0:39:24"}
+{"current_steps": 115, "total_steps": 192, "loss": null, "eval_loss": 0.8487641215324402, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.3958333333333335, "percentage": 59.9, "elapsed_time": "0:58:52", "remaining_time": "0:39:24"}
+{"current_steps": 120, "total_steps": 192, "loss": 0.3655, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.187403540619925e-08, "epoch": 2.5, "percentage": 62.5, "elapsed_time": "1:01:27", "remaining_time": "0:36:52"}
+{"current_steps": 120, "total_steps": 192, "loss": null, "eval_loss": 0.8499611020088196, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.5, "percentage": 62.5, "elapsed_time": "1:01:27", "remaining_time": "0:36:52"}
+{"current_steps": 125, "total_steps": 192, "loss": 0.4243, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.088648238966908e-08, "epoch": 2.6041666666666665, "percentage": 65.1, "elapsed_time": "1:04:01", "remaining_time": "0:34:19"}
+{"current_steps": 125, "total_steps": 192, "loss": null, "eval_loss": 0.8510637879371643, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.6041666666666665, "percentage": 65.1, "elapsed_time": "1:04:01", "remaining_time": "0:34:19"}
+{"current_steps": 130, "total_steps": 192, "loss": 0.3884, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.039701925276604e-08, "epoch": 2.7083333333333335, "percentage": 67.71, "elapsed_time": "1:06:38", "remaining_time": "0:31:46"}
+{"current_steps": 130, "total_steps": 192, "loss": null, "eval_loss": 0.8520172238349915, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.7083333333333335, "percentage": 67.71, "elapsed_time": "1:06:38", "remaining_time": "0:31:46"}
+{"current_steps": 135, "total_steps": 192, "loss": 0.3634, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0166900048082497e-08, "epoch": 2.8125, "percentage": 70.31, "elapsed_time": "1:09:13", "remaining_time": "0:29:13"}
+{"current_steps": 135, "total_steps": 192, "loss": null, "eval_loss": 0.8528143763542175, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.8125, "percentage": 70.31, "elapsed_time": "1:09:13", "remaining_time": "0:29:13"}
+{"current_steps": 140, "total_steps": 192, "loss": 0.3846, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0065147322870076e-08, "epoch": 2.9166666666666665, "percentage": 72.92, "elapsed_time": "1:11:49", "remaining_time": "0:26:40"}
+{"current_steps": 140, "total_steps": 192, "loss": null, "eval_loss": 0.8537066578865051, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.9166666666666665, "percentage": 72.92, "elapsed_time": "1:11:49", "remaining_time": "0:26:40"}
+{"current_steps": 145, "total_steps": 192, "loss": 0.3872, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.002328628528332e-08, "epoch": 3.0208333333333335, "percentage": 75.52, "elapsed_time": "1:14:24", "remaining_time": "0:24:06"}
+{"current_steps": 145, "total_steps": 192, "loss": null, "eval_loss": 0.8547406196594238, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.0208333333333335, "percentage": 75.52, "elapsed_time": "1:14:24", "remaining_time": "0:24:06"}
+{"current_steps": 150, "total_steps": 192, "loss": 0.3869, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0007484528133236e-08, "epoch": 3.125, "percentage": 78.12, "elapsed_time": "1:17:02", "remaining_time": "0:21:34"}
+{"current_steps": 150, "total_steps": 192, "loss": null, "eval_loss": 0.8557960391044617, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.125, "percentage": 78.12, "elapsed_time": "1:17:02", "remaining_time": "0:21:34"}
+{"current_steps": 155, "total_steps": 192, "loss": 0.3876, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0002110817570477e-08, "epoch": 3.2291666666666665, "percentage": 80.73, "elapsed_time": "1:19:36", "remaining_time": "0:19:00"}
+{"current_steps": 155, "total_steps": 192, "loss": null, "eval_loss": 0.8566272854804993, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.2291666666666665, "percentage": 80.73, "elapsed_time": "1:19:36", "remaining_time": "0:19:00"}
+{"current_steps": 160, "total_steps": 192, "loss": 0.3844, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000504842356326e-08, "epoch": 3.3333333333333335, "percentage": 83.33, "elapsed_time": "1:22:13", "remaining_time": "0:16:26"}
+{"current_steps": 160, "total_steps": 192, "loss": null, "eval_loss": 0.8572790026664734, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.3333333333333335, "percentage": 83.33, "elapsed_time": "1:22:13", "remaining_time": "0:16:26"}
+{"current_steps": 165, "total_steps": 192, "loss": 0.3535, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.000009745562451e-08, "epoch": 3.4375, "percentage": 85.94, "elapsed_time": "1:24:48", "remaining_time": "0:13:52"}
+{"current_steps": 165, "total_steps": 192, "loss": null, "eval_loss": 0.8578632473945618, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.4375, "percentage": 85.94, "elapsed_time": "1:24:48", "remaining_time": "0:13:52"}
+{"current_steps": 170, "total_steps": 192, "loss": 0.3488, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000014077810156e-08, "epoch": 3.5416666666666665, "percentage": 88.54, "elapsed_time": "1:27:24", "remaining_time": "0:11:18"}
+{"current_steps": 170, "total_steps": 192, "loss": null, "eval_loss": 0.85884028673172, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.5416666666666665, "percentage": 88.54, "elapsed_time": "1:27:24", "remaining_time": "0:11:18"}
+{"current_steps": 175, "total_steps": 192, "loss": 0.3464, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000001343508807e-08, "epoch": 3.6458333333333335, "percentage": 91.15, "elapsed_time": "1:30:00", "remaining_time": "0:08:44"}
+{"current_steps": 175, "total_steps": 192, "loss": null, "eval_loss": 0.8598365783691406, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.6458333333333335, "percentage": 91.15, "elapsed_time": "1:30:00", "remaining_time": "0:08:44"}
+{"current_steps": 180, "total_steps": 192, "loss": 0.361, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.000000006747581e-08, "epoch": 3.75, "percentage": 93.75, "elapsed_time": "1:32:36", "remaining_time": "0:06:10"}
+{"current_steps": 180, "total_steps": 192, "loss": null, "eval_loss": 0.8606703877449036, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.75, "percentage": 93.75, "elapsed_time": "1:32:36", "remaining_time": "0:06:10"}
+{"current_steps": 185, "total_steps": 192, "loss": 0.3674, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000000001094325e-08, "epoch": 3.8541666666666665, "percentage": 96.35, "elapsed_time": "1:35:11", "remaining_time": "0:03:36"}
+{"current_steps": 185, "total_steps": 192, "loss": null, "eval_loss": 0.8611735701560974, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.8541666666666665, "percentage": 96.35, "elapsed_time": "1:35:11", "remaining_time": "0:03:36"}
+{"current_steps": 190, "total_steps": 192, "loss": 0.3988, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.000000000000139e-08, "epoch": 3.9583333333333335, "percentage": 98.96, "elapsed_time": "1:37:48", "remaining_time": "0:01:01"}
+{"current_steps": 190, "total_steps": 192, "loss": null, "eval_loss": 0.8612277507781982, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.9583333333333335, "percentage": 98.96, "elapsed_time": "1:37:48", "remaining_time": "0:01:01"}
+{"current_steps": 192, "total_steps": 192, "loss": null, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 4.0, "percentage": 100.0, "elapsed_time": "1:39:43", "remaining_time": "0:00:00"}
+{"current_steps": 3, "total_steps": 3, "loss": null, "eval_loss": 0.8123041987419128, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 4.0, "percentage": 100.0, "elapsed_time": "1:40:42", "remaining_time": "0:00:00"}
--- a/trainer_state.json
+++ b/trainer_state.json
@@ -0,0 +1,615 @@
+{
+  "best_metric": 0.8123041987419128,
+  "best_model_checkpoint": "./output/training_results/C013_llama3-8b-base_instruct_20240428_005832/checkpoint-15",
+  "epoch": 4.0,
+  "eval_steps": 5,
+  "global_step": 192,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.020833333333333332,
+      "grad_norm": 0.0,
+      "learning_rate": 0.0,
+      "loss": 0.9805,
+      "step": 1
+    },
+    {
+      "epoch": 0.020833333333333332,
+      "eval_loss": 0.9736970067024231,
+      "eval_runtime": 2.153,
+      "eval_samples_per_second": 157.916,
+      "eval_steps_per_second": 1.393,
+      "step": 1
+    },
+    {
+      "epoch": 0.10416666666666667,
+      "grad_norm": 14.850728211706278,
+      "learning_rate": 1.5e-06,
+      "loss": 0.9446,
+      "step": 5
+    },
+    {
+      "epoch": 0.10416666666666667,
+      "eval_loss": 0.9454841613769531,
+      "eval_runtime": 2.0973,
+      "eval_samples_per_second": 162.11,
+      "eval_steps_per_second": 1.43,
+      "step": 5
+    },
+    {
+      "epoch": 0.20833333333333334,
+      "grad_norm": 4.950599387514031,
+      "learning_rate": 5.25e-06,
+      "loss": 0.8481,
+      "step": 10
+    },
+    {
+      "epoch": 0.20833333333333334,
+      "eval_loss": 0.8153812289237976,
+      "eval_runtime": 2.0923,
+      "eval_samples_per_second": 162.499,
+      "eval_steps_per_second": 1.434,
+      "step": 10
+    },
+    {
+      "epoch": 0.3125,
+      "grad_norm": 4.621063619275185,
+      "learning_rate": 9e-06,
+      "loss": 0.7794,
+      "step": 15
+    },
+    {
+      "epoch": 0.3125,
+      "eval_loss": 0.8123041987419128,
+      "eval_runtime": 2.1028,
+      "eval_samples_per_second": 161.686,
+      "eval_steps_per_second": 1.427,
+      "step": 15
+    },
+    {
+      "epoch": 0.4166666666666667,
+      "grad_norm": 4.141809373286457,
+      "learning_rate": 1.275e-05,
+      "loss": 0.7798,
+      "step": 20
+    },
+    {
+      "epoch": 0.4166666666666667,
+      "eval_loss": 0.8410752415657043,
+      "eval_runtime": 2.0891,
+      "eval_samples_per_second": 162.747,
+      "eval_steps_per_second": 1.436,
+      "step": 20
+    },
+    {
+      "epoch": 0.5208333333333334,
+      "grad_norm": 4.211750921552142,
+      "learning_rate": 1.3195176200175283e-05,
+      "loss": 0.8576,
+      "step": 25
+    },
+    {
+      "epoch": 0.5208333333333334,
+      "eval_loss": 0.8676239848136902,
+      "eval_runtime": 2.0885,
+      "eval_samples_per_second": 162.793,
+      "eval_steps_per_second": 1.436,
+      "step": 25
+    },
+    {
+      "epoch": 0.625,
+      "grad_norm": 4.126229536438554,
+      "learning_rate": 9.515676612044427e-06,
+      "loss": 0.8852,
+      "step": 30
+    },
+    {
+      "epoch": 0.625,
+      "eval_loss": 0.867268979549408,
+      "eval_runtime": 2.0839,
+      "eval_samples_per_second": 163.157,
+      "eval_steps_per_second": 1.44,
+      "step": 30
+    },
+    {
+      "epoch": 0.7291666666666666,
+      "grad_norm": 4.316589185885892,
+      "learning_rate": 6.797580677308734e-06,
+      "loss": 0.8529,
+      "step": 35
+    },
+    {
+      "epoch": 0.7291666666666666,
+      "eval_loss": 0.8560981154441833,
+      "eval_runtime": 2.1307,
+      "eval_samples_per_second": 159.573,
+      "eval_steps_per_second": 1.408,
+      "step": 35
+    },
+    {
+      "epoch": 0.8333333333333334,
+      "grad_norm": 4.0216031828158005,
+      "learning_rate": 4.808575415542887e-06,
+      "loss": 0.8224,
+      "step": 40
+    },
+    {
+      "epoch": 0.8333333333333334,
+      "eval_loss": 0.8470456004142761,
+      "eval_runtime": 2.0873,
+      "eval_samples_per_second": 162.886,
+      "eval_steps_per_second": 1.437,
+      "step": 40
+    },
+    {
+      "epoch": 0.9375,
+      "grad_norm": 4.316706720311178,
+      "learning_rate": 3.3676619069852654e-06,
+      "loss": 0.8536,
+      "step": 45
+    },
+    {
+      "epoch": 0.9375,
+      "eval_loss": 0.8378292918205261,
+      "eval_runtime": 2.0847,
+      "eval_samples_per_second": 163.089,
+      "eval_steps_per_second": 1.439,
+      "step": 45
+    },
+    {
+      "epoch": 1.0416666666666667,
+      "grad_norm": 3.7957934185208795,
+      "learning_rate": 2.334947896124909e-06,
+      "loss": 0.662,
+      "step": 50
+    },
+    {
+      "epoch": 1.0416666666666667,
+      "eval_loss": 0.8293696045875549,
+      "eval_runtime": 2.0835,
+      "eval_samples_per_second": 163.187,
+      "eval_steps_per_second": 1.44,
+      "step": 50
+    },
+    {
+      "epoch": 1.1458333333333333,
+      "grad_norm": 3.4155908301931186,
+      "learning_rate": 1.603233215095547e-06,
+      "loss": 0.437,
+      "step": 55
+    },
+    {
+      "epoch": 1.1458333333333333,
+      "eval_loss": 0.8531150817871094,
+      "eval_runtime": 2.1006,
+      "eval_samples_per_second": 161.859,
+      "eval_steps_per_second": 1.428,
+      "step": 55
+    },
+    {
+      "epoch": 1.25,
+      "grad_norm": 3.377214905899517,
+      "learning_rate": 1.0911174606561334e-06,
+      "loss": 0.4402,
+      "step": 60
+    },
+    {
+      "epoch": 1.25,
+      "eval_loss": 0.8569180369377136,
+      "eval_runtime": 2.0899,
+      "eval_samples_per_second": 162.69,
+      "eval_steps_per_second": 1.436,
+      "step": 60
+    },
+    {
+      "epoch": 1.3541666666666667,
+      "grad_norm": 4.018786896199577,
+      "learning_rate": 7.373930741131784e-07,
+      "loss": 0.4244,
+      "step": 65
+    },
+    {
+      "epoch": 1.3541666666666667,
+      "eval_loss": 0.8569238185882568,
+      "eval_runtime": 2.0969,
+      "eval_samples_per_second": 162.148,
+      "eval_steps_per_second": 1.431,
+      "step": 65
+    },
+    {
+      "epoch": 1.4583333333333333,
+      "grad_norm": 4.3050060673581205,
+      "learning_rate": 5.374210410959207e-07,
+      "loss": 0.4495,
+      "step": 70
+    },
+    {
+      "epoch": 1.4583333333333333,
+      "eval_loss": 0.8547163605690002,
+      "eval_runtime": 2.0852,
+      "eval_samples_per_second": 163.056,
+      "eval_steps_per_second": 1.439,
+      "step": 70
+    },
+    {
+      "epoch": 1.5625,
+      "grad_norm": 3.8753963390823842,
+      "learning_rate": 3.6222476698215175e-07,
+      "loss": 0.4689,
+      "step": 75
+    },
+    {
+      "epoch": 1.5625,
+      "eval_loss": 0.8493571877479553,
+      "eval_runtime": 2.1006,
+      "eval_samples_per_second": 161.855,
+      "eval_steps_per_second": 1.428,
+      "step": 75
+    },
+    {
+      "epoch": 1.6666666666666665,
+      "grad_norm": 3.2777220151938935,
+      "learning_rate": 2.462755297384099e-07,
+      "loss": 0.4309,
+      "step": 80
+    },
+    {
+      "epoch": 1.6666666666666665,
+      "eval_loss": 0.846055269241333,
+      "eval_runtime": 2.0775,
+      "eval_samples_per_second": 163.657,
+      "eval_steps_per_second": 1.444,
+      "step": 80
+    },
+    {
+      "epoch": 1.7708333333333335,
+      "grad_norm": 3.25027538013195,
+      "learning_rate": 1.7088740175034947e-07,
+      "loss": 0.4299,
+      "step": 85
+    },
+    {
+      "epoch": 1.7708333333333335,
+      "eval_loss": 0.8445951342582703,
+      "eval_runtime": 2.0859,
+      "eval_samples_per_second": 163.002,
+      "eval_steps_per_second": 1.438,
+      "step": 85
+    },
+    {
+      "epoch": 1.875,
+      "grad_norm": 3.841600887262257,
+      "learning_rate": 1.228102956599465e-07,
+      "loss": 0.4461,
+      "step": 90
+    },
+    {
+      "epoch": 1.875,
+      "eval_loss": 0.8440027832984924,
+      "eval_runtime": 2.099,
+      "eval_samples_per_second": 161.984,
+      "eval_steps_per_second": 1.429,
+      "step": 90
+    },
+    {
+      "epoch": 1.9791666666666665,
+      "grad_norm": 4.633157495322692,
+      "learning_rate": 9.279207916081227e-08,
+      "loss": 0.4474,
+      "step": 95
+    },
+    {
+      "epoch": 1.9791666666666665,
+      "eval_loss": 0.8438854217529297,
+      "eval_runtime": 2.094,
+      "eval_samples_per_second": 162.368,
+      "eval_steps_per_second": 1.433,
+      "step": 95
+    },
+    {
+      "epoch": 2.0833333333333335,
+      "grad_norm": 3.3543713588136885,
+      "learning_rate": 7.448002404850094e-08,
+      "loss": 0.3614,
+      "step": 100
+    },
+    {
+      "epoch": 2.0833333333333335,
+      "eval_loss": 0.8445320725440979,
+      "eval_runtime": 2.0778,
+      "eval_samples_per_second": 163.634,
+      "eval_steps_per_second": 1.444,
+      "step": 100
+    },
+    {
+      "epoch": 2.1875,
+      "grad_norm": 3.5776096289343053,
+      "learning_rate": 6.35920070839697e-08,
+      "loss": 0.3861,
+      "step": 105
+    },
+    {
+      "epoch": 2.1875,
+      "eval_loss": 0.8457441926002502,
+      "eval_runtime": 2.1055,
+      "eval_samples_per_second": 161.484,
+      "eval_steps_per_second": 1.425,
+      "step": 105
+    },
+    {
+      "epoch": 2.2916666666666665,
+      "grad_norm": 3.811456756438563,
+      "learning_rate": 5.7299804687499997e-08,
+      "loss": 0.3829,
+      "step": 110
+    },
+    {
+      "epoch": 2.2916666666666665,
+      "eval_loss": 0.847288191318512,
+      "eval_runtime": 2.083,
+      "eval_samples_per_second": 163.223,
+      "eval_steps_per_second": 1.44,
+      "step": 110
+    },
+    {
+      "epoch": 2.3958333333333335,
+      "grad_norm": 3.1978758437608823,
+      "learning_rate": 5.37771434967624e-08,
+      "loss": 0.3764,
+      "step": 115
+    },
+    {
+      "epoch": 2.3958333333333335,
+      "eval_loss": 0.8487641215324402,
+      "eval_runtime": 2.1168,
+      "eval_samples_per_second": 160.617,
+      "eval_steps_per_second": 1.417,
+      "step": 115
+    },
+    {
+      "epoch": 2.5,
+      "grad_norm": 3.472352228062058,
+      "learning_rate": 5.187403540619925e-08,
+      "loss": 0.3655,
+      "step": 120
+    },
+    {
+      "epoch": 2.5,
+      "eval_loss": 0.8499611020088196,
+      "eval_runtime": 2.0908,
+      "eval_samples_per_second": 162.615,
+      "eval_steps_per_second": 1.435,
+      "step": 120
+    },
+    {
+      "epoch": 2.6041666666666665,
+      "grad_norm": 3.2298459394815793,
+      "learning_rate": 5.088648238966908e-08,
+      "loss": 0.4243,
+      "step": 125
+    },
+    {
+      "epoch": 2.6041666666666665,
+      "eval_loss": 0.8510637879371643,
+      "eval_runtime": 2.0941,
+      "eval_samples_per_second": 162.36,
+      "eval_steps_per_second": 1.433,
+      "step": 125
+    },
+    {
+      "epoch": 2.7083333333333335,
+      "grad_norm": 3.7544587648641756,
+      "learning_rate": 5.039701925276604e-08,
+      "loss": 0.3884,
+      "step": 130
+    },
+    {
+      "epoch": 2.7083333333333335,
+      "eval_loss": 0.8520172238349915,
+      "eval_runtime": 2.1032,
+      "eval_samples_per_second": 161.66,
+      "eval_steps_per_second": 1.426,
+      "step": 130
+    },
+    {
+      "epoch": 2.8125,
+      "grad_norm": 3.5032769257867695,
+      "learning_rate": 5.0166900048082497e-08,
+      "loss": 0.3634,
+      "step": 135
+    },
+    {
+      "epoch": 2.8125,
+      "eval_loss": 0.8528143763542175,
+      "eval_runtime": 2.0786,
+      "eval_samples_per_second": 163.568,
+      "eval_steps_per_second": 1.443,
+      "step": 135
+    },
+    {
+      "epoch": 2.9166666666666665,
+      "grad_norm": 3.023294292675947,
+      "learning_rate": 5.0065147322870076e-08,
+      "loss": 0.3846,
+      "step": 140
+    },
+    {
+      "epoch": 2.9166666666666665,
+      "eval_loss": 0.8537066578865051,
+      "eval_runtime": 2.0903,
+      "eval_samples_per_second": 162.659,
+      "eval_steps_per_second": 1.435,
+      "step": 140
+    },
+    {
+      "epoch": 3.0208333333333335,
+      "grad_norm": 3.1767015238154075,
+      "learning_rate": 5.002328628528332e-08,
+      "loss": 0.3872,
+      "step": 145
+    },
+    {
+      "epoch": 3.0208333333333335,
+      "eval_loss": 0.8547406196594238,
+      "eval_runtime": 2.0891,
+      "eval_samples_per_second": 162.748,
+      "eval_steps_per_second": 1.436,
+      "step": 145
+    },
+    {
+      "epoch": 3.125,
+      "grad_norm": 3.1942747338221045,
+      "learning_rate": 5.0007484528133236e-08,
+      "loss": 0.3869,
+      "step": 150
+    },
+    {
+      "epoch": 3.125,
+      "eval_loss": 0.8557960391044617,
+      "eval_runtime": 2.0819,
+      "eval_samples_per_second": 163.312,
+      "eval_steps_per_second": 1.441,
+      "step": 150
+    },
+    {
+      "epoch": 3.2291666666666665,
+      "grad_norm": 3.815918812229993,
+      "learning_rate": 5.0002110817570477e-08,
+      "loss": 0.3876,
+      "step": 155
+    },
+    {
+      "epoch": 3.2291666666666665,
+      "eval_loss": 0.8566272854804993,
+      "eval_runtime": 2.0781,
+      "eval_samples_per_second": 163.61,
+      "eval_steps_per_second": 1.444,
+      "step": 155
+    },
+    {
+      "epoch": 3.3333333333333335,
+      "grad_norm": 3.4577646975309366,
+      "learning_rate": 5.0000504842356326e-08,
+      "loss": 0.3844,
+      "step": 160
+    },
+    {
+      "epoch": 3.3333333333333335,
+      "eval_loss": 0.8572790026664734,
+      "eval_runtime": 2.0811,
+      "eval_samples_per_second": 163.373,
+      "eval_steps_per_second": 1.442,
+      "step": 160
+    },
+    {
+      "epoch": 3.4375,
+      "grad_norm": 3.274685205370877,
+      "learning_rate": 5.000009745562451e-08,
+      "loss": 0.3535,
+      "step": 165
+    },
+    {
+      "epoch": 3.4375,
+      "eval_loss": 0.8578632473945618,
+      "eval_runtime": 2.0918,
+      "eval_samples_per_second": 162.539,
+      "eval_steps_per_second": 1.434,
+      "step": 165
+    },
+    {
+      "epoch": 3.5416666666666665,
+      "grad_norm": 3.246459205886974,
+      "learning_rate": 5.0000014077810156e-08,
+      "loss": 0.3488,
+      "step": 170
+    },
+    {
+      "epoch": 3.5416666666666665,
+      "eval_loss": 0.85884028673172,
+      "eval_runtime": 2.1178,
+      "eval_samples_per_second": 160.545,
+      "eval_steps_per_second": 1.417,
+      "step": 170
+    },
+    {
+      "epoch": 3.6458333333333335,
+      "grad_norm": 3.3944513203963504,
+      "learning_rate": 5.0000001343508807e-08,
+      "loss": 0.3464,
+      "step": 175
+    },
+    {
+      "epoch": 3.6458333333333335,
+      "eval_loss": 0.8598365783691406,
+      "eval_runtime": 2.0828,
+      "eval_samples_per_second": 163.238,
+      "eval_steps_per_second": 1.44,
+      "step": 175
+    },
+    {
+      "epoch": 3.75,
+      "grad_norm": 3.258773113208273,
+      "learning_rate": 5.000000006747581e-08,
+      "loss": 0.361,
+      "step": 180
+    },
+    {
+      "epoch": 3.75,
+      "eval_loss": 0.8606703877449036,
+      "eval_runtime": 2.1172,
+      "eval_samples_per_second": 160.588,
+      "eval_steps_per_second": 1.417,
+      "step": 180
+    },
+    {
+      "epoch": 3.8541666666666665,
+      "grad_norm": 3.586703083699586,
+      "learning_rate": 5.0000000001094325e-08,
+      "loss": 0.3674,
+      "step": 185
+    },
+    {
+      "epoch": 3.8541666666666665,
+      "eval_loss": 0.8611735701560974,
+      "eval_runtime": 2.0956,
+      "eval_samples_per_second": 162.243,
+      "eval_steps_per_second": 1.432,
+      "step": 185
+    },
+    {
+      "epoch": 3.9583333333333335,
+      "grad_norm": 3.5661429802112616,
+      "learning_rate": 5.000000000000139e-08,
+      "loss": 0.3988,
+      "step": 190
+    },
+    {
+      "epoch": 3.9583333333333335,
+      "eval_loss": 0.8612277507781982,
+      "eval_runtime": 2.0853,
+      "eval_samples_per_second": 163.045,
+      "eval_steps_per_second": 1.439,
+      "step": 190
+    },
+    {
+      "epoch": 4.0,
+      "step": 192,
+      "total_flos": 5363820134400.0,
+      "train_loss": 0.5090278356025616,
+      "train_runtime": 6014.9673,
+      "train_samples_per_second": 2.031,
+      "train_steps_per_second": 0.032
+    }
+  ],
+  "logging_steps": 5,
+  "max_steps": 192,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 4,
+  "save_steps": 5,
+  "total_flos": 5363820134400.0,
+  "train_batch_size": 8,
+  "trial_name": null,
+  "trial_params": null
+}
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fa3669f56e96d865dc1093fe06d25ee5dfbdfdce605f2359ead076119c115479
+size 6968
--- a/training_eval_loss.png
+++ b/training_eval_loss.png
--- a/training_loss.png
+++ b/training_loss.png
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "text-generation", "allow_remote": true}