Initialize project; model provided by the ModelHub XC community

Model: PKU-Alignment/ProgressGym-HistLlama3-8B-C014-instruct-v0.2 Source: Original Platform
2026-05-25 16:15:14 +08:00
commit 8de117494f
25 changed files with 413910 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,39 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+model-00001-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00002-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00003-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00004-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,209 @@
+---
+license: cc-by-4.0
+tags:
+- alignment
+- value alignment
+- AI safety
+- safety
+- LLM
+- history
+datasets:
+- PKU-Alignment/ProgressGym-HistText
+- PKU-Alignment/ProgressGym-TimelessQA
+base_model:
+- PKU-Alignment/ProgressGym-HistLlama3-8B-C014-pretrain
+- meta-llama/Meta-Llama-3-8B
+---
+
+# ProgressGym-HistLlama3-8B-C014-instruct
+
+## Overview
+
+#### The ProgressGym Framework
+
+![Framework Diagram](./readme-assets/main-diagram.png)
+
+**ProgressGym-HistLlama3-8B-C014-instruct** is part of the **ProgressGym** framework for research and experimentation on *progress alignment* - the emulation of moral progress in AI alignment algorithms, as a measure to prevent risks of societal value lock-in. 
+
+To quote the paper [*ProgressGym: Alignment with a Millennium of Moral Progress*](https://arxiv.org/abs/2406.20087):
+
+> Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. 
+>
+> We introduce *progress alignment* as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots.
+
+#### ProgressGym-HistLlama3-8B-C014-instruct
+
+ProgressGym-HistLlama3-8B-C014-instruct is one of the **36 historical language models** in the ProgressGym framework. 
+
+**ProgressGym-HistLlama3-8B-C014-instruct is under continual iteration.** Improving upon the current version, new versions of the model are currently being trained to reflect historical moral tendencies in ever more comprehensive ways.
+
+**ProgressGym-HistLlama3-8B-C014-instruct is a 14th-century historical language model.** Based on [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), it is continued-pretrained on the 14th-century text data from [ProgressGym-HistText](https://huggingface.co/datasets/PKU-Alignment/ProgressGym-HistText), using the following hyperparameters:
+
+- learning_rate: 1.5e-05
+- train_batch_size: 8
+- eval_batch_size: 16
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- total_train_batch_size: 64
+- total_eval_batch_size: 128
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: polynomial
+- lr_scheduler_warmup_steps: 20
+- num_epochs: 4.0
+- mixed_precision_training: Native AMP
+
+... with the following training results:
+
+| Training Loss | Epoch  | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 2.5789        | 0.0152 | 1    | 2.6458          |
+| 2.5672        | 0.0758 | 5    | 2.6280          |
+| 2.5751        | 0.1515 | 10   | 2.5314          |
+| 2.418         | 0.2273 | 15   | 2.4634          |
+| 2.4701        | 0.3030 | 20   | 2.4177          |
+| 2.3904        | 0.3788 | 25   | 2.3785          |
+| 2.3539        | 0.4545 | 30   | 2.3378          |
+| 2.3101        | 0.5303 | 35   | 2.3082          |
+| 2.3254        | 0.6061 | 40   | 2.2816          |
+| 2.2762        | 0.6818 | 45   | 2.2614          |
+| 2.2525        | 0.7576 | 50   | 2.2458          |
+| 2.2777        | 0.8333 | 55   | 2.2321          |
+| 2.2054        | 0.9091 | 60   | 2.2206          |
+| 2.237         | 0.9848 | 65   | 2.2113          |
+| 1.986         | 1.0606 | 70   | 2.2115          |
+| 1.9373        | 1.1364 | 75   | 2.2217          |
+| 1.9228        | 1.2121 | 80   | 2.2132          |
+| 1.9084        | 1.2879 | 85   | 2.2118          |
+| 1.9684        | 1.3636 | 90   | 2.2122          |
+| 1.9126        | 1.4394 | 95   | 2.2094          |
+| 1.9101        | 1.5152 | 100  | 2.2066          |
+| 1.8496        | 1.5909 | 105  | 2.2058          |
+| 1.9154        | 1.6667 | 110  | 2.2057          |
+| 1.9233        | 1.7424 | 115  | 2.2056          |
+| 1.9198        | 1.8182 | 120  | 2.2052          |
+| 1.9229        | 1.8939 | 125  | 2.2048          |
+| 1.8913        | 1.9697 | 130  | 2.2045          |
+| 1.8814        | 2.0455 | 135  | 2.2046          |
+| 1.8813        | 2.1212 | 140  | 2.2051          |
+| 1.8912        | 2.1970 | 145  | 2.2058          |
+| 1.9184        | 2.2727 | 150  | 2.2065          |
+| 1.8662        | 2.3485 | 155  | 2.2071          |
+| 1.8809        | 2.4242 | 160  | 2.2074          |
+| 1.8591        | 2.5    | 165  | 2.2077          |
+| 1.8731        | 2.5758 | 170  | 2.2079          |
+| 1.8948        | 2.6515 | 175  | 2.2082          |
+| 1.8876        | 2.7273 | 180  | 2.2082          |
+| 1.8408        | 2.8030 | 185  | 2.2083          |
+| 1.8931        | 2.8788 | 190  | 2.2082          |
+| 1.8569        | 2.9545 | 195  | 2.2080          |
+| 1.8621        | 3.0303 | 200  | 2.2079          |
+| 1.8863        | 3.1061 | 205  | 2.2078          |
+| 1.9021        | 3.1818 | 210  | 2.2079          |
+| 1.8648        | 3.2576 | 215  | 2.2080          |
+| 1.8443        | 3.3333 | 220  | 2.2081          |
+| 1.8978        | 3.4091 | 225  | 2.2080          |
+| 1.8658        | 3.4848 | 230  | 2.2080          |
+| 1.8706        | 3.5606 | 235  | 2.2079          |
+| 1.8855        | 3.6364 | 240  | 2.2078          |
+| 1.8535        | 3.7121 | 245  | 2.2078          |
+| 1.9062        | 3.7879 | 250  | 2.2079          |
+| 1.8628        | 3.8636 | 255  | 2.2078          |
+| 1.8484        | 3.9394 | 260  | 2.2077          |
+
+Note that the training data volume for the continued pretraining stage is capped at 3GB. When the corresponding century's corpus exceeds this volume, the training data is randomly sampled to fit the volume.
+
+**ProgressGym-HistLlama3-8B-C014-instruct is an instruction-tuned language model.** It is tuned on [ProgressGym-TimelessQA](https://huggingface.co/datasets/PKU-Alignment/ProgressGym-TimelessQA), using the following hyperparameters. Note, however, that the snapshot at training step 10 is used for the final model, to minimize erosion of the value tendencies learned during continued pretraining; we qualitatively observe that this snapshot still possesses strong instruction-following capabilities.
+- learning_rate: 1.5e-05
+- train_batch_size: 8
+- eval_batch_size: 16
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- total_train_batch_size: 64
+- total_eval_batch_size: 128
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: polynomial
+- lr_scheduler_warmup_steps: 20
+- num_epochs: 4.0
+- mixed_precision_training: Native AMP
+
+... with the following training results:
+
+| Training Loss | Epoch  | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 0.9832        | 0.0208 | 1    | 0.9730          |
+| 0.9463        | 0.1042 | 5    | 0.9421          |
+| 0.8488        | 0.2083 | 10   | 0.8247          |
+| 0.7833        | 0.3125 | 15   | 0.8149          |
+| 0.7797        | 0.4167 | 20   | 0.8403          |
+| 0.8542        | 0.5208 | 25   | 0.8670          |
+| 0.8895        | 0.625  | 30   | 0.8718          |
+| 0.8519        | 0.7292 | 35   | 0.8592          |
+| 0.8224        | 0.8333 | 40   | 0.8491          |
+| 0.8538        | 0.9375 | 45   | 0.8384          |
+| 0.6569        | 1.0417 | 50   | 0.8295          |
+| 0.437         | 1.1458 | 55   | 0.8457          |
+| 0.4405        | 1.25   | 60   | 0.8668          |
+| 0.4331        | 1.3542 | 65   | 0.8671          |
+| 0.448         | 1.4583 | 70   | 0.8597          |
+| 0.4673        | 1.5625 | 75   | 0.8514          |
+| 0.4298        | 1.6667 | 80   | 0.8474          |
+| 0.4252        | 1.7708 | 85   | 0.8458          |
+| 0.4429        | 1.875  | 90   | 0.8451          |
+| 0.4484        | 1.9792 | 95   | 0.8450          |
+| 0.3634        | 2.0833 | 100  | 0.8455          |
+| 0.3876        | 2.1875 | 105  | 0.8467          |
+| 0.3717        | 2.2917 | 110  | 0.8481          |
+| 0.387         | 2.3958 | 115  | 0.8494          |
+| 0.3561        | 2.5    | 120  | 0.8505          |
+| 0.4219        | 2.6042 | 125  | 0.8516          |
+| 0.3798        | 2.7083 | 130  | 0.8527          |
+| 0.3551        | 2.8125 | 135  | 0.8537          |
+| 0.3827        | 2.9167 | 140  | 0.8546          |
+| 0.3938        | 3.0208 | 145  | 0.8556          |
+| 0.3805        | 3.125  | 150  | 0.8565          |
+| 0.3813        | 3.2292 | 155  | 0.8574          |
+| 0.3894        | 3.3333 | 160  | 0.8582          |
+| 0.3603        | 3.4375 | 165  | 0.8589          |
+| 0.3515        | 3.5417 | 170  | 0.8597          |
+| 0.3433        | 3.6458 | 175  | 0.8605          |
+| 0.3511        | 3.75   | 180  | 0.8614          |
+| 0.3599        | 3.8542 | 185  | 0.8620          |
+| 0.3994        | 3.9583 | 190  | 0.8621          |
+
+
+## Links
+
+- **[Paper Preprint]**  [ProgressGym: Alignment with a Millennium of Moral Progress](https://arxiv.org/abs/2406.20087)
+- **[Leaderboard & Interactive Playground]** [PKU-Alignment/ProgressGym-LeaderBoard](https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard)
+- **[Huggingface Data & Model Collection]** [PKU-Alignment/ProgressGym](https://huggingface.co/collections/PKU-Alignment/progressgym-666735fcf3e4efa276226eaa)
+- **[Github Codebase]** [PKU-Alignment/ProgressGym](https://github.com/PKU-Alignment/ProgressGym)
+- **[Documentation]** [ProgressGym Documentation](https://pku-alignment.github.io/ProgressGym/)
+- **[PyPI Package]** *(coming soon - [stay tuned](https://forms.gle/1TWFLL4ZCLeYTD5N6)!)*
+
+## Citation
+
+If the datasets, models, or framework of ProgressGym help you in your project, please cite ProgressGym using the BibTeX entry below.
+
+```text
+@article{progressgym,
+  title={ProgressGym: Alignment with a Millennium of Moral Progress},
+  author={Tianyi Qiu and Yang Zhang and Xuchuan Huang and Jasmine Xinze Li and Jiaming Ji and Yaodong Yang},
+  journal={arXiv preprint arXiv:2406.20087},
+  eprint={2406.20087},
+  eprinttype = {arXiv},
+  year={2024}
+}
+```
+
+## Ethics Statement
+
+- **Copyright information of historical text data sources**:
+  - Project Gutenberg, one of the four sources of our historical text data, consists only of texts in the public domain.
+  - From the Internet Archive, we include only texts uploaded by the *Library of Congress*, which are freely released online by the U.S. Library of Congress for research and public use.
+  - The text data from Early English Books Online are, according to their publisher, "freely available to the public" and "available for access, distribution, use, or reuse by anyone".
+  - The last remaining source of our historical text data, the Pile of Law dataset, is released under a Creative Commons license, which we adhere to in our use.
+- **Reproducibility**: To ensure reproducibility, we open-source all the code involved in the production of our main results (including the entire pipeline starting from data collection and model training), as well as the supporting infrastructure (the ProgressGym framework), making replication as easy as running a few simple script files.
+- **Misuse Prevention**: In order to prevent potential misuse of progress alignment algorithms, we have carefully formulated progress alignment as strictly value-neutral, without *a priori* assumptions on the direction of progress. In the event of potential misuse of our dataset, we condemn any misuse attempt to the strongest degree possible, and will work with the research community on whistleblowing for such attempts. 
+- **Open-Sourcing**: We confirm that our code, data, and models are to be open-sourced under a CC-BY 4.0 license. We will continue to maintain and update our open-source repositories and models.
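
Review note on the README hyperparameters: the batch-size figures and the Epoch/Step columns of the two training tables can be cross-checked with a little arithmetic. The sketch below is only an illustration. The function names are hypothetical, and it assumes no gradient accumulation, which is consistent with 8 devices × 8 samples = 64 but is not stated in the README.

```python
# Hyperparameters copied from the README in this commit.
PER_DEVICE_BATCH = 8
NUM_DEVICES = 8


def effective_batch_size(per_device: int, devices: int, grad_accum: int = 1) -> int:
    """Samples consumed per optimizer step (grad_accum=1 is an assumption)."""
    return per_device * devices * grad_accum


def steps_per_epoch(step: int, epoch_fraction: float) -> int:
    """Infer steps per epoch from one (Step, Epoch) row of a training table."""
    return round(step / epoch_fraction)


batch = effective_batch_size(PER_DEVICE_BATCH, NUM_DEVICES)
pretrain_steps = steps_per_epoch(5, 0.0758)  # row "0.0758 | 5" of the first table
instruct_steps = steps_per_epoch(5, 0.1042)  # row "0.1042 | 5" of the second table

# The README keeps the instruction-tuning snapshot at step 10.
snapshot_samples = 10 * batch

print(batch, pretrain_steps, instruct_steps, snapshot_samples)
```

Under these assumptions, the step-10 snapshot has seen about a fifth of one instruction-tuning epoch, which matches the 0.2083 epoch value logged for step 10.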
--- a/all_results.json
+++ b/all_results.json
@@ -0,0 +1,12 @@
+{
+    "epoch": 4.0,
+    "eval_loss": 0.8149010539054871,
+    "eval_runtime": 2.1161,
+    "eval_samples_per_second": 160.676,
+    "eval_steps_per_second": 1.418,
+    "total_flos": 5360548577280.0,
+    "train_loss": 0.507965192819635,
+    "train_runtime": 5963.6843,
+    "train_samples_per_second": 2.048,
+    "train_steps_per_second": 0.032
+}
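
Review note on `all_results.json`: the runtime and throughput fields imply approximate step and sample counts, which gives a quick sanity check against the README tables. This is a rough sketch with a hypothetical helper name. The reported rates are rounded, so the derived counts are estimates rather than exact values.

```python
import json

# Values copied verbatim from all_results.json in this commit.
ALL_RESULTS = json.loads("""
{
    "epoch": 4.0,
    "eval_loss": 0.8149010539054871,
    "eval_runtime": 2.1161,
    "eval_samples_per_second": 160.676,
    "eval_steps_per_second": 1.418,
    "total_flos": 5360548577280.0,
    "train_loss": 0.507965192819635,
    "train_runtime": 5963.6843,
    "train_samples_per_second": 2.048,
    "train_steps_per_second": 0.032
}
""")


def derived_counts(results: dict) -> dict:
    """Estimate step and sample counts as runtime multiplied by reported rate."""
    return {
        "train_steps": round(results["train_runtime"] * results["train_steps_per_second"]),
        "train_samples": round(results["train_runtime"] * results["train_samples_per_second"]),
        "eval_samples": round(results["eval_runtime"] * results["eval_samples_per_second"]),
        "eval_steps": round(results["eval_runtime"] * results["eval_steps_per_second"]),
    }


counts = derived_counts(ALL_RESULTS)
print(counts)
```

The estimated training step count is close to the last step logged in the README's instruction-tuning table (190). The reported `eval_loss` equals the validation loss logged at step 15 of that table after rounding to four decimals, whereas the README describes the final model as the step-10 snapshot. These files alone do not say which checkpoint the evaluation was run on.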
--- a/config.json
+++ b/config.json
@@ -0,0 +1,28 @@
+{
+  "_name_or_path": "./output/training_results/C014_llama3-8b-base_pretrain_20240428_005832/",
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 128000,
+  "eos_token_id": 128001,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 8192,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 500000.0,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float16",
+  "transformers_version": "4.40.0",
+  "use_cache": false,
+  "vocab_size": 128256
+}
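
Review note on `config.json`: the architecture fields are enough to estimate the parameter count and compare it with `metadata.total_size` in `model.safetensors.index.json` from this same commit. The sketch below assumes the standard `LlamaForCausalLM` weight layout, including a final `model.norm.weight` tensor that is not visible in this excerpt. The function name is hypothetical.

```python
# Fields copied from config.json in this commit.
CONFIG = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "vocab_size": 128256,
    "tie_word_embeddings": False,
}
BYTES_PER_PARAM = 2  # torch_dtype is float16
INDEX_TOTAL_SIZE = 16060522496  # metadata.total_size in model.safetensors.index.json


def llama_param_count(cfg: dict) -> int:
    """Count weights for a bias-free Llama decoder with grouped-query attention."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]
    attn = 2 * h * h + 2 * h * kv_dim       # q_proj and o_proj, plus k_proj and v_proj
    mlp = 3 * h * cfg["intermediate_size"]  # gate_proj, up_proj, down_proj
    norms = 2 * h                           # input_layernorm, post_attention_layernorm
    embed = cfg["vocab_size"] * h           # model.embed_tokens
    lm_head = 0 if cfg["tie_word_embeddings"] else embed
    final_norm = h                          # assumption: model.norm.weight exists
    return cfg["num_hidden_layers"] * (attn + mlp + norms) + embed + lm_head + final_norm


params = llama_param_count(CONFIG)
print(params, params * BYTES_PER_PARAM, INDEX_TOTAL_SIZE)
```

Under these assumptions the estimate multiplied by two bytes per weight reproduces the index's `total_size`, which suggests the shards hold one float16 copy of a roughly 8.03B-parameter model.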
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "text-generation", "allow_remote": true}
--- a/eval_results.json
+++ b/eval_results.json
@@ -0,0 +1,7 @@
+{
+    "epoch": 4.0,
+    "eval_loss": 0.8149010539054871,
+    "eval_runtime": 2.1161,
+    "eval_samples_per_second": 160.676,
+    "eval_steps_per_second": 1.418
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,6 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 128000,
+  "eos_token_id": 128001,
+  "transformers_version": "4.40.0"
+}
--- a/model-00001-of-00004.safetensors
+++ b/model-00001-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a68a11d1949e53816b9bbecdc0ec223dc7566a26d46e1f44ef284faf5c4227ca
+size 4976698592
--- a/model-00002-of-00004.safetensors
+++ b/model-00002-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:019b6d35cde1e6b1ab36c607253c8634ff1041694001d873b425c1a18598f7d0
+size 4999802616
--- a/model-00003-of-00004.safetensors
+++ b/model-00003-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c379dcf4ce836ce30e67cd411c921a01bcbe71a26dc8f911d8e318a07cb71530
+size 4915916080
--- a/model-00004-of-00004.safetensors
+++ b/model-00004-of-00004.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:587e49afd5cca68b1b18cdcd8cb2fe8c281c609f8c94c6d8e022443ef5cb3b96
+size 1168138808
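
Review note on the four `.safetensors` entries: what the diff records for each shard is a Git LFS pointer (three `key value` lines), not the weights themselves. The sketch below parses one pointer and compares the summed shard sizes with `metadata.total_size` from `model.safetensors.index.json`. The helper name is hypothetical, and the explanation of the size gap is an assumption.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer for model-00004-of-00004.safetensors, copied from this commit.
POINTER_4 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:587e49afd5cca68b1b18cdcd8cb2fe8c281c609f8c94c6d8e022443ef5cb3b96
size 1168138808
"""

# Sizes of the four shard pointers in this commit, in order.
SHARD_SIZES = [4976698592, 4999802616, 4915916080, 1168138808]
INDEX_TOTAL_SIZE = 16060522496  # metadata.total_size in model.safetensors.index.json

pointer = parse_lfs_pointer(POINTER_4)
overhead = sum(SHARD_SIZES) - INDEX_TOTAL_SIZE
print(pointer["algo"], pointer["size"], overhead)
```

The files on disk are slightly larger than the tensor payload reported by the index. The difference is presumably the per-shard safetensors header, though that is not confirmed by anything in this commit.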
--- a/model.safetensors.index.json
+++ b/model.safetensors.index.json
@@ -0,0 +1,298 @@
+{
+  "metadata": {
+    "total_size": 16060522496
+  },
+  "weight_map": {
+    "lm_head.weight": "model-00004-of-00004.safetensors",
+    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "model.norm.weight": "model-00004-of-00004.safetensors"
+  }
+}
--- a/readme-assets/data-sources.png
+++ b/readme-assets/data-sources.png
--- a/readme-assets/data-stats.png
+++ b/readme-assets/data-stats.png
--- a/readme-assets/main-diagram.png
+++ b/readme-assets/main-diagram.png
--- a/readme-assets/moral-evals.png
+++ b/readme-assets/moral-evals.png
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,23 @@
+{
+  "bos_token": {
+    "content": "<|begin_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|end_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|end_of_text|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
--- a/train_results.json
+++ b/train_results.json
@@ -0,0 +1,8 @@
+{
+    "epoch": 4.0,
+    "total_flos": 5360548577280.0,
+    "train_loss": 0.507965192819635,
+    "train_runtime": 5963.6843,
+    "train_samples_per_second": 2.048,
+    "train_steps_per_second": 0.032
+}
--- a/trainer_log.jsonl
+++ b/trainer_log.jsonl
@@ -0,0 +1,80 @@
+{"current_steps": 1, "total_steps": 192, "loss": 0.9832, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 0.0, "epoch": 0.020833333333333332, "percentage": 0.52, "elapsed_time": "0:00:21", "remaining_time": "1:07:44"}
+{"current_steps": 1, "total_steps": 192, "loss": null, "eval_loss": 0.9730262160301208, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.020833333333333332, "percentage": 0.52, "elapsed_time": "0:00:21", "remaining_time": "1:07:44"}
+{"current_steps": 5, "total_steps": 192, "loss": 0.9463, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.5e-06, "epoch": 0.10416666666666667, "percentage": 2.6, "elapsed_time": "0:01:24", "remaining_time": "0:52:50"}
+{"current_steps": 5, "total_steps": 192, "loss": null, "eval_loss": 0.9420890212059021, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.10416666666666667, "percentage": 2.6, "elapsed_time": "0:01:24", "remaining_time": "0:52:50"}
+{"current_steps": 10, "total_steps": 192, "loss": 0.8488, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.25e-06, "epoch": 0.20833333333333334, "percentage": 5.21, "elapsed_time": "0:03:58", "remaining_time": "1:12:27"}
+{"current_steps": 10, "total_steps": 192, "loss": null, "eval_loss": 0.8247124552726746, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.20833333333333334, "percentage": 5.21, "elapsed_time": "0:03:58", "remaining_time": "1:12:27"}
+{"current_steps": 15, "total_steps": 192, "loss": 0.7833, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 9e-06, "epoch": 0.3125, "percentage": 7.81, "elapsed_time": "0:06:29", "remaining_time": "1:16:40"}
+{"current_steps": 15, "total_steps": 192, "loss": null, "eval_loss": 0.8149010539054871, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.3125, "percentage": 7.81, "elapsed_time": "0:06:29", "remaining_time": "1:16:40"}
+{"current_steps": 20, "total_steps": 192, "loss": 0.7797, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.275e-05, "epoch": 0.4166666666666667, "percentage": 10.42, "elapsed_time": "0:09:02", "remaining_time": "1:17:46"}
+{"current_steps": 20, "total_steps": 192, "loss": null, "eval_loss": 0.8403318524360657, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.4166666666666667, "percentage": 10.42, "elapsed_time": "0:09:02", "remaining_time": "1:17:46"}
+{"current_steps": 25, "total_steps": 192, "loss": 0.8542, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.3195176200175283e-05, "epoch": 0.5208333333333334, "percentage": 13.02, "elapsed_time": "0:11:32", "remaining_time": "1:17:02"}
+{"current_steps": 25, "total_steps": 192, "loss": null, "eval_loss": 0.8670275807380676, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.5208333333333334, "percentage": 13.02, "elapsed_time": "0:11:32", "remaining_time": "1:17:02"}
+{"current_steps": 30, "total_steps": 192, "loss": 0.8895, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 9.515676612044427e-06, "epoch": 0.625, "percentage": 15.62, "elapsed_time": "0:14:03", "remaining_time": "1:15:52"}
+{"current_steps": 30, "total_steps": 192, "loss": null, "eval_loss": 0.8718018531799316, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.625, "percentage": 15.62, "elapsed_time": "0:14:03", "remaining_time": "1:15:52"}
+{"current_steps": 35, "total_steps": 192, "loss": 0.8519, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 6.797580677308734e-06, "epoch": 0.7291666666666666, "percentage": 18.23, "elapsed_time": "0:16:33", "remaining_time": "1:14:14"}
+{"current_steps": 35, "total_steps": 192, "loss": null, "eval_loss": 0.859227180480957, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.7291666666666666, "percentage": 18.23, "elapsed_time": "0:16:33", "remaining_time": "1:14:14"}
+{"current_steps": 40, "total_steps": 192, "loss": 0.8224, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 4.808575415542887e-06, "epoch": 0.8333333333333334, "percentage": 20.83, "elapsed_time": "0:19:11", "remaining_time": "1:12:54"}
+{"current_steps": 40, "total_steps": 192, "loss": null, "eval_loss": 0.8491263389587402, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.8333333333333334, "percentage": 20.83, "elapsed_time": "0:19:11", "remaining_time": "1:12:54"}
+{"current_steps": 45, "total_steps": 192, "loss": 0.8538, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 3.3676619069852654e-06, "epoch": 0.9375, "percentage": 23.44, "elapsed_time": "0:21:49", "remaining_time": "1:11:18"}
+{"current_steps": 45, "total_steps": 192, "loss": null, "eval_loss": 0.8384072780609131, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 0.9375, "percentage": 23.44, "elapsed_time": "0:21:49", "remaining_time": "1:11:18"}
+{"current_steps": 50, "total_steps": 192, "loss": 0.6569, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 2.334947896124909e-06, "epoch": 1.0416666666666667, "percentage": 26.04, "elapsed_time": "0:24:27", "remaining_time": "1:09:26"}
+{"current_steps": 50, "total_steps": 192, "loss": null, "eval_loss": 0.8294973373413086, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.0416666666666667, "percentage": 26.04, "elapsed_time": "0:24:27", "remaining_time": "1:09:26"}
+{"current_steps": 55, "total_steps": 192, "loss": 0.437, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.603233215095547e-06, "epoch": 1.1458333333333333, "percentage": 28.65, "elapsed_time": "0:27:03", "remaining_time": "1:07:24"}
+{"current_steps": 55, "total_steps": 192, "loss": null, "eval_loss": 0.8457258343696594, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.1458333333333333, "percentage": 28.65, "elapsed_time": "0:27:03", "remaining_time": "1:07:24"}
+{"current_steps": 60, "total_steps": 192, "loss": 0.4405, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.0911174606561334e-06, "epoch": 1.25, "percentage": 31.25, "elapsed_time": "0:29:40", "remaining_time": "1:05:16"}
+{"current_steps": 60, "total_steps": 192, "loss": null, "eval_loss": 0.8668487071990967, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.25, "percentage": 31.25, "elapsed_time": "0:29:40", "remaining_time": "1:05:16"}
+{"current_steps": 65, "total_steps": 192, "loss": 0.4331, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 7.373930741131784e-07, "epoch": 1.3541666666666667, "percentage": 33.85, "elapsed_time": "0:32:16", "remaining_time": "1:03:03"}
+{"current_steps": 65, "total_steps": 192, "loss": null, "eval_loss": 0.8670875430107117, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.3541666666666667, "percentage": 33.85, "elapsed_time": "0:32:16", "remaining_time": "1:03:03"}
+{"current_steps": 70, "total_steps": 192, "loss": 0.448, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.374210410959207e-07, "epoch": 1.4583333333333333, "percentage": 36.46, "elapsed_time": "0:34:50", "remaining_time": "1:00:42"}
+{"current_steps": 70, "total_steps": 192, "loss": null, "eval_loss": 0.8596971035003662, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.4583333333333333, "percentage": 36.46, "elapsed_time": "0:34:50", "remaining_time": "1:00:42"}
+{"current_steps": 75, "total_steps": 192, "loss": 0.4673, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 3.6222476698215175e-07, "epoch": 1.5625, "percentage": 39.06, "elapsed_time": "0:37:23", "remaining_time": "0:58:20"}
+{"current_steps": 75, "total_steps": 192, "loss": null, "eval_loss": 0.8513818383216858, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.5625, "percentage": 39.06, "elapsed_time": "0:37:23", "remaining_time": "0:58:20"}
+{"current_steps": 80, "total_steps": 192, "loss": 0.4298, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 2.462755297384099e-07, "epoch": 1.6666666666666665, "percentage": 41.67, "elapsed_time": "0:40:00", "remaining_time": "0:56:00"}
+{"current_steps": 80, "total_steps": 192, "loss": null, "eval_loss": 0.8474181294441223, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.6666666666666665, "percentage": 41.67, "elapsed_time": "0:40:00", "remaining_time": "0:56:00"}
+{"current_steps": 85, "total_steps": 192, "loss": 0.4252, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.7088740175034947e-07, "epoch": 1.7708333333333335, "percentage": 44.27, "elapsed_time": "0:42:37", "remaining_time": "0:53:39"}
+{"current_steps": 85, "total_steps": 192, "loss": null, "eval_loss": 0.8457570672035217, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.7708333333333335, "percentage": 44.27, "elapsed_time": "0:42:37", "remaining_time": "0:53:39"}
+{"current_steps": 90, "total_steps": 192, "loss": 0.4429, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 1.228102956599465e-07, "epoch": 1.875, "percentage": 46.88, "elapsed_time": "0:45:11", "remaining_time": "0:51:13"}
+{"current_steps": 90, "total_steps": 192, "loss": null, "eval_loss": 0.8451478481292725, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.875, "percentage": 46.88, "elapsed_time": "0:45:11", "remaining_time": "0:51:13"}
+{"current_steps": 95, "total_steps": 192, "loss": 0.4484, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 9.279207916081227e-08, "epoch": 1.9791666666666665, "percentage": 49.48, "elapsed_time": "0:47:45", "remaining_time": "0:48:45"}
+{"current_steps": 95, "total_steps": 192, "loss": null, "eval_loss": 0.8449902534484863, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 1.9791666666666665, "percentage": 49.48, "elapsed_time": "0:47:45", "remaining_time": "0:48:45"}
+{"current_steps": 100, "total_steps": 192, "loss": 0.3634, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 7.448002404850094e-08, "epoch": 2.0833333333333335, "percentage": 52.08, "elapsed_time": "0:50:22", "remaining_time": "0:46:20"}
+{"current_steps": 100, "total_steps": 192, "loss": null, "eval_loss": 0.8455283641815186, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.0833333333333335, "percentage": 52.08, "elapsed_time": "0:50:22", "remaining_time": "0:46:20"}
+{"current_steps": 105, "total_steps": 192, "loss": 0.3876, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 6.35920070839697e-08, "epoch": 2.1875, "percentage": 54.69, "elapsed_time": "0:52:57", "remaining_time": "0:43:52"}
+{"current_steps": 105, "total_steps": 192, "loss": null, "eval_loss": 0.8467428684234619, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.1875, "percentage": 54.69, "elapsed_time": "0:52:57", "remaining_time": "0:43:52"}
+{"current_steps": 110, "total_steps": 192, "loss": 0.3717, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.7299804687499997e-08, "epoch": 2.2916666666666665, "percentage": 57.29, "elapsed_time": "0:55:32", "remaining_time": "0:41:24"}
+{"current_steps": 110, "total_steps": 192, "loss": null, "eval_loss": 0.8481121063232422, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.2916666666666665, "percentage": 57.29, "elapsed_time": "0:55:32", "remaining_time": "0:41:24"}
+{"current_steps": 115, "total_steps": 192, "loss": 0.387, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.37771434967624e-08, "epoch": 2.3958333333333335, "percentage": 59.9, "elapsed_time": "0:58:07", "remaining_time": "0:38:55"}
+{"current_steps": 115, "total_steps": 192, "loss": null, "eval_loss": 0.8493936061859131, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.3958333333333335, "percentage": 59.9, "elapsed_time": "0:58:07", "remaining_time": "0:38:55"}
+{"current_steps": 120, "total_steps": 192, "loss": 0.3561, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.187403540619925e-08, "epoch": 2.5, "percentage": 62.5, "elapsed_time": "1:00:43", "remaining_time": "0:36:26"}
+{"current_steps": 120, "total_steps": 192, "loss": null, "eval_loss": 0.85052889585495, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.5, "percentage": 62.5, "elapsed_time": "1:00:43", "remaining_time": "0:36:26"}
+{"current_steps": 125, "total_steps": 192, "loss": 0.4219, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.088648238966908e-08, "epoch": 2.6041666666666665, "percentage": 65.1, "elapsed_time": "1:03:17", "remaining_time": "0:33:55"}
+{"current_steps": 125, "total_steps": 192, "loss": null, "eval_loss": 0.8516257405281067, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.6041666666666665, "percentage": 65.1, "elapsed_time": "1:03:17", "remaining_time": "0:33:55"}
+{"current_steps": 130, "total_steps": 192, "loss": 0.3798, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.039701925276604e-08, "epoch": 2.7083333333333335, "percentage": 67.71, "elapsed_time": "1:05:52", "remaining_time": "0:31:25"}
+{"current_steps": 130, "total_steps": 192, "loss": null, "eval_loss": 0.8526514172554016, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.7083333333333335, "percentage": 67.71, "elapsed_time": "1:05:52", "remaining_time": "0:31:25"}
+{"current_steps": 135, "total_steps": 192, "loss": 0.3551, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0166900048082497e-08, "epoch": 2.8125, "percentage": 70.31, "elapsed_time": "1:08:27", "remaining_time": "0:28:54"}
+{"current_steps": 135, "total_steps": 192, "loss": null, "eval_loss": 0.8536917567253113, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.8125, "percentage": 70.31, "elapsed_time": "1:08:27", "remaining_time": "0:28:54"}
+{"current_steps": 140, "total_steps": 192, "loss": 0.3827, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0065147322870076e-08, "epoch": 2.9166666666666665, "percentage": 72.92, "elapsed_time": "1:11:01", "remaining_time": "0:26:22"}
+{"current_steps": 140, "total_steps": 192, "loss": null, "eval_loss": 0.8546140193939209, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 2.9166666666666665, "percentage": 72.92, "elapsed_time": "1:11:01", "remaining_time": "0:26:22"}
+{"current_steps": 145, "total_steps": 192, "loss": 0.3938, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.002328628528332e-08, "epoch": 3.0208333333333335, "percentage": 75.52, "elapsed_time": "1:13:36", "remaining_time": "0:23:51"}
+{"current_steps": 145, "total_steps": 192, "loss": null, "eval_loss": 0.8555943369865417, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.0208333333333335, "percentage": 75.52, "elapsed_time": "1:13:36", "remaining_time": "0:23:51"}
+{"current_steps": 150, "total_steps": 192, "loss": 0.3805, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0007484528133236e-08, "epoch": 3.125, "percentage": 78.12, "elapsed_time": "1:16:13", "remaining_time": "0:21:20"}
+{"current_steps": 150, "total_steps": 192, "loss": null, "eval_loss": 0.8565306663513184, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.125, "percentage": 78.12, "elapsed_time": "1:16:13", "remaining_time": "0:21:20"}
+{"current_steps": 155, "total_steps": 192, "loss": 0.3813, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0002110817570477e-08, "epoch": 3.2291666666666665, "percentage": 80.73, "elapsed_time": "1:18:47", "remaining_time": "0:18:48"}
+{"current_steps": 155, "total_steps": 192, "loss": null, "eval_loss": 0.8574034571647644, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.2291666666666665, "percentage": 80.73, "elapsed_time": "1:18:47", "remaining_time": "0:18:48"}
+{"current_steps": 160, "total_steps": 192, "loss": 0.3894, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000504842356326e-08, "epoch": 3.3333333333333335, "percentage": 83.33, "elapsed_time": "1:21:22", "remaining_time": "0:16:16"}
+{"current_steps": 160, "total_steps": 192, "loss": null, "eval_loss": 0.8581907153129578, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.3333333333333335, "percentage": 83.33, "elapsed_time": "1:21:22", "remaining_time": "0:16:16"}
+{"current_steps": 165, "total_steps": 192, "loss": 0.3603, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.000009745562451e-08, "epoch": 3.4375, "percentage": 85.94, "elapsed_time": "1:23:56", "remaining_time": "0:13:44"}
+{"current_steps": 165, "total_steps": 192, "loss": null, "eval_loss": 0.8588598370552063, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.4375, "percentage": 85.94, "elapsed_time": "1:23:56", "remaining_time": "0:13:44"}
+{"current_steps": 170, "total_steps": 192, "loss": 0.3515, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000014077810156e-08, "epoch": 3.5416666666666665, "percentage": 88.54, "elapsed_time": "1:26:31", "remaining_time": "0:11:11"}
+{"current_steps": 170, "total_steps": 192, "loss": null, "eval_loss": 0.8596634864807129, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.5416666666666665, "percentage": 88.54, "elapsed_time": "1:26:31", "remaining_time": "0:11:11"}
+{"current_steps": 175, "total_steps": 192, "loss": 0.3433, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000001343508807e-08, "epoch": 3.6458333333333335, "percentage": 91.15, "elapsed_time": "1:29:05", "remaining_time": "0:08:39"}
+{"current_steps": 175, "total_steps": 192, "loss": null, "eval_loss": 0.8604967594146729, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.6458333333333335, "percentage": 91.15, "elapsed_time": "1:29:05", "remaining_time": "0:08:39"}
+{"current_steps": 180, "total_steps": 192, "loss": 0.3511, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.000000006747581e-08, "epoch": 3.75, "percentage": 93.75, "elapsed_time": "1:31:40", "remaining_time": "0:06:06"}
+{"current_steps": 180, "total_steps": 192, "loss": null, "eval_loss": 0.861361026763916, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.75, "percentage": 93.75, "elapsed_time": "1:31:40", "remaining_time": "0:06:06"}
+{"current_steps": 185, "total_steps": 192, "loss": 0.3599, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.0000000001094325e-08, "epoch": 3.8541666666666665, "percentage": 96.35, "elapsed_time": "1:34:14", "remaining_time": "0:03:33"}
+{"current_steps": 185, "total_steps": 192, "loss": null, "eval_loss": 0.8619682192802429, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.8541666666666665, "percentage": 96.35, "elapsed_time": "1:34:14", "remaining_time": "0:03:33"}
+{"current_steps": 190, "total_steps": 192, "loss": 0.3994, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": 5.000000000000139e-08, "epoch": 3.9583333333333335, "percentage": 98.96, "elapsed_time": "1:36:50", "remaining_time": "0:01:01"}
+{"current_steps": 190, "total_steps": 192, "loss": null, "eval_loss": 0.8621244430541992, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 3.9583333333333335, "percentage": 98.96, "elapsed_time": "1:36:50", "remaining_time": "0:01:01"}
+{"current_steps": 192, "total_steps": 192, "loss": null, "eval_loss": null, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 4.0, "percentage": 100.0, "elapsed_time": "1:38:45", "remaining_time": "0:00:00"}
+{"current_steps": 3, "total_steps": 3, "loss": null, "eval_loss": 0.8149010539054871, "predict_loss": null, "reward": null, "learning_rate": null, "epoch": 4.0, "percentage": 100.0, "elapsed_time": "1:39:52", "remaining_time": "0:00:00"}
--- a/trainer_state.json
+++ b/trainer_state.json
@@ -0,0 +1,615 @@
+{
+  "best_metric": 0.8149010539054871,
+  "best_model_checkpoint": "./output/training_results/C014_llama3-8b-base_instruct_20240428_005832/checkpoint-15",
+  "epoch": 4.0,
+  "eval_steps": 5,
+  "global_step": 192,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.020833333333333332,
+      "grad_norm": 0.0,
+      "learning_rate": 0.0,
+      "loss": 0.9832,
+      "step": 1
+    },
+    {
+      "epoch": 0.020833333333333332,
+      "eval_loss": 0.9730262160301208,
+      "eval_runtime": 2.116,
+      "eval_samples_per_second": 160.683,
+      "eval_steps_per_second": 1.418,
+      "step": 1
+    },
+    {
+      "epoch": 0.10416666666666667,
+      "grad_norm": 16.41232960271884,
+      "learning_rate": 1.5e-06,
+      "loss": 0.9463,
+      "step": 5
+    },
+    {
+      "epoch": 0.10416666666666667,
+      "eval_loss": 0.9420890212059021,
+      "eval_runtime": 2.0786,
+      "eval_samples_per_second": 163.572,
+      "eval_steps_per_second": 1.443,
+      "step": 5
+    },
+    {
+      "epoch": 0.20833333333333334,
+      "grad_norm": 6.185215383740627,
+      "learning_rate": 5.25e-06,
+      "loss": 0.8488,
+      "step": 10
+    },
+    {
+      "epoch": 0.20833333333333334,
+      "eval_loss": 0.8247124552726746,
+      "eval_runtime": 2.0863,
+      "eval_samples_per_second": 162.965,
+      "eval_steps_per_second": 1.438,
+      "step": 10
+    },
+    {
+      "epoch": 0.3125,
+      "grad_norm": 4.677780246708798,
+      "learning_rate": 9e-06,
+      "loss": 0.7833,
+      "step": 15
+    },
+    {
+      "epoch": 0.3125,
+      "eval_loss": 0.8149010539054871,
+      "eval_runtime": 2.0708,
+      "eval_samples_per_second": 164.186,
+      "eval_steps_per_second": 1.449,
+      "step": 15
+    },
+    {
+      "epoch": 0.4166666666666667,
+      "grad_norm": 4.282490348738236,
+      "learning_rate": 1.275e-05,
+      "loss": 0.7797,
+      "step": 20
+    },
+    {
+      "epoch": 0.4166666666666667,
+      "eval_loss": 0.8403318524360657,
+      "eval_runtime": 2.076,
+      "eval_samples_per_second": 163.776,
+      "eval_steps_per_second": 1.445,
+      "step": 20
+    },
+    {
+      "epoch": 0.5208333333333334,
+      "grad_norm": 4.312240371628775,
+      "learning_rate": 1.3195176200175283e-05,
+      "loss": 0.8542,
+      "step": 25
+    },
+    {
+      "epoch": 0.5208333333333334,
+      "eval_loss": 0.8670275807380676,
+      "eval_runtime": 2.0781,
+      "eval_samples_per_second": 163.608,
+      "eval_steps_per_second": 1.444,
+      "step": 25
+    },
+    {
+      "epoch": 0.625,
+      "grad_norm": 4.2373297823136244,
+      "learning_rate": 9.515676612044427e-06,
+      "loss": 0.8895,
+      "step": 30
+    },
+    {
+      "epoch": 0.625,
+      "eval_loss": 0.8718018531799316,
+      "eval_runtime": 2.0707,
+      "eval_samples_per_second": 164.196,
+      "eval_steps_per_second": 1.449,
+      "step": 30
+    },
+    {
+      "epoch": 0.7291666666666666,
+      "grad_norm": 4.44083784051028,
+      "learning_rate": 6.797580677308734e-06,
+      "loss": 0.8519,
+      "step": 35
+    },
+    {
+      "epoch": 0.7291666666666666,
+      "eval_loss": 0.859227180480957,
+      "eval_runtime": 2.0671,
+      "eval_samples_per_second": 164.485,
+      "eval_steps_per_second": 1.451,
+      "step": 35
+    },
+    {
+      "epoch": 0.8333333333333334,
+      "grad_norm": 4.131620700380954,
+      "learning_rate": 4.808575415542887e-06,
+      "loss": 0.8224,
+      "step": 40
+    },
+    {
+      "epoch": 0.8333333333333334,
+      "eval_loss": 0.8491263389587402,
+      "eval_runtime": 2.0743,
+      "eval_samples_per_second": 163.912,
+      "eval_steps_per_second": 1.446,
+      "step": 40
+    },
+    {
+      "epoch": 0.9375,
+      "grad_norm": 4.319858409892453,
+      "learning_rate": 3.3676619069852654e-06,
+      "loss": 0.8538,
+      "step": 45
+    },
+    {
+      "epoch": 0.9375,
+      "eval_loss": 0.8384072780609131,
+      "eval_runtime": 2.0776,
+      "eval_samples_per_second": 163.653,
+      "eval_steps_per_second": 1.444,
+      "step": 45
+    },
+    {
+      "epoch": 1.0416666666666667,
+      "grad_norm": 3.80418376995363,
+      "learning_rate": 2.334947896124909e-06,
+      "loss": 0.6569,
+      "step": 50
+    },
+    {
+      "epoch": 1.0416666666666667,
+      "eval_loss": 0.8294973373413086,
+      "eval_runtime": 2.0689,
+      "eval_samples_per_second": 164.335,
+      "eval_steps_per_second": 1.45,
+      "step": 50
+    },
+    {
+      "epoch": 1.1458333333333333,
+      "grad_norm": 3.3894455787756694,
+      "learning_rate": 1.603233215095547e-06,
+      "loss": 0.437,
+      "step": 55
+    },
+    {
+      "epoch": 1.1458333333333333,
+      "eval_loss": 0.8457258343696594,
+      "eval_runtime": 2.0842,
+      "eval_samples_per_second": 163.13,
+      "eval_steps_per_second": 1.439,
+      "step": 55
+    },
+    {
+      "epoch": 1.25,
+      "grad_norm": 3.633089838413966,
+      "learning_rate": 1.0911174606561334e-06,
+      "loss": 0.4405,
+      "step": 60
+    },
+    {
+      "epoch": 1.25,
+      "eval_loss": 0.8668487071990967,
+      "eval_runtime": 2.0758,
+      "eval_samples_per_second": 163.796,
+      "eval_steps_per_second": 1.445,
+      "step": 60
+    },
+    {
+      "epoch": 1.3541666666666667,
+      "grad_norm": 4.225857095854057,
+      "learning_rate": 7.373930741131784e-07,
+      "loss": 0.4331,
+      "step": 65
+    },
+    {
+      "epoch": 1.3541666666666667,
+      "eval_loss": 0.8670875430107117,
+      "eval_runtime": 2.0785,
+      "eval_samples_per_second": 163.58,
+      "eval_steps_per_second": 1.443,
+      "step": 65
+    },
+    {
+      "epoch": 1.4583333333333333,
+      "grad_norm": 3.838684962723822,
+      "learning_rate": 5.374210410959207e-07,
+      "loss": 0.448,
+      "step": 70
+    },
+    {
+      "epoch": 1.4583333333333333,
+      "eval_loss": 0.8596971035003662,
+      "eval_runtime": 2.0789,
+      "eval_samples_per_second": 163.548,
+      "eval_steps_per_second": 1.443,
+      "step": 70
+    },
+    {
+      "epoch": 1.5625,
+      "grad_norm": 3.8823844934735114,
+      "learning_rate": 3.6222476698215175e-07,
+      "loss": 0.4673,
+      "step": 75
+    },
+    {
+      "epoch": 1.5625,
+      "eval_loss": 0.8513818383216858,
+      "eval_runtime": 2.0778,
+      "eval_samples_per_second": 163.638,
+      "eval_steps_per_second": 1.444,
+      "step": 75
+    },
+    {
+      "epoch": 1.6666666666666665,
+      "grad_norm": 3.2398255350683103,
+      "learning_rate": 2.462755297384099e-07,
+      "loss": 0.4298,
+      "step": 80
+    },
+    {
+      "epoch": 1.6666666666666665,
+      "eval_loss": 0.8474181294441223,
+      "eval_runtime": 2.0907,
+      "eval_samples_per_second": 162.623,
+      "eval_steps_per_second": 1.435,
+      "step": 80
+    },
+    {
+      "epoch": 1.7708333333333335,
+      "grad_norm": 3.153318195454539,
+      "learning_rate": 1.7088740175034947e-07,
+      "loss": 0.4252,
+      "step": 85
+    },
+    {
+      "epoch": 1.7708333333333335,
+      "eval_loss": 0.8457570672035217,
+      "eval_runtime": 2.0841,
+      "eval_samples_per_second": 163.139,
+      "eval_steps_per_second": 1.439,
+      "step": 85
+    },
+    {
+      "epoch": 1.875,
+      "grad_norm": 3.9154872471233073,
+      "learning_rate": 1.228102956599465e-07,
+      "loss": 0.4429,
+      "step": 90
+    },
+    {
+      "epoch": 1.875,
+      "eval_loss": 0.8451478481292725,
+      "eval_runtime": 2.0694,
+      "eval_samples_per_second": 164.3,
+      "eval_steps_per_second": 1.45,
+      "step": 90
+    },
+    {
+      "epoch": 1.9791666666666665,
+      "grad_norm": 4.304265882610879,
+      "learning_rate": 9.279207916081227e-08,
+      "loss": 0.4484,
+      "step": 95
+    },
+    {
+      "epoch": 1.9791666666666665,
+      "eval_loss": 0.8449902534484863,
+      "eval_runtime": 2.0701,
+      "eval_samples_per_second": 164.241,
+      "eval_steps_per_second": 1.449,
+      "step": 95
+    },
+    {
+      "epoch": 2.0833333333333335,
+      "grad_norm": 3.2728230120401633,
+      "learning_rate": 7.448002404850094e-08,
+      "loss": 0.3634,
+      "step": 100
+    },
+    {
+      "epoch": 2.0833333333333335,
+      "eval_loss": 0.8455283641815186,
+      "eval_runtime": 2.0713,
+      "eval_samples_per_second": 164.145,
+      "eval_steps_per_second": 1.448,
+      "step": 100
+    },
+    {
+      "epoch": 2.1875,
+      "grad_norm": 3.660020107151519,
+      "learning_rate": 6.35920070839697e-08,
+      "loss": 0.3876,
+      "step": 105
+    },
+    {
+      "epoch": 2.1875,
+      "eval_loss": 0.8467428684234619,
+      "eval_runtime": 2.0936,
+      "eval_samples_per_second": 162.401,
+      "eval_steps_per_second": 1.433,
+      "step": 105
+    },
+    {
+      "epoch": 2.2916666666666665,
+      "grad_norm": 3.8970751627622926,
+      "learning_rate": 5.7299804687499997e-08,
+      "loss": 0.3717,
+      "step": 110
+    },
+    {
+      "epoch": 2.2916666666666665,
+      "eval_loss": 0.8481121063232422,
+      "eval_runtime": 2.0678,
+      "eval_samples_per_second": 164.429,
+      "eval_steps_per_second": 1.451,
+      "step": 110
+    },
+    {
+      "epoch": 2.3958333333333335,
+      "grad_norm": 3.386652715934595,
+      "learning_rate": 5.37771434967624e-08,
+      "loss": 0.387,
+      "step": 115
+    },
+    {
+      "epoch": 2.3958333333333335,
+      "eval_loss": 0.8493936061859131,
+      "eval_runtime": 2.1051,
+      "eval_samples_per_second": 161.51,
+      "eval_steps_per_second": 1.425,
+      "step": 115
+    },
+    {
+      "epoch": 2.5,
+      "grad_norm": 3.4200942052169547,
+      "learning_rate": 5.187403540619925e-08,
+      "loss": 0.3561,
+      "step": 120
+    },
+    {
+      "epoch": 2.5,
+      "eval_loss": 0.85052889585495,
+      "eval_runtime": 2.0652,
+      "eval_samples_per_second": 164.632,
+      "eval_steps_per_second": 1.453,
+      "step": 120
+    },
+    {
+      "epoch": 2.6041666666666665,
+      "grad_norm": 3.268980501701993,
+      "learning_rate": 5.088648238966908e-08,
+      "loss": 0.4219,
+      "step": 125
+    },
+    {
+      "epoch": 2.6041666666666665,
+      "eval_loss": 0.8516257405281067,
+      "eval_runtime": 2.1146,
+      "eval_samples_per_second": 160.788,
+      "eval_steps_per_second": 1.419,
+      "step": 125
+    },
+    {
+      "epoch": 2.7083333333333335,
+      "grad_norm": 3.4285942542360806,
+      "learning_rate": 5.039701925276604e-08,
+      "loss": 0.3798,
+      "step": 130
+    },
+    {
+      "epoch": 2.7083333333333335,
+      "eval_loss": 0.8526514172554016,
+      "eval_runtime": 2.1018,
+      "eval_samples_per_second": 161.768,
+      "eval_steps_per_second": 1.427,
+      "step": 130
+    },
+    {
+      "epoch": 2.8125,
+      "grad_norm": 3.438575160339058,
+      "learning_rate": 5.0166900048082497e-08,
+      "loss": 0.3551,
+      "step": 135
+    },
+    {
+      "epoch": 2.8125,
+      "eval_loss": 0.8536917567253113,
+      "eval_runtime": 2.1025,
+      "eval_samples_per_second": 161.713,
+      "eval_steps_per_second": 1.427,
+      "step": 135
+    },
+    {
+      "epoch": 2.9166666666666665,
+      "grad_norm": 3.1199400563472683,
+      "learning_rate": 5.0065147322870076e-08,
+      "loss": 0.3827,
+      "step": 140
+    },
+    {
+      "epoch": 2.9166666666666665,
+      "eval_loss": 0.8546140193939209,
+      "eval_runtime": 2.0898,
+      "eval_samples_per_second": 162.691,
+      "eval_steps_per_second": 1.436,
+      "step": 140
+    },
+    {
+      "epoch": 3.0208333333333335,
+      "grad_norm": 3.1711921144705,
+      "learning_rate": 5.002328628528332e-08,
+      "loss": 0.3938,
+      "step": 145
+    },
+    {
+      "epoch": 3.0208333333333335,
+      "eval_loss": 0.8555943369865417,
+      "eval_runtime": 2.0827,
+      "eval_samples_per_second": 163.25,
+      "eval_steps_per_second": 1.44,
+      "step": 145
+    },
+    {
+      "epoch": 3.125,
+      "grad_norm": 3.133782976096458,
+      "learning_rate": 5.0007484528133236e-08,
+      "loss": 0.3805,
+      "step": 150
+    },
+    {
+      "epoch": 3.125,
+      "eval_loss": 0.8565306663513184,
+      "eval_runtime": 2.1024,
+      "eval_samples_per_second": 161.723,
+      "eval_steps_per_second": 1.427,
+      "step": 150
+    },
+    {
+      "epoch": 3.2291666666666665,
+      "grad_norm": 3.7319435280210085,
+      "learning_rate": 5.0002110817570477e-08,
+      "loss": 0.3813,
+      "step": 155
+    },
+    {
+      "epoch": 3.2291666666666665,
+      "eval_loss": 0.8574034571647644,
+      "eval_runtime": 2.0911,
+      "eval_samples_per_second": 162.593,
+      "eval_steps_per_second": 1.435,
+      "step": 155
+    },
+    {
+      "epoch": 3.3333333333333335,
+      "grad_norm": 3.5844045117334833,
+      "learning_rate": 5.0000504842356326e-08,
+      "loss": 0.3894,
+      "step": 160
+    },
+    {
+      "epoch": 3.3333333333333335,
+      "eval_loss": 0.8581907153129578,
+      "eval_runtime": 2.0963,
+      "eval_samples_per_second": 162.194,
+      "eval_steps_per_second": 1.431,
+      "step": 160
+    },
+    {
+      "epoch": 3.4375,
+      "grad_norm": 3.2964992218641544,
+      "learning_rate": 5.000009745562451e-08,
+      "loss": 0.3603,
+      "step": 165
+    },
+    {
+      "epoch": 3.4375,
+      "eval_loss": 0.8588598370552063,
+      "eval_runtime": 2.0794,
+      "eval_samples_per_second": 163.512,
+      "eval_steps_per_second": 1.443,
+      "step": 165
+    },
+    {
+      "epoch": 3.5416666666666665,
+      "grad_norm": 3.307148767623163,
+      "learning_rate": 5.0000014077810156e-08,
+      "loss": 0.3515,
+      "step": 170
+    },
+    {
+      "epoch": 3.5416666666666665,
+      "eval_loss": 0.8596634864807129,
+      "eval_runtime": 2.0755,
+      "eval_samples_per_second": 163.816,
+      "eval_steps_per_second": 1.445,
+      "step": 170
+    },
+    {
+      "epoch": 3.6458333333333335,
+      "grad_norm": 3.3334351206179402,
+      "learning_rate": 5.0000001343508807e-08,
+      "loss": 0.3433,
+      "step": 175
+    },
+    {
+      "epoch": 3.6458333333333335,
+      "eval_loss": 0.8604967594146729,
+      "eval_runtime": 2.0699,
+      "eval_samples_per_second": 164.261,
+      "eval_steps_per_second": 1.449,
+      "step": 175
+    },
+    {
+      "epoch": 3.75,
+      "grad_norm": 3.196293836404165,
+      "learning_rate": 5.000000006747581e-08,
+      "loss": 0.3511,
+      "step": 180
+    },
+    {
+      "epoch": 3.75,
+      "eval_loss": 0.861361026763916,
+      "eval_runtime": 2.0796,
+      "eval_samples_per_second": 163.491,
+      "eval_steps_per_second": 1.443,
+      "step": 180
+    },
+    {
+      "epoch": 3.8541666666666665,
+      "grad_norm": 3.472738636185267,
+      "learning_rate": 5.0000000001094325e-08,
+      "loss": 0.3599,
+      "step": 185
+    },
+    {
+      "epoch": 3.8541666666666665,
+      "eval_loss": 0.8619682192802429,
+      "eval_runtime": 2.0705,
+      "eval_samples_per_second": 164.215,
+      "eval_steps_per_second": 1.449,
+      "step": 185
+    },
+    {
+      "epoch": 3.9583333333333335,
+      "grad_norm": 3.6408101963860187,
+      "learning_rate": 5.000000000000139e-08,
+      "loss": 0.3994,
+      "step": 190
+    },
+    {
+      "epoch": 3.9583333333333335,
+      "eval_loss": 0.8621244430541992,
+      "eval_runtime": 2.0725,
+      "eval_samples_per_second": 164.052,
+      "eval_steps_per_second": 1.448,
+      "step": 190
+    },
+    {
+      "epoch": 4.0,
+      "step": 192,
+      "total_flos": 5360548577280.0,
+      "train_loss": 0.507965192819635,
+      "train_runtime": 5963.6843,
+      "train_samples_per_second": 2.048,
+      "train_steps_per_second": 0.032
+    }
+  ],
+  "logging_steps": 5,
+  "max_steps": 192,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 4,
+  "save_steps": 5,
+  "total_flos": 5360548577280.0,
+  "train_batch_size": 8,
+  "trial_name": null,
+  "trial_params": null
+}
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b6f5cc57e4de6a6111af868bf438bb519b737c60668638ce3523812799fd88b3
+size 6968
--- a/training_eval_loss.png
+++ b/training_eval_loss.png
--- a/training_loss.png
+++ b/training_loss.png
@@ -0,0 +1 @@
+{"framework": "pytorch", "task": "text-generation", "allow_remote": true}