Initialize project; model provided by the ModelHub XC community
Model: reaperdoesntknow/Qemma-sft Source: Original Platform
36
.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
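The patterns above route large binary artifacts through Git LFS. As a rough illustration only, the sketch below approximates that matching with Python's `fnmatch` on the file name. Real gitattributes matching follows gitignore-style rules (notably for `saved_model/**/*`), and both the helper name `is_lfs_tracked` and the shortened pattern list are assumptions made for this example.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes file above.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.pt", "*.pth", "*tfevents*", "tokenizer.json"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the file name match any of the LFS patterns?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)


for candidate in ["pytorch_model.bin", "tokenizer.json", "config.json", "README.md"]:
    print(candidate, is_lfs_tracked(candidate))
```

This is consistent with the rest of the commit, where `pytorch_model.bin` and `tokenizer.json` appear as three-line LFS pointers while `config.json` is stored as plain text.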
140
README.md
Normal file
@@ -0,0 +1,140 @@
---
library_name: transformers
model_name: Qemma-sft
tags:
- generated_from_trainer
- sft
- trl
- convergentintel
licence: license
license: osl-3.0
datasets:
- O1-OPEN/OpenO1-SFT
- yahma/alpaca-cleaned
- Jackrong/gpt-oss-120b-reasoning-STEM-5K
language:
- en
base_model:
- google/gemma-3-1b-it
- Qwen/Qwen3-0.6B
pipeline_tag: text-generation
---

# Model Card for Qemma

**Qemma** is a Hugging Face-native hybrid model that merges **Gemma-3 (1B)** and **Qwen-3 (0.6B)** at the weight level (no adapters).
Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma’s hidden size. The model is then SFT-tuned for stepwise reasoning.

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-sft"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Load the weights in bfloat16 and switch to inference mode.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

messages = [{"role": "user", "content": "Explain finite-scale discrepancy Δ_r in one paragraph."}]
# Render the conversation with the bundled chat template; this returns a tensor of input IDs.
inputs = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Sample up to 256 new tokens.
out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# The decoded text contains the prompt followed by the model's reply.
print(tok.decode(out[0], skip_special_tokens=True))
```

## What’s inside

* **Architecture:** Gemma-3 backbone (26 layers, hidden 1152, MLP 6912) with **Qwen-style attention** regrouped to Gemma’s 4×256 heads.
* **Tokenizer:** Gemma-3 tokenizer and chat template (see `chat_template.jinja`).
* **Training:** SFT for instruction following and stepwise reasoning.
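
To make the "4×256 heads" figure concrete, here is a small arithmetic sketch of the projection widths those numbers imply. The key/value head count of 1 comes from `num_key_value_heads` in the repository's `config.json`; the helper name `attention_shapes` and the grouped-query layout it assumes are illustrative and say nothing about how the merge itself was carried out.

```python
def attention_shapes(hidden_size: int, num_heads: int, head_dim: int, num_kv_heads: int) -> dict:
    """Return (in_features, out_features) for each attention projection."""
    q_width = num_heads * head_dim       # total query width
    kv_width = num_kv_heads * head_dim   # key/value width shared across query heads
    return {
        "q_proj": (hidden_size, q_width),
        "k_proj": (hidden_size, kv_width),
        "v_proj": (hidden_size, kv_width),
        "o_proj": (q_width, hidden_size),
    }


shapes = attention_shapes(hidden_size=1152, num_heads=4, head_dim=256, num_kv_heads=1)
for name, shape in shapes.items():
    print(name, shape)
```

Note that 4 × 256 = 1024 differs from the hidden size of 1152, so the query and output projections are not square.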

## Intended use & limitations

**Use:** research, instruction following, coding help, analysis, further SFT/RLHF.
**Limits:** may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses.

## Training procedure

* ~512 warm-start steps (Alpaca-style data)
* 256 additional pretraining steps on `O1-OPEN/OpenO1-SFT`
* 128 SFT steps on `Jackrong/gpt-oss-120b-reasoning-STEM-5K`
* 256 SFT steps on `O1-OPEN/OpenO1-SFT`
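
As a quick tally of the schedule above, the sketch below sums the listed step counts and shows each stage's share. The stage labels are shorthand invented for this example, and the first entry is approximate because the card gives it as ~512.

```python
# Step counts as listed in the training procedure (the first value is approximate).
stages = [
    ("warm-start (Alpaca-style data)", 512),
    ("additional pretraining (OpenO1-SFT)", 256),
    ("SFT (gpt-oss-120b-reasoning-STEM-5K)", 128),
    ("SFT (OpenO1-SFT)", 256),
]

total_steps = sum(steps for _, steps in stages)
for name, steps in stages:
    print(f"{name}: {steps} steps ({steps / total_steps:.0%} of total)")
print("total:", total_steps)
```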

### Framework versions

* TRL: 0.25.0
* Transformers: 4.57.1
* PyTorch: 2.8.0+cpu
* Datasets: 4.4.1
* Tokenizers: 0.22.1

## Discrepancy Calculus Foundation

This model is part of the [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow) portfolio. All models in this portfolio are developed under the Discrepancy Calculus (DISC) framework, a measure-theoretic approach to understanding and controlling the gap between what a model *should* produce and what it *actually* produces.

DISC treats training singularities (loss plateaus, mode collapse, catastrophic forgetting) not as failures to be smoothed over, but as **structural signals** that reveal the geometry of the learning problem. Key concepts:

- **Discrepancy Operator (D):** Measures the gap between expected and observed behavior at each training step
- **Jump Sets:** Boundaries where model behavior changes discontinuously; these are *features*, not bugs
- **Ghost Imprinting:** Teacher knowledge that transfers to student models through weight-space topology rather than explicit distillation signal

For the full mathematical treatment, see [Discrepancy Calculus: Foundations and Core Theory](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) (DOI: 10.57967/hf/8194).

**Citation chain:** [Structure Over Scale](https://huggingface.co/reaperdoesntknow/Structure-Over-Scale) (DOI: 10.57967/hf/8165) → [Three Teachers to Dual Cognition](https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy) (DOI: 10.57967/hf/8184) → [Discrepancy Calculus](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) (DOI: 10.57967/hf/8194)

## Citations

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```

---

## Convergent Intelligence Portfolio

*By [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*

### Top Models from Our Lab

| Model | Downloads |
|-------|-----------|
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |

**Total Portfolio: 41 models | 2,781 total downloads**

*Last updated: 2026-03-28 12:57 UTC*

<!-- CIX-CROSSLINK-START -->

---

## From the Convergent Intelligence Portfolio

**[DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)**: Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.

Top model: [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) (508 downloads)

Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

*Convergent Intelligence LLC: Research Division*

<!-- CIX-CROSSLINK-END -->

<!-- cix-keeper-ts:2026-05-28T13:16:14Z -->
47
chat_template.jinja
Normal file
@@ -0,0 +1,47 @@
{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '

' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
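For readers less familiar with Jinja, the following is a plain-Python sketch of what the template above produces for text-only conversations. The function name `render_prompt` is hypothetical, image items and list-typed content are left out, and the `<bos>` default is taken from `special_tokens_map.json`. The Jinja template remains the source of truth.

```python
def render_prompt(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Mirror the chat template for messages whose content is a plain string."""
    prefix = ""
    if messages and messages[0]["role"] == "system":
        # The system message is folded into the first user turn.
        prefix = messages[0]["content"] + "\n\n"
        messages = messages[1:]

    parts = [bos_token]
    for index, message in enumerate(messages):
        # Turns must alternate user, assistant, user, ...
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        # The template renames the "assistant" role to "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append("<start_of_turn>" + role + "\n" + (prefix if index == 0 else ""))
        parts.append(message["content"].strip())
        parts.append("<end_of_turn>\n")
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(render_prompt([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
]))
```

The sketch makes two behaviors of the template easy to see: there is no separate system turn, and the prompt ends with an open `model` turn when `add_generation_prompt` is set.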
64
config.json
Normal file
@@ -0,0 +1,64 @@
{
  "_sliding_window_pattern": 6,
  "architectures": [
    "Gemma3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "dtype": "float32",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 1152,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "layer_types": [
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention"
  ],
  "max_position_embeddings": 32768,
  "model_type": "gemma3_text",
  "num_attention_heads": 4,
  "num_hidden_layers": 26,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 256,
  "rms_norm_eps": 1e-06,
  "rope_local_base_freq": 10000,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": 512,
  "sliding_window_pattern": 6,
  "transformers_version": "4.57.1",
  "use_bidirectional_attention": false,
  "use_cache": true,
  "vocab_size": 262149
}
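The `layer_types` list above follows a regular pattern: with `sliding_window_pattern` set to 6, every sixth layer uses full attention and the others use sliding-window attention. A minimal sketch that regenerates the list from those two numbers follows; the helper name `build_layer_types` is an assumption for illustration, not a Transformers API.

```python
def build_layer_types(num_hidden_layers: int, sliding_window_pattern: int) -> list[str]:
    """Every `sliding_window_pattern`-th layer is full attention; the rest slide."""
    return [
        "full_attention" if (index + 1) % sliding_window_pattern == 0 else "sliding_attention"
        for index in range(num_hidden_layers)
    ]


layer_types = build_layer_types(num_hidden_layers=26, sliding_window_pattern=6)
full_layers = [i for i, kind in enumerate(layer_types) if kind == "full_attention"]
print(full_layers)  # zero-based indices of the full-attention layers: [5, 11, 17, 23]
```

With 26 layers this gives four full-attention layers, and the last two layers fall back to sliding attention, matching the list in `config.json`.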
16
generation_config.json
Normal file
@@ -0,0 +1,16 @@
{
  "bos_token_id": 2,
  "do_sample": true,
  "eos_token_id": [
    1
  ],
  "max_new_tokens": 1024,
  "no_repeat_ngram_size": 3,
  "pad_token_id": 0,
  "repetition_penalty": 1.05,
  "temperature": 0.6,
  "top_k": 40,
  "top_p": 0.9,
  "transformers_version": "4.57.1",
  "typical_p": 0.95
}
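To give a feel for what the `temperature`, `top_k` and `top_p` values above do to a next-token distribution, here is a pure-Python sketch on toy logits. It is not the Transformers implementation: it applies the three filters in one common order, ignores `typical_p`, `repetition_penalty` and `no_repeat_ngram_size`, and the function name `filter_distribution` and the toy logits are made up for this example.

```python
import math


def filter_distribution(logits, temperature=0.6, top_k=40, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [value / temperature for value in logits]
    # Keep the top_k highest-scoring tokens, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [weight / total for weight in weights]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in zip(ranked, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(filter_distribution(toy_logits))
```

A temperature below 1 sharpens the distribution before the nucleus cut, so on these toy logits only the two most likely tokens survive.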
20
generation_config_think.json
Normal file
@@ -0,0 +1,20 @@
{
  "bos_token_id": 2,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "do_sample": true,

  "max_new_tokens": 1024,
  "max_length": 4096,

  "temperature": 0.4,
  "top_p": 0.90,
  "top_k": 40,

  "no_repeat_ngram_size": 3,
  "repetition_penalty": 1.07,

  "stop_sequences": ["</think>", "</reasoning_step>"],
  "use_cache": true,
  "transformers_version": "4.57.1"
}
3
pytorch_model.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:05e46900baec9b6ba50bf9366c96126e86c95ba97c3c77716e41f95f128db04b
size 3999677491
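The three lines above are a Git LFS pointer rather than the weights themselves. As a small sketch, the snippet below parses such a pointer and turns the byte count into a size and a rough parameter estimate. The function name `parse_lfs_pointer` is hypothetical, and the estimate assumes the file is almost entirely float32 tensors, in line with `"dtype": "float32"` in `config.json`.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:05e46900baec9b6ba50bf9366c96126e86c95ba97c3c77716e41f95f128db04b
size 3999677491
"""

info = parse_lfs_pointer(pointer)
print(f"{info['size'] / 1e9:.2f} GB")
# Rough estimate: 4 bytes per parameter, ignoring file-format overhead.
print(f"~{info['size'] / 4 / 1e6:.0f}M parameters")
```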
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:99c03c60a5f720fef850a03e2fa14ce6ce3e4664af919595a6b5b814bee61e13
size 30744
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eb7349612b0afefab17fdcdc50b3ea50319cd25e76f44b31130bbdc967e19282
size 30736
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:37f34873021fe23a5e49d060f4dec31e27a557f1036fb908a829895bc7d4cf41
size 18737
33
special_tokens_map.json
Normal file
@@ -0,0 +1,33 @@
{
  "boi_token": "<start_of_image>",
  "bos_token": {
    "content": "<bos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eoi_token": "<end_of_image>",
  "eos_token": {
    "content": "<eos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "image_token": "<image_soft_token>",
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
3
tokenizer.json
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4bfdbe18ca9d8be4e21fe6e8c46a19833b09560053bb9e9dbe71a9b6033e709b
size 33385589
51384
tokenizer_config.json
Normal file
File diff suppressed because it is too large
674
trainer_state.json
Normal file
@@ -0,0 +1,674 @@
{
  "best_global_step": null,
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 16.0,
  "eval_steps": 500,
  "global_step": 64,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "entropy": 1.578833818435669,
      "epoch": 0.25,
      "grad_norm": 1.233445644378662,
      "learning_rate": 0.0003,
      "loss": 1.0328,
      "mean_token_accuracy": 0.7567394971847534,
      "num_tokens": 8192.0,
      "step": 1
    },
    {
      "entropy": 1.3467954397201538,
      "epoch": 0.5,
      "grad_norm": 16.553691864013672,
      "learning_rate": 0.00029981931843077583,
      "loss": 4.6647,
      "mean_token_accuracy": 0.3241828382015228,
      "num_tokens": 16384.0,
      "step": 2
    },
    {
      "entropy": 1.178613543510437,
      "epoch": 0.75,
      "grad_norm": 6.595342636108398,
      "learning_rate": 0.00029927770900082954,
      "loss": 2.7709,
      "mean_token_accuracy": 0.44225722551345825,
      "num_tokens": 24576.0,
      "step": 3
    },
    {
      "entropy": 1.4147611856460571,
      "epoch": 1.0,
      "grad_norm": 5.04085636138916,
      "learning_rate": 0.00029837647649471715,
      "loss": 2.766,
      "mean_token_accuracy": 0.4439714550971985,
      "num_tokens": 32768.0,
      "step": 4
    },
    {
      "entropy": 1.7952088117599487,
      "epoch": 1.25,
      "grad_norm": 2.6215267181396484,
      "learning_rate": 0.00029711779206048454,
      "loss": 1.6175,
      "mean_token_accuracy": 0.605218768119812,
      "num_tokens": 40960.0,
      "step": 5
    },
    {
      "entropy": 1.8796031475067139,
      "epoch": 1.5,
      "grad_norm": 3.3029890060424805,
      "learning_rate": 0.0002955046879791816,
      "loss": 1.6483,
      "mean_token_accuracy": 0.5845749378204346,
      "num_tokens": 49152.0,
      "step": 6
    },
    {
      "entropy": 2.1823692321777344,
      "epoch": 1.75,
      "grad_norm": 2.3034701347351074,
      "learning_rate": 0.0002935410503598313,
      "loss": 1.7309,
      "mean_token_accuracy": 0.5875270366668701,
      "num_tokens": 57344.0,
      "step": 7
    },
    {
      "entropy": 1.8705880641937256,
      "epoch": 2.0,
      "grad_norm": 2.278149127960205,
      "learning_rate": 0.00029123160977745306,
      "loss": 1.7439,
      "mean_token_accuracy": 0.5833333134651184,
      "num_tokens": 65536.0,
      "step": 8
    },
    {
      "entropy": 1.9831080436706543,
      "epoch": 2.25,
      "grad_norm": 1.9059035778045654,
      "learning_rate": 0.000288581929876693,
      "loss": 1.4867,
      "mean_token_accuracy": 0.6460188627243042,
      "num_tokens": 73728.0,
      "step": 9
    },
    {
      "entropy": 1.8127729892730713,
      "epoch": 2.5,
      "grad_norm": 1.695867896080017,
      "learning_rate": 0.0002855983939685165,
      "loss": 1.2372,
      "mean_token_accuracy": 0.6903008222579956,
      "num_tokens": 81920.0,
      "step": 10
    },
    {
      "entropy": 1.6545960903167725,
      "epoch": 2.75,
      "grad_norm": 1.8578449487686157,
      "learning_rate": 0.0002822881896522532,
      "loss": 1.2116,
      "mean_token_accuracy": 0.6871835589408875,
      "num_tokens": 90112.0,
      "step": 11
    },
    {
      "entropy": 1.5270276069641113,
      "epoch": 3.0,
      "grad_norm": 1.87300705909729,
      "learning_rate": 0.0002786592915000408,
      "loss": 1.0224,
      "mean_token_accuracy": 0.7442499995231628,
      "num_tokens": 98304.0,
      "step": 12
    },
    {
      "entropy": 1.30039381980896,
      "epoch": 3.25,
      "grad_norm": 1.6118026971817017,
      "learning_rate": 0.0002747204418453818,
      "loss": 0.825,
      "mean_token_accuracy": 0.7948048710823059,
      "num_tokens": 106496.0,
      "step": 13
    },
    {
      "entropy": 1.1720249652862549,
      "epoch": 3.5,
      "grad_norm": 1.5566579103469849,
      "learning_rate": 0.0002704811297220967,
      "loss": 0.8768,
      "mean_token_accuracy": 0.7644100785255432,
      "num_tokens": 114688.0,
      "step": 14
    },
    {
      "entropy": 1.0196048021316528,
      "epoch": 3.75,
      "grad_norm": 1.6275376081466675,
      "learning_rate": 0.0002659515680044105,
      "loss": 0.7812,
      "mean_token_accuracy": 0.7838043570518494,
      "num_tokens": 122880.0,
      "step": 15
    },
    {
      "entropy": 0.923040509223938,
      "epoch": 4.0,
      "grad_norm": 1.8159480094909668,
      "learning_rate": 0.00026114266880324387,
      "loss": 0.7571,
      "mean_token_accuracy": 0.7924362421035767,
      "num_tokens": 131072.0,
      "step": 16
    },
    {
      "entropy": 0.8564470410346985,
      "epoch": 4.25,
      "grad_norm": 1.325358271598816,
      "learning_rate": 0.00025606601717798207,
      "loss": 0.5237,
      "mean_token_accuracy": 0.8597344756126404,
      "num_tokens": 139264.0,
      "step": 17
    },
    {
      "entropy": 0.8045864701271057,
      "epoch": 4.5,
      "grad_norm": 1.4215672016143799,
      "learning_rate": 0.00025073384322705274,
      "loss": 0.4804,
      "mean_token_accuracy": 0.8718394637107849,
      "num_tokens": 147456.0,
      "step": 18
    },
    {
      "entropy": 0.59234619140625,
      "epoch": 4.75,
      "grad_norm": 1.8063968420028687,
      "learning_rate": 0.0002451589926245468,
      "loss": 0.4662,
      "mean_token_accuracy": 0.8671249747276306,
      "num_tokens": 155648.0,
      "step": 19
    },
    {
      "entropy": 0.5879663228988647,
      "epoch": 5.0,
      "grad_norm": 1.8828738927841187,
      "learning_rate": 0.000239354895673865,
      "loss": 0.5823,
      "mean_token_accuracy": 0.8327223658561707,
      "num_tokens": 163840.0,
      "step": 20
    },
    {
      "entropy": 0.3616185784339905,
      "epoch": 5.25,
      "grad_norm": 1.1424044370651245,
      "learning_rate": 0.0002333355349529403,
      "loss": 0.2181,
      "mean_token_accuracy": 0.9427499771118164,
      "num_tokens": 172032.0,
      "step": 21
    },
    {
      "entropy": 0.41956228017807007,
      "epoch": 5.5,
      "grad_norm": 1.4396123886108398,
      "learning_rate": 0.00022711541162898321,
      "loss": 0.3146,
      "mean_token_accuracy": 0.9124827980995178,
      "num_tokens": 180224.0,
      "step": 22
    },
    {
      "entropy": 0.432304710149765,
      "epoch": 5.75,
      "grad_norm": 1.491745114326477,
      "learning_rate": 0.00022070951052389966,
      "loss": 0.3073,
      "mean_token_accuracy": 0.9125822186470032,
      "num_tokens": 188416.0,
      "step": 23
    },
    {
      "entropy": 0.5084620118141174,
      "epoch": 6.0,
      "grad_norm": 1.7158794403076172,
      "learning_rate": 0.0002141332640145423,
      "loss": 0.4501,
      "mean_token_accuracy": 0.8650408387184143,
      "num_tokens": 196608.0,
      "step": 24
    },
    {
      "entropy": 0.33116769790649414,
      "epoch": 6.25,
      "grad_norm": 1.2156429290771484,
      "learning_rate": 0.00020740251485476345,
      "loss": 0.2181,
      "mean_token_accuracy": 0.9379318952560425,
      "num_tokens": 204800.0,
      "step": 25
    },
    {
      "entropy": 0.31190094351768494,
      "epoch": 6.5,
      "grad_norm": 1.1018553972244263,
      "learning_rate": 0.00020053347800883298,
      "loss": 0.1677,
      "mean_token_accuracy": 0.950964629650116,
      "num_tokens": 212992.0,
      "step": 26
    },
    {
      "entropy": 0.317983478307724,
      "epoch": 6.75,
      "grad_norm": 1.120477557182312,
      "learning_rate": 0.0001935427015881693,
      "loss": 0.2173,
      "mean_token_accuracy": 0.9389505386352539,
      "num_tokens": 221184.0,
      "step": 27
    },
    {
      "entropy": 0.29574161767959595,
      "epoch": 7.0,
      "grad_norm": 1.3950735330581665,
      "learning_rate": 0.0001864470269854896,
      "loss": 0.1842,
      "mean_token_accuracy": 0.9490840435028076,
      "num_tokens": 229376.0,
      "step": 28
    },
    {
      "entropy": 0.2848804295063019,
      "epoch": 7.25,
      "grad_norm": 0.9792242050170898,
      "learning_rate": 0.00017926354830241924,
      "loss": 0.1337,
      "mean_token_accuracy": 0.9676337838172913,
      "num_tokens": 237568.0,
      "step": 29
    },
    {
      "entropy": 0.21760335564613342,
      "epoch": 7.5,
      "grad_norm": 1.1371413469314575,
      "learning_rate": 0.00017200957116830423,
      "loss": 0.1221,
      "mean_token_accuracy": 0.9651966094970703,
      "num_tokens": 245760.0,
      "step": 30
    },
    {
      "entropy": 0.17992466688156128,
      "epoch": 7.75,
      "grad_norm": 0.9602733850479126,
      "learning_rate": 0.0001647025710494341,
      "loss": 0.0888,
      "mean_token_accuracy": 0.975476861000061,
      "num_tokens": 253952.0,
      "step": 31
    },
    {
      "entropy": 0.24877575039863586,
      "epoch": 8.0,
      "grad_norm": 1.1444693803787231,
      "learning_rate": 0.0001573601511491127,
      "loss": 0.1495,
      "mean_token_accuracy": 0.9576800465583801,
      "num_tokens": 262144.0,
      "step": 32
    },
    {
      "entropy": 0.1768663376569748,
      "epoch": 8.25,
      "grad_norm": 0.8242481350898743,
      "learning_rate": 0.00015,
      "loss": 0.0828,
      "mean_token_accuracy": 0.9813156723976135,
      "num_tokens": 270336.0,
      "step": 33
    },
    {
      "entropy": 0.14627668261528015,
      "epoch": 8.5,
      "grad_norm": 0.7782649993896484,
      "learning_rate": 0.0001426398488508873,
      "loss": 0.0607,
      "mean_token_accuracy": 0.9849227070808411,
      "num_tokens": 278528.0,
      "step": 34
    },
    {
      "entropy": 0.1337864249944687,
      "epoch": 8.75,
      "grad_norm": 0.5131941437721252,
      "learning_rate": 0.0001352974289505659,
      "loss": 0.0359,
      "mean_token_accuracy": 0.9923518300056458,
      "num_tokens": 286720.0,
      "step": 35
    },
    {
      "entropy": 0.2399912178516388,
      "epoch": 9.0,
      "grad_norm": 0.9744300842285156,
      "learning_rate": 0.00012799042883169574,
      "loss": 0.1362,
      "mean_token_accuracy": 0.9642810821533203,
      "num_tokens": 294912.0,
      "step": 36
    },
    {
      "entropy": 0.0918688029050827,
      "epoch": 9.25,
      "grad_norm": 0.6197434663772583,
      "learning_rate": 0.00012073645169758076,
      "loss": 0.0392,
      "mean_token_accuracy": 0.9913245439529419,
      "num_tokens": 303104.0,
      "step": 37
    },
    {
      "entropy": 0.12489282339811325,
      "epoch": 9.5,
      "grad_norm": 0.4741184711456299,
      "learning_rate": 0.00011355297301451042,
      "loss": 0.0373,
      "mean_token_accuracy": 0.9934142827987671,
      "num_tokens": 311296.0,
      "step": 38
    },
    {
      "entropy": 0.13954313099384308,
      "epoch": 9.75,
      "grad_norm": 0.4734343886375427,
      "learning_rate": 0.00010645729841183066,
      "loss": 0.0452,
      "mean_token_accuracy": 0.9903808236122131,
      "num_tokens": 319488.0,
      "step": 39
    },
    {
      "entropy": 0.1761952042579651,
      "epoch": 10.0,
      "grad_norm": 0.5753357410430908,
      "learning_rate": 9.946652199116699e-05,
      "loss": 0.0721,
      "mean_token_accuracy": 0.9816750288009644,
      "num_tokens": 327680.0,
      "step": 40
    },
    {
      "entropy": 0.12840500473976135,
      "epoch": 10.25,
      "grad_norm": 0.3442269265651703,
      "learning_rate": 9.259748514523653e-05,
      "loss": 0.0554,
      "mean_token_accuracy": 0.9877277612686157,
      "num_tokens": 335872.0,
      "step": 41
    },
    {
      "entropy": 0.15350428223609924,
      "epoch": 10.5,
      "grad_norm": 0.34964317083358765,
      "learning_rate": 8.586673598545771e-05,
      "loss": 0.0309,
      "mean_token_accuracy": 0.9958372712135315,
      "num_tokens": 344064.0,
      "step": 42
    },
    {
      "entropy": 0.0905783548951149,
      "epoch": 10.75,
      "grad_norm": 0.3658287525177002,
      "learning_rate": 7.929048947610034e-05,
      "loss": 0.0238,
      "mean_token_accuracy": 0.9957314729690552,
      "num_tokens": 352256.0,
      "step": 43
    },
    {
      "entropy": 0.055884674191474915,
      "epoch": 11.0,
      "grad_norm": 0.3193635642528534,
      "learning_rate": 7.288458837101675e-05,
      "loss": 0.023,
      "mean_token_accuracy": 0.9956011772155762,
      "num_tokens": 360448.0,
      "step": 44
    },
    {
      "entropy": 0.09747447818517685,
      "epoch": 11.25,
      "grad_norm": 0.27020013332366943,
      "learning_rate": 6.66644650470597e-05,
      "loss": 0.0228,
      "mean_token_accuracy": 0.9966606497764587,
      "num_tokens": 368640.0,
      "step": 45
    },
    {
      "entropy": 0.09450258314609528,
      "epoch": 11.5,
      "grad_norm": 0.1876399666070938,
      "learning_rate": 6.064510432613499e-05,
      "loss": 0.015,
      "mean_token_accuracy": 0.9980137944221497,
      "num_tokens": 376832.0,
      "step": 46
    },
    {
      "entropy": 0.11891507357358932,
      "epoch": 11.75,
      "grad_norm": 0.43645092844963074,
      "learning_rate": 5.4841007375453186e-05,
      "loss": 0.0441,
      "mean_token_accuracy": 0.9901614785194397,
      "num_tokens": 385024.0,
      "step": 47
    },
    {
      "entropy": 0.041454315185546875,
      "epoch": 12.0,
      "grad_norm": 0.2318841516971588,
      "learning_rate": 4.926615677294723e-05,
      "loss": 0.0122,
      "mean_token_accuracy": 0.9977499842643738,
      "num_tokens": 393216.0,
      "step": 48
    },
    {
      "entropy": 0.09900425374507904,
      "epoch": 12.25,
      "grad_norm": 0.2141266018152237,
      "learning_rate": 4.3933982822017876e-05,
      "loss": 0.0183,
      "mean_token_accuracy": 0.9972920417785645,
      "num_tokens": 401408.0,
      "step": 49
    },
    {
      "entropy": 0.03853936493396759,
      "epoch": 12.5,
      "grad_norm": 0.13608407974243164,
      "learning_rate": 3.885733119675616e-05,
      "loss": 0.0097,
      "mean_token_accuracy": 0.9980000257492065,
|
||||||
|
"num_tokens": 409600.0,
|
||||||
|
"step": 50
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.09134722501039505,
|
||||||
|
"epoch": 12.75,
|
||||||
|
"grad_norm": 0.22313106060028076,
|
||||||
|
"learning_rate": 3.404843199558945e-05,
|
||||||
|
"loss": 0.0333,
|
||||||
|
"mean_token_accuracy": 0.9941705465316772,
|
||||||
|
"num_tokens": 417792.0,
|
||||||
|
"step": 51
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.08582671731710434,
|
||||||
|
"epoch": 13.0,
|
||||||
|
"grad_norm": 0.12559127807617188,
|
||||||
|
"learning_rate": 2.9518870277903274e-05,
|
||||||
|
"loss": 0.0094,
|
||||||
|
"mean_token_accuracy": 0.9985564351081848,
|
||||||
|
"num_tokens": 425984.0,
|
||||||
|
"step": 52
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.028270475566387177,
|
||||||
|
"epoch": 13.25,
|
||||||
|
"grad_norm": 0.10946992039680481,
|
||||||
|
"learning_rate": 2.5279558154618197e-05,
|
||||||
|
"loss": 0.0093,
|
||||||
|
"mean_token_accuracy": 0.9987781047821045,
|
||||||
|
"num_tokens": 434176.0,
|
||||||
|
"step": 53
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.14224599301815033,
|
||||||
|
"epoch": 13.5,
|
||||||
|
"grad_norm": 0.14812375605106354,
|
||||||
|
"learning_rate": 2.1340708499959197e-05,
|
||||||
|
"loss": 0.0153,
|
||||||
|
"mean_token_accuracy": 0.9980641603469849,
|
||||||
|
"num_tokens": 442368.0,
|
||||||
|
"step": 54
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.08890354633331299,
|
||||||
|
"epoch": 13.75,
|
||||||
|
"grad_norm": 0.18965214490890503,
|
||||||
|
"learning_rate": 1.7711810347746757e-05,
|
||||||
|
"loss": 0.0255,
|
||||||
|
"mean_token_accuracy": 0.9949755072593689,
|
||||||
|
"num_tokens": 450560.0,
|
||||||
|
"step": 55
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.03545539826154709,
|
||||||
|
"epoch": 14.0,
|
||||||
|
"grad_norm": 0.13480865955352783,
|
||||||
|
"learning_rate": 1.4401606031483497e-05,
|
||||||
|
"loss": 0.0107,
|
||||||
|
"mean_token_accuracy": 0.9982690215110779,
|
||||||
|
"num_tokens": 458752.0,
|
||||||
|
"step": 56
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.07514533400535583,
|
||||||
|
"epoch": 14.25,
|
||||||
|
"grad_norm": 0.15547268092632294,
|
||||||
|
"learning_rate": 1.1418070123306989e-05,
|
||||||
|
"loss": 0.0226,
|
||||||
|
"mean_token_accuracy": 0.9957281351089478,
|
||||||
|
"num_tokens": 466944.0,
|
||||||
|
"step": 57
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.08177103102207184,
|
||||||
|
"epoch": 14.5,
|
||||||
|
"grad_norm": 0.13583342730998993,
|
||||||
|
"learning_rate": 8.768390222546895e-06,
|
||||||
|
"loss": 0.0124,
|
||||||
|
"mean_token_accuracy": 0.9986894130706787,
|
||||||
|
"num_tokens": 475136.0,
|
||||||
|
"step": 58
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.053971339017152786,
|
||||||
|
"epoch": 14.75,
|
||||||
|
"grad_norm": 0.11598207801580429,
|
||||||
|
"learning_rate": 6.458949640168675e-06,
|
||||||
|
"loss": 0.0094,
|
||||||
|
"mean_token_accuracy": 0.9982340931892395,
|
||||||
|
"num_tokens": 483328.0,
|
||||||
|
"step": 59
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.07475290447473526,
|
||||||
|
"epoch": 15.0,
|
||||||
|
"grad_norm": 0.1316467970609665,
|
||||||
|
"learning_rate": 4.495312020818403e-06,
|
||||||
|
"loss": 0.0112,
|
||||||
|
"mean_token_accuracy": 0.9983223676681519,
|
||||||
|
"num_tokens": 491520.0,
|
||||||
|
"step": 60
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.05613658204674721,
|
||||||
|
"epoch": 15.25,
|
||||||
|
"grad_norm": 0.0949818417429924,
|
||||||
|
"learning_rate": 2.882207939515435e-06,
|
||||||
|
"loss": 0.0091,
|
||||||
|
"mean_token_accuracy": 0.9985954761505127,
|
||||||
|
"num_tokens": 499712.0,
|
||||||
|
"step": 61
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.06857230514287949,
|
||||||
|
"epoch": 15.5,
|
||||||
|
"grad_norm": 0.11168709397315979,
|
||||||
|
"learning_rate": 1.6235235052828476e-06,
|
||||||
|
"loss": 0.0113,
|
||||||
|
"mean_token_accuracy": 0.9984668493270874,
|
||||||
|
"num_tokens": 507904.0,
|
||||||
|
"step": 62
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.11523294448852539,
|
||||||
|
"epoch": 15.75,
|
||||||
|
"grad_norm": 0.17019794881343842,
|
||||||
|
"learning_rate": 7.222909991704773e-07,
|
||||||
|
"loss": 0.0262,
|
||||||
|
"mean_token_accuracy": 0.9955543875694275,
|
||||||
|
"num_tokens": 516096.0,
|
||||||
|
"step": 63
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"entropy": 0.042229048907756805,
|
||||||
|
"epoch": 16.0,
|
||||||
|
"grad_norm": 0.08190463483333588,
|
||||||
|
"learning_rate": 1.8068156922413924e-07,
|
||||||
|
"loss": 0.0073,
|
||||||
|
"mean_token_accuracy": 0.9988691806793213,
|
||||||
|
"num_tokens": 524288.0,
|
||||||
|
"step": 64
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"logging_steps": 1,
|
||||||
|
"max_steps": 64,
|
||||||
|
"num_input_tokens_seen": 0,
|
||||||
|
"num_train_epochs": 16,
|
||||||
|
"save_steps": 32,
|
||||||
|
"stateful_callbacks": {
|
||||||
|
"TrainerControl": {
|
||||||
|
"args": {
|
||||||
|
"should_epoch_stop": false,
|
||||||
|
"should_evaluate": false,
|
||||||
|
"should_log": false,
|
||||||
|
"should_save": true,
|
||||||
|
"should_training_stop": true
|
||||||
|
},
|
||||||
|
"attributes": {}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"total_flos": 2195391189614592.0,
|
||||||
|
"train_batch_size": 8,
|
||||||
|
"trial_name": null,
|
||||||
|
"trial_params": null
|
||||||
|
}
|
||||||
3
training_args.bin
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
version https://git-lfs.github.com/spec/v1
|
||||||
|
oid sha256:865a44a5945d719ed80188e66d111256af1160b18506b7fe933a2769e13f0ef3
|
||||||
|
size 6225
|
||||||