---
license: mit
language:
- en
base_model:
- XiaomiMiMo/MiMo-7B-Base
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Pretrained)



Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

**Please check out [Aurore-Reveil/Koto-Small-7B-IT](https://huggingface.co/Aurore-Reveil/Koto-Small-7B-IT), it's the official RP and instruct tune!**

## Usage

This model is not intended for use outside of raw text completion settings, such as co-writing. Instruct will *not* work. Multi-turn roleplay will *not* work.

It was trained at a sequence length of 32k, but since not all samples were that long, expect roughly ~16k of effective context in the best case.

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
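
A minimal sketch of raw completion with `transformers` is below. The repo id, prompt, and `max_new_tokens` are illustrative placeholders; only the temperature and min_p values come from our testing above, and `min_p` needs a reasonably recent `transformers` release:

```python
# Raw text completion sketch: no chat template, no instruct formatting.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Aurore-Reveil/Koto-Small-7B-PT"  # assumed repo id, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Plain prose prompt; the model simply continues the text.
prompt = "The lighthouse keeper had not spoken to another soul in three winters."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,   # illustrative value
    do_sample=True,
    temperature=1.25,     # recommended above
    min_p=0.05,           # recommended above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```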

## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- thank you to [unk] for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing and ideas
- thanks to toasty for some data and ideas
- thanks to everyone else in allura for moral support

ilya <3

## Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher quality data, etc)...

please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix

<details>

### Training Notes

This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and a cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.

Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE and Liger, which made training much faster than it would have been otherwise.
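
A minimal sketch of the `config.json` side of that conversion, assuming a local snapshot of the repo; the exact fields shown are illustrative rather than the precise script used, and the MTP weights themselves also have to be dropped from the checkpoint files:

```python
# Hypothetical sketch: re-tag a local MiMo snapshot as a stock Qwen 2 model.
import json
from pathlib import Path

repo = Path("./MiMo-7B-Base")            # local snapshot of XiaomiMiMo/MiMo-7B-Base
config_path = repo / "config.json"
config = json.loads(config_path.read_text())

config["model_type"] = "qwen2"                      # use the built-in Qwen 2 architecture
config["architectures"] = ["Qwen2ForCausalLM"]
config.pop("auto_map", None)                        # drop the custom modelling-code mapping
for key in [k for k in config if "mtp" in k.lower()]:
    config.pop(key)                                 # assumption: MTP settings carry "mtp" in their names

config_path.write_text(json.dumps(config, indent=2))
```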

We decided to keep the final model in the converted Qwen 2 format: it is better supported by community software such as EXL2, EXL3, and Aphrodite, and the original architecture's MTP weights would likely have been much less effective after finetuning without them anyway.

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)



### Finetuning Notes

This model already has the ChatML special tokens added by Xiaomi. Please use the ChatML format when finetuning to ensure compatibility with the rest of the ecosystem.
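
For reference, a single ChatML-formatted sample looks roughly like this; the special tokens are standard ChatML, and the message contents are just placeholders:

```python
# Illustrative ChatML-formatted training sample.
sample = (
    "<|im_start|>system\n"
    "You are a co-writer continuing the user's story.<|im_end|>\n"
    "<|im_start|>user\n"
    "Continue: The lighthouse keeper had not spoken to another soul in three winters.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "He told himself the gulls counted.<|im_end|>\n"
)
print(sample)
```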

### Axolotl Config

```yaml
## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true

## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config:
```

</details>