Initialize project; model provided by the ModelHub XC community
Model: allura-org/Koto-Small-7B-PT Source: Original Platform
---
license: mit
language:
- en
base_model:
- XiaomiMiMo/MiMo-7B-Base
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Pretrained)



Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

**Please check out [Aurore-Reveil/Koto-Small-7B-IT](https://huggingface.co/Aurore-Reveil/Koto-Small-7B-IT), the official RP and instruct tune!**

## Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will *not* work. Multi-turn roleplay will *not* work.

It was trained at 32k context, but since not all samples were that long, we expect ~16k of effective context in the best case.

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
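For readers unfamiliar with min_p: it keeps only tokens whose probability is at least `min_p` times the most likely token's probability, then renormalizes. A minimal numpy sketch of that filtering step (illustrative only; use your backend's built-in min_p sampler in practice):

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# With min_p = 0.05, only tokens with at least 5% of the top
# token's probability survive.
probs = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
print(min_p_filter(probs))  # the 0.01 token is dropped
```

Because the cutoff scales with the top token's probability, a confident distribution prunes aggressively while a flat one keeps more candidates, which pairs well with the high 1.25 temperature.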

## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- thank you to [unk] for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing and ideas
- thanks to toasty for some data and ideas
- thanks to everyone else in allura for moral support

ilya <3

## Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher-quality data, etc.), please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3
## Technical Appendix
|
||||
<details>
|
||||
|
||||
### Training Notes
|
||||
This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and the Cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.
|
||||
Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE and Liger which let the train go much faster than it would have otherwise.
|
||||
|
||||
We decided to keep the final model in the converted Qwen 2 format, as it is more supported by community software such as EXL2, EXL3, Aphrodite, etc, as well as the original architecture's MTP weights likely being much less effective after finetuning without them.
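The conversion amounts to rewriting a handful of `config.json` fields and dropping the MTP-specific entries. A hypothetical sketch of that edit (the exact MiMo key names are an assumption here; check the actual config of XiaomiMiMo/MiMo-7B-Base):

```python
import json

def qwenify(config: dict) -> dict:
    """Sketch of the MiMo -> Qwen 2 config conversion described above.

    The MTP-related key names below are hypothetical, not confirmed
    against the real MiMo config.json.
    """
    converted = dict(config)
    # Point the config at the Qwen 2 architecture.
    converted["architectures"] = ["Qwen2ForCausalLM"]
    converted["model_type"] = "qwen2"
    # Drop multi-token-prediction fields and custom-code hooks
    # (hypothetical key names).
    for key in ("num_nextn_predict_layers", "auto_map"):
        converted.pop(key, None)
    return converted

original = {"architectures": ["MiMoForCausalLM"], "model_type": "mimo",
            "num_nextn_predict_layers": 1, "hidden_size": 4096}
print(json.dumps(qwenify(original), indent=2))
```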

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)


### Finetuning Notes

Xiaomi already added the ChatML tokens to this model. Please use the ChatML format when finetuning to ensure compatibility with the rest of the ecosystem.
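ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` special tokens. A minimal sketch of the rendering (in practice `tokenizer.apply_chat_template` handles this; the helper below is just illustrative):

```python
def to_chatml(messages: list[dict]) -> str:
    """Render a message list in ChatML and open an assistant turn."""
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
    # Trailing open tag cues the model to generate the assistant reply.
    return rendered + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a cowriter."},
    {"role": "user", "content": "Continue the story."},
])
print(prompt)
```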

### Axolotl Config

```yaml
## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true
## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config:
```
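For reference, the effective global batch size implied by the config above, assuming a standard 8-GPU A100 node (the GPU count is an assumption; the card only says "an A100 node"):

```python
# Effective batch = micro_batch_size * gradient_accumulation_steps * num_gpus
micro_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 8  # assumption: 8x A100 node

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * 32768  # sequence_len from the config

print(sequences_per_step)  # 128
print(f"{tokens_per_step / 1e6:.1f}M tokens per optimizer step")  # 4.2M
```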
</details>