Initialize project; model provided by the ModelHub XC community
Model: allura-org/Koto-Small-7B-PT Source: Original Platform
---
license: mit
language:
- en
base_model:
- XiaomiMiMo/MiMo-7B-Base
library_name: transformers
tags:
- writing
- creative-writing
---

# Koto Small 7B (Pretrained)



Koto-Small-7B-PT is a version of MiMo-7B-Base trained on almost a billion tokens of creative writing data.

**Please check out [Aurore-Reveil/Koto-Small-7B-IT](https://huggingface.co/Aurore-Reveil/Koto-Small-7B-IT), the official RP and instruct tune!**

## Usage

This model is not intended for use outside of raw text completion settings, such as cowriting. Instruct will *not* work. Multi-turn roleplay will *not* work.

It was trained at 32k context, but since not all samples were that long, we expect ~16k of effective context in the best case.

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
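For readers unfamiliar with min_p: it keeps only tokens whose probability is at least `min_p` times the most likely token's probability, then renormalizes. A minimal numpy sketch of that filtering step (illustrative only; use your backend's built-in min_p sampler in practice):

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# With min_p = 0.05, only tokens with at least 5% of the top
# token's probability survive.
probs = np.array([0.6, 0.25, 0.1, 0.04, 0.01])
print(min_p_filter(probs))  # the 0.01 token is dropped
```

Because the cutoff scales with the top token's probability, a confident distribution prunes aggressively while a flat one keeps more candidates, which pairs well with the high 1.25 temperature.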

## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- thank you to [unk] for drawing the art used in the model card!
- thank you very much to [mango/deltavector](https://huggingface.co/Delta-Vector) for providing the compute used to train this model
- thanks to curse for testing and ideas
- thanks to toasty for some data and ideas
- thanks to everyone else in allura for moral support

ilya <3

## Call for Help

if you would like to help build on this model (instruct/RP SFT, further annealing on higher-quality data, etc.), please join [our discord](https://discord.gg/PPBMhF2vgC) or [our matrix](https://matrix.to/#/#allura:allura.moe)! <3
## Technical Appendix
|
||||
<details>
|
||||
|
||||
### Training Notes
|
||||
This model was trained over the course of ~18 hours on an A100 node. We used 8-bit AdamW and the Cosine LR scheduler, as well as both gradient clipping and weight decay for regularization.
|
||||
Before training, we [converted the original model to the Qwen 2 architecture](https://huggingface.co/allura-forge/MiMo-7B-Base-Qwenified) by removing the MTP weights and custom modelling code, and slightly modifying the `config.json`. This allowed us to use CCE and Liger which let the train go much faster than it would have otherwise.
|
||||
|
||||
We decided to keep the final model in the converted Qwen 2 format, as it is more supported by community software such as EXL2, EXL3, Aphrodite, etc, as well as the original architecture's MTP weights likely being much less effective after finetuning without them.
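The conversion amounts to rewriting a handful of `config.json` fields and dropping the MTP-specific entries. A hypothetical sketch of that edit (the exact MiMo key names are an assumption here; check the actual config of XiaomiMiMo/MiMo-7B-Base):

```python
import json

def qwenify(config: dict) -> dict:
    """Sketch of the MiMo -> Qwen 2 config conversion described above.

    The MTP-related key names below are hypothetical, not confirmed
    against the real MiMo config.json.
    """
    converted = dict(config)
    # Point the config at the Qwen 2 architecture.
    converted["architectures"] = ["Qwen2ForCausalLM"]
    converted["model_type"] = "qwen2"
    # Drop multi-token-prediction fields and custom-code hooks
    # (hypothetical key names).
    for key in ("num_nextn_predict_layers", "auto_map"):
        converted.pop(key, None)
    return converted

original = {"architectures": ["MiMoForCausalLM"], "model_type": "mimo",
            "num_nextn_predict_layers": 1, "hidden_size": 4096}
print(json.dumps(qwenify(original), indent=2))
```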

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/zk8t6oq6/workspace)


### Finetuning Notes

Xiaomi already added the ChatML tokens to this model. Please use the ChatML format when finetuning to ensure compatibility with the rest of the ecosystem.
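ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` special tokens. A minimal sketch of the rendering (in practice `tokenizer.apply_chat_template` handles this; the helper below is just illustrative):

```python
def to_chatml(messages: list[dict]) -> str:
    """Render a message list in ChatML and open an assistant turn."""
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
    # Trailing open tag cues the model to generate the assistant reply.
    return rendered + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a cowriter."},
    {"role": "user", "content": "Continue the story."},
])
print(prompt)
```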

### Axolotl Config

```yaml
## model
base_model: allura-forge/MiMo-7B-Base-Qwenified
trust_remote_code: true
## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false
strict: false

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./MiMo-Pretrain

## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

## CTX settings
sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

## max grad norm
max_grad_norm: 1.0

## WandB
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: MiMo-7b_1e-5_adamw-8bit
wandb_log_model:

## hoe params
gradient_accumulation_steps: 4 # ???
micro_batch_size: 4
num_epochs: 1
lr_scheduler: cosine
learning_rate: 1e-5
optimizer: adamw_bnb_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
deepcompile: true
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 50
saves_per_epoch: 2
debug:
deepspeed: ./deepspeed_configs/zero2.json
weight_decay: 0.0025
fsdp:
fsdp_config:
```
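For reference, the effective global batch size implied by the config above, assuming a standard 8-GPU A100 node (the GPU count is an assumption; the card only says "an A100 node"):

```python
# Effective batch = micro_batch_size * gradient_accumulation_steps * num_gpus
micro_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 8  # assumption: 8x A100 node

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * 32768  # sequence_len from the config

print(sequences_per_step)  # 128
print(f"{tokens_per_step / 1e6:.1f}M tokens per optimizer step")  # 4.2M
```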
</details>