Initialize project; model provided by the ModelHub XC community
Model: Aurore-Reveil/Koto-Small-7B-IT Source: Original Platform
---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)



Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is intended for roleplay and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting; a typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
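
Tokenizers shipped with ChatML-formatted models usually expose `apply_chat_template`, but the format is simple enough to assemble by hand. A dependency-free sketch (the helper name `to_chatml` is our own, not part of any library):

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts into the ChatML format above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
])
print(prompt)
```

The trailing open `<|im_start|>assistant` turn is what cues the model to respond.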

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
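
For reference, min_p keeps only tokens whose probability is at least `min_p` times the top token's probability, applied after temperature scaling. A minimal sketch of both steps in plain Python (a generic illustration, not the sampler of any particular inference engine):

```python
import math

def apply_temperature(logits, temperature=1.25):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def min_p_filter(probs, min_p=0.05):
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = apply_temperature([2.0, 1.0, -3.0], temperature=1.25)
filtered = min_p_filter(probs, min_p=0.05)  # the low-probability tail token is dropped
```

Higher temperatures flatten the distribution, which is why pairing 1.25 with a min_p floor helps keep degenerate tail tokens out.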

## Datasets

```yaml
datasets:
  - path: Delta-Vector/Hydrus-General-Reasoning
  - path: Delta-Vector/Hydrus-IF-Mix-Ai2
  - path: Delta-Vector/Hydrus-Army-Inst
  - path: Delta-Vector/Hydrus-AM-thinking-Science
  - path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
  - path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
  - path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
  - path: Delta-Vector/Hydrus-System-Chat-2.0
  - path: Delta-Vector/Orion-Praxis-Co-Writer
  - path: Delta-Vector/Orion-Co-Writer-51K
  - path: Delta-Vector/Orion-Creative_Writing-Complexity
  - path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
  - path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
  - path: PocketDoc/Dans-Failuremaxx-Adventure
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
  - path: PocketDoc/Dans-Taskmaxx-DataPrepper
  - path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
  - path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
  - path: PocketDoc/Dans-Systemmaxx
```

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- PocketDoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!

ilya <3

## Call for Help

If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.), please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix

<details>

### Training Notes

As before, the model was trained over the course of 12 hours for 2 epochs on an 8xA100 DGX node, using the AdEMAMix optimizer and a REX LR scheduler. Aggressive grad clipping was used for regularization, with NO weight decay (because it sucks).

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)


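A `max_grad_norm` of 0.0001 (see the config) means essentially every update gets rescaled down to a fixed global norm, which is what makes it act like a regularizer. A minimal sketch of global-norm clipping (a generic illustration, not Axolotl's implementation):

```python
import math

def clip_grad_norm(grads, max_norm=0.0001):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# The global norm of [3, 4] is 5, far above 0.0001, so the whole
# gradient vector is rescaled down to norm 0.0001.
clipped = clip_grad_norm([3.0, 4.0])
```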

### Axolotl Config

```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
  - path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
    ds_type: parquet
    type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```

</details>
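The batch-size settings in the config combine with the 8-GPU node from the training notes; a quick sketch of the resulting effective batch size and warmup length (`total_steps` here is hypothetical, as the real count depends on the dataset size):

```python
micro_batch_size = 6
gradient_accumulation_steps = 2
num_gpus = 8  # 8xA100 DGX node, per the training notes

# Sequences (each packed up to sequence_len = 16000 tokens) per optimizer step
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 96

# warmup_ratio: 0.1 means the LR ramps up over the first 10% of steps
warmup_ratio = 0.1
total_steps = 1000  # hypothetical, for illustration only
warmup_steps = int(total_steps * warmup_ratio)
print(warmup_steps)  # 100
```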