---
license: mit
language:
- en
base_model:
- allura-org/Koto-Small-7B-PT
library_name: transformers
tags:
- writing
- creative-writing
- roleplay
---

# Koto Small 7B (Instruct-Tuned)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/LxvRc7UmPVWhFkfwqMUhZ.png)

Koto-Small-7B-IT is an instruct-tuned version of [Koto-Small-7B-PT](https://huggingface.co/allura-org/Koto-Small-7B-PT), which was trained from MiMo-7B-Base on almost a billion tokens of creative-writing data. This model is intended for roleplay and instruct use cases.

## Usage

### Chat template

The model was trained with ChatML formatting. A typical input looks like this:

```
<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
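
If the tokenizer ships with a ChatML chat template, `apply_chat_template` will build this prompt for you. A minimal sketch with `transformers` (the repo id below is an assumption; point it at wherever you have the weights):

```python
# Minimal usage sketch; the repo id is assumed, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Koto-Small-7B-IT"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
]

# Builds the ChatML prompt shown above and appends the
# `<|im_start|>assistant` generation prompt.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
```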

## Samplers

We found that a temperature of 1.25 and a min_p of 0.05 worked best, but YMMV!
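
For example, continuing from the snippet above with the `transformers` generate API (`min_p` needs a reasonably recent `transformers` release):

```python
# Sample with the suggested settings; min_p requires a recent transformers version.
output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.25,
    min_p=0.05,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```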

## Datasets

```yaml
datasets:
- path: Delta-Vector/Hydrus-General-Reasoning
- path: Delta-Vector/Hydrus-IF-Mix-Ai2
- path: Delta-Vector/Hydrus-Army-Inst
- path: Delta-Vector/Hydrus-AM-thinking-Science
- path: Delta-Vector/Hydrus-AM-Thinking-Code-Filtered
- path: Delta-Vector/Hydrus-AM-Thinking-IF-No-Think
- path: Delta-Vector/Hydrus-Tulu-SFT-Mix-V2
- path: Delta-Vector/Hydrus-System-Chat-2.0
- path: Delta-Vector/Orion-Praxis-Co-Writer
- path: Delta-Vector/Orion-Co-Writer-51K
- path: Delta-Vector/Orion-Creative_Writing-Complexity
- path: Delta-Vector/Orion-vanilla-backrooms-claude-sharegpt
- path: Delta-Vector/Hydrus-AM-Thinking-Multi-Turn
- path: PocketDoc/Dans-Failuremaxx-Adventure
- path: PocketDoc/Dans-Logicmaxx-SAT-AP
- path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
- path: PocketDoc/Dans-Taskmaxx-DataPrepper
- path: PocketDoc/Dans-Prosemaxx-Instructwriter-Long
- path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-2
- path: PocketDoc/Dans-Prosemaxx-InstructWriter-ZeroShot-3
- path: PocketDoc/Dans-Prosemaxx-InstructWriter-Continue-2
- path: PocketDoc/Dans-Systemmaxx
```
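
If you want to poke at the mix yourself, the individual datasets can be pulled from the Hub. A minimal sketch, assuming the repos are public and expose a default `train` split:

```python
# Hedged sketch: assumes these dataset repos are public with a "train" split.
from datasets import load_dataset

hydrus = load_dataset("Delta-Vector/Hydrus-System-Chat-2.0", split="train")
prose = load_dataset("PocketDoc/Dans-Prosemaxx-Instructwriter-Long", split="train")
print(hydrus)
print(prose)
```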

## Acknowledgements

- Thank you very much to [Delta-Vector](https://huggingface.co/Delta-Vector)/[Mango](https://x.com/MangoSweet78) for providing the compute used to train this model.
- Fizz for the pretrain.
- Pocketdoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!

ilya <3

## Call for Help

If you would like to help build on this model (RP SFT, further annealing on higher-quality data, etc.)...

Please join [the allura discord](https://discord.gg/PPBMhF2vgC) or [the matrix](https://matrix.to/#/#allura:allura.moe)! <3

## Technical Appendix

<details>

### Training Notes

As before, it was trained over the course of 12 hours for 2 epochs on an 8xA100 DGX node, using AdEMAMix and a REX LR scheduler. Aggressive gradient clipping was used for regularization, with no weight decay (because it sucks).
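
For reference, the global batch size those settings work out to (values taken from the Axolotl config below; the 8 GPUs come from the node above):

```python
# Effective batch size implied by the config below (8 GPUs assumed).
micro_batch_size = 6
gradient_accumulation_steps = 2
num_gpus = 8
sequence_len = 16000

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * sequence_len

print(sequences_per_step)  # 96 packed sequences per optimizer step
print(tokens_per_step)     # 1,536,000 tokens per optimizer step
```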

### [WandB](https://wandb.ai/new-eden/Koto-Small/runs/fgln5fjh?nw=nwuserdeltavector)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/6G3sp_fa2pWnt_tKLYbCf.png)

### Axolotl Config

```yaml
# =============================================================================
# Model + Saving
# =============================================================================
base_model: allura-forge/Koto-Small-7b-rc1
output_dir: ./koto-sft
saves_per_epoch: 2
deepcompile: true

# =============================================================================
# DATASET CONFIGURATION
# =============================================================================
datasets:
- path: /home/Ubuntu/Mango/pretok/test-koto-sft-7b-rc-1.parquet
  ds_type: parquet
  type:

shuffle_merged_datasets: true
dataset_prepared_path: ./dataset_prepared
train_on_inputs: false

# =============================================================================
# EVALUATION SETTINGS
# =============================================================================
#evals_per_epoch: 4
#eval_table_size:
#eval_max_new_tokens: 128
#eval_sample_packing: false
val_set_size: 0.0

# =============================================================================
# MEMORY OPTIMIZATION
# =============================================================================
plugins:
- axolotl.integrations.liger.LigerPlugin
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
flash_attention: true

# =============================================================================
# MULTI-GPU TRAINING
# =============================================================================
deepspeed: ./deepspeed_configs/zero2.json

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
wandb_project: Koto-Small
wandb_entity:
wandb_watch:
wandb_name: sft
wandb_log_model:
logging_steps: 1
debug: false

# =============================================================================
# TRAINING PARAMETERS
# =============================================================================
micro_batch_size: 6
gradient_accumulation_steps: 2
num_epochs: 2
sequence_len: 16000
optimizer: paged_ademamix_8bit
lr_scheduler: rex
learning_rate: 8e-6
warmup_ratio: 0.1
max_grad_norm: 0.0001
weight_decay: 0.0

# =============================================================================
# ADDITIONAL SETTINGS
# =============================================================================
local_rank:
group_by_length: false
early_stopping_patience:
save_safetensors: true
bf16: auto
special_tokens:
```

</details>