---
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/legal/lfm-license
base_model:
- LiquidAI/LFM2-2.6B
datasets:
- anakin87/tictactoe-filtered
library_name: transformers
tags:
- sft
- tictactoe
pipeline_tag: text-generation
language:
- en
---
# LFM2-2.6B-ttt-sft

Supervised Fine-Tuning checkpoint of [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B) for Tic Tac Toe.

The goal of this SFT warm-up was to teach the model the correct output format and valid move syntax before applying Reinforcement Learning. The model is not a strong player at this stage.

This is an intermediate checkpoint from 🎓 **[LLM RL Environments Lil Course](https://github.com/anakin87/llm-rl-environments-lil-course)**, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is [anakin87/LFM2-2.6B-mr-tictactoe](https://huggingface.co/anakin87/LFM2-2.6B-mr-tictactoe).

🤗🕹️ **[Play against the final model](https://huggingface.co/spaces/anakin87/LFM2-2.6B-mr-tictactoe)**

## Training
- **Method:** SFT with [PRIME-RL](https://docs.primeintellect.ai/prime-rl)
- **Dataset:** [anakin87/tictactoe-filtered](https://huggingface.co/datasets/anakin87/tictactoe-filtered) (174 examples, ~5.5 epochs)
- **Steps:** 30, batch size 32, learning rate 1e-5, seq_len 700
- **Hardware:** NVIDIA RTX Pro 6000 96GB (~5 min)
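As a quick sanity check, the "~5.5 epochs" figure follows from the step count, batch size, and dataset size listed above. This sketch assumes that every step consumes exactly `batch_size` examples (no sequence packing and no dropped samples); that assumption is not stated in the training setup itself.

```python
# Sanity check of the "~5.5 epochs" figure from the training settings.
# Assumption: every step consumes exactly BATCH_SIZE examples.

STEPS = 30
BATCH_SIZE = 32
NUM_EXAMPLES = 174  # size of anakin87/tictactoe-filtered


def epochs_seen(steps: int, batch_size: int, num_examples: int) -> float:
    """Number of passes over the dataset implied by steps x batch size."""
    return steps * batch_size / num_examples


samples = STEPS * BATCH_SIZE
print(f"{samples} samples / {NUM_EXAMPLES} examples = "
      f"{epochs_seen(STEPS, BATCH_SIZE, NUM_EXAMPLES):.2f} epochs")
# prints "960 samples / 174 examples = 5.52 epochs"
```

With 960 samples drawn from 174 examples, the run makes roughly 5.5 passes over the data, which is consistent with the epoch count reported for the dataset.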
## Evaluation

Each setting was evaluated over 100 games.

| **Model vs random opponent** | **% Wins** | **% Draws** | **% Losses** | **% Follows format** | **% Games with invalid moves** |
|------------------------------|------------|-------------|--------------|----------------------|--------------------------------|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| **anakin87/LFM2-2.6B-ttt-sft** | **74** | **13** | **13** | **99.8** | **11** |

| **Model vs optimal opponent** | **% Wins** | **% Draws** | **% Losses** | **% Follows format** | **% Games with invalid moves** |
|-------------------------------|------------|-------------|--------------|----------------------|--------------------------------|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| **anakin87/LFM2-2.6B-ttt-sft** | **0** | **52** | **48** | **99** | **14** |

Format following jumped from under 30% (27.8% and 24.7% for the base model) to 99% or higher. Gameplay also improved as a side effect: against the random opponent, wins rose from 40% to 74%, and against the optimal opponent, draws rose from 11% to 52%.
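The optimal-opponent setting explains why the wins column is 0 for both models: Tic Tac Toe is a draw under perfect play, so against an opponent that never makes a mistake, a draw is the best achievable result. Below is a minimal sketch of one common way to build such an opponent, a plain minimax search. It is an illustrative assumption, not necessarily the opponent used in the course, and the board encoding (a 9-character string of `"X"`, `"O"`, or `" "`) and the helper names `winner`, `minimax`, and `best_move` are hypothetical.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical board encoding: 9-character string, row-major,
# each cell "X", "O", or " " (empty).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: str) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player: str) -> str:
    return "O" if player == "X" else "X"


@lru_cache(maxsize=None)
def minimax(board: str, player: str) -> int:
    """Game value for `player`, who is to move, under perfect play:
    +1 = forced win, 0 = draw, -1 = forced loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    if " " not in board:
        return 0  # full board with no winner: draw
    best = -2
    for i, cell in enumerate(board):
        if cell == " ":
            child = board[:i] + player + board[i + 1:]
            # The opponent moves next; their best outcome is our worst.
            best = max(best, -minimax(child, other(player)))
    return best


def best_move(board: str, player: str) -> int:
    """Index of an optimal move for `player` (first one found on ties)."""
    moves = [i for i, cell in enumerate(board) if cell == " "]
    return max(
        moves,
        key=lambda i: -minimax(board[:i] + player + board[i + 1:], other(player)),
    )


print(minimax(" " * 9, "X"))         # 0: perfect play ends in a draw
print(best_move("XX OO    ", "X"))   # 2: completes the top row
```

Memoizing `minimax` with `lru_cache` keeps the search cheap, since Tic Tac Toe has only a few thousand reachable positions. Because the empty board has value 0, any truly optimal opponent can be held to a draw but never beaten.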