---
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/legal/lfm-license
base_model:
- LiquidAI/LFM2-2.6B
datasets:
- anakin87/tictactoe-filtered
library_name: transformers
tags:
- sft
- tictactoe
pipeline_tag: text-generation
language:
- en
---

# LFM2-2.6B-ttt-sft

Supervised Fine-Tuning checkpoint of [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B) for Tic Tac Toe.

The goal of this SFT warm-up was to teach the model the correct output format and valid move syntax, before applying Reinforcement Learning. The model is not a strong player at this stage.
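To make "correct output format" and "valid move syntax" concrete, here is a minimal sketch of the two checks. The actual format this model was trained on is not stated in this card, so the `<move>N</move>` tag, the 0-8 cell indexing, and the helper names `parse_move` and `is_valid_move` are all hypothetical, chosen only for illustration.

```python
import re

# ASSUMPTION: the move is a single cell index 0-8 wrapped in <move> tags.
# This format is hypothetical; the real one is not given in this card.
MOVE_PATTERN = re.compile(r"<move>\s*([0-8])\s*</move>")


def parse_move(completion):
    """Return the cell index if the text follows the assumed format, else None."""
    matches = MOVE_PATTERN.findall(completion)
    # Exactly one well-formed move tag is required.
    if len(matches) != 1:
        return None
    return int(matches[0])


def is_valid_move(board, cell):
    """A move is valid if it was parsed and targets an empty cell.

    `board` is a list of 9 strings: "X", "O", or " " for empty.
    """
    return cell is not None and 0 <= cell < 9 and board[cell] == " "


board = ["X", " ", " ",
         " ", "O", " ",
         " ", " ", " "]

move = parse_move("I will take the corner. <move>2</move>")
print(move)                        # 2
print(is_valid_move(board, move))  # True
print(is_valid_move(board, 0))     # False: cell 0 is already taken
```

The two checks are kept separate because a completion can follow the format and still name an occupied cell.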

This is an intermediate checkpoint from 🎓 **[LLM RL Environments Lil Course](https://github.com/anakin87/llm-rl-environments-lil-course)**, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is [anakin87/LFM2-2.6B-mr-tictactoe](https://huggingface.co/anakin87/LFM2-2.6B-mr-tictactoe).

🤗🕹️ **[Play against the final model](https://huggingface.co/spaces/anakin87/LFM2-2.6B-mr-tictactoe)**

## Training

- **Method:** SFT with [PRIME-RL](https://docs.primeintellect.ai/prime-rl)
- **Dataset:** [anakin87/tictactoe-filtered](https://huggingface.co/datasets/anakin87/tictactoe-filtered) (174 examples, ~5.5 epochs)
- **Steps:** 30, batch size 32, lr 1e-5, seq_len 700
- **Hardware:** NVIDIA RTX Pro 6000 96GB (~5 min)
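
The "~5.5 epochs" figure follows from the other numbers in the list. The short check below assumes that the batch size counts examples and that each step consumes exactly one batch; sequence packing or gradient accumulation would change the accounting.

```python
import math

num_examples = 174  # size of anakin87/tictactoe-filtered, as listed above
batch_size = 32     # ASSUMPTION: examples per optimizer step
steps = 30

samples_seen = steps * batch_size                       # 30 * 32 = 960
epochs = samples_seen / num_examples                    # 960 / 174
steps_per_epoch = math.ceil(num_examples / batch_size)  # batches needed for one pass

print(f"samples seen: {samples_seen}")        # samples seen: 960
print(f"epochs: {epochs:.2f}")                # epochs: 5.52
print(f"steps per epoch: {steps_per_epoch}")  # steps per epoch: 6
```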

## Evaluation

100 games per setting.

| **Model vs random opponent** | **% Wins** | **% Draws** | **% Losses** | **% Follows format** | **% Games with invalid moves** |
|------------------------------|------------|-------------|--------------|----------------------|--------------------------------|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| **anakin87/LFM2-2.6B-ttt-sft** | **74** | **13** | **13** | **99.8** | **11** |

| **Model vs optimal opponent** | **% Wins** | **% Draws** | **% Losses** | **% Follows format** | **% Games with invalid moves** |
|-------------------------------|------------|-------------|--------------|----------------------|--------------------------------|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| **anakin87/LFM2-2.6B-ttt-sft** | **0** | **52** | **48** | **99** | **14** |

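Format following jumped from under 30% to about 99% in both settings. Gameplay also improved as a side effect: wins against the random opponent rose from 40% to 74%, and draws against the optimal opponent rose from 11% to 52%.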
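
For readers who want to see what the two opponent types mean, here is a self-contained sketch of a random opponent, an optimal (minimax) opponent, and a 100-game tally. It is not the evaluation code used for the numbers above. The function names, the choice of who moves first, and the use of a random player as a stand-in for the model are all assumptions.

```python
import random
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if a line is complete, else None. `board` is a 9-char string."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]


def place(board, move, player):
    return board[:move] + player + board[move + 1:]


def other(player):
    return "O" if player == "X" else "X"


@lru_cache(maxsize=None)
def minimax(board, player):
    """Game value for `player`, who is to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    moves = legal_moves(board)
    if not moves:
        return 0
    # The opponent's best outcome, negated, is this player's outcome.
    return max(-minimax(place(board, m, player), other(player)) for m in moves)


def optimal_move(board, player, rng):
    """Pick uniformly among the moves with the best minimax value."""
    scored = [(-minimax(place(board, m, player), other(player)), m)
              for m in legal_moves(board)]
    best = max(score for score, _ in scored)
    return rng.choice([m for score, m in scored if score == best])


def random_move(board, player, rng):
    return rng.choice(legal_moves(board))


def play_game(x_policy, o_policy, rng):
    """Play one game with X moving first; return "X", "O", or "draw"."""
    board, player = " " * 9, "X"
    while winner(board) is None and legal_moves(board):
        policy = x_policy if player == "X" else o_policy
        board = place(board, policy(board, player, rng), player)
        player = other(player)
    return winner(board) or "draw"


def evaluate(x_policy, o_policy, games=100, seed=0):
    rng = random.Random(seed)
    tally = {"X": 0, "O": 0, "draw": 0}
    for _ in range(games):
        tally[play_game(x_policy, o_policy, rng)] += 1
    return tally


# A random player (stand-in for a model) as X against the optimal opponent as O.
print(evaluate(random_move, optimal_move))
# Two optimal players always draw.
print(evaluate(optimal_move, optimal_move))
```

Against an optimal opponent the best achievable result is a draw, which is why the win column in that setting is 0 for both models and the draw rate is the number to watch.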