---
license: apache-2.0
base_model: Qwen/Qwen3-4B
library_name: transformers
pipeline_tag: text-generation
tags:
- unsloth
- trl
- sft
- qwen3
- pokemon-showdown
- game-ai
- rocm
- amd
language:
- en
---
# Pokemon Showdown Agent v6
`Pokemon Showdown Agent v6` is a `Qwen/Qwen3-4B` fine-tune for **next-action prediction from raw Pokemon Showdown replay logs**. Given a battle-log prefix and the side it controls, the model is trained to emit a short action command such as `move Earthquake` or `switch Corviknight`.
This release is the merged checkpoint from the `v6` pipeline built with **Unsloth + TRL + AMD ROCm**. The tutorial version of the workflow uses a much smaller streamed subset for fast reproduction; this model is the larger production-oriented artifact.
## What makes v6 different
- It learns directly from messy raw replay logs instead of hand-written state summaries.
- It targets a strict action format suitable for agent pipelines: `move [move-name]` or `switch [pokemon-name]` (see the parsing sketch after the Quick start).
- It was developed around AMD ROCm workflows, with `bfloat16` recommended for stable inference.
## Official notebook
Use the cleaned release notebook `pokemon_showdown_agent_v6_release.ipynb` for the reproducible tutorial flow.
## Intended use
Use this model when you want to:
- Predict the next action from a raw Pokemon Showdown log prefix.
- Build a text-only battle agent or evaluation harness.
- Study agent alignment from real replay trajectories.
This model is **not** a full simulator policy by itself. For ladder play or automated battle loops, you still need legality checks, environment wrappers, and battle-state management outside the model.
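To make that concrete, here is a minimal legality-wrapper sketch. Everything in it (the function name, the fallback policy, the source of `legal_actions`) is a hypothetical illustration, not part of this release:
```python
# Hypothetical legality-check sketch (names are not part of this repo).
# The model proposes an action; the caller only executes it if it appears
# in the set of currently legal options from its own battle-state tracker.

def choose_action(model_output: str, legal_actions: list[str]) -> str:
    """Return the model's proposal if legal, otherwise fall back."""
    proposal = model_output.strip().lower()
    for action in legal_actions:
        if proposal == action.lower():
            return action
    # The fallback policy is yours to define: first legal option,
    # a heuristic, or a re-prompt that lists the legal options.
    return legal_actions[0]

legal = ["move Body Press", "move Iron Head", "switch Garchomp"]
print(choose_action("move body press", legal))  # -> "move Body Press"
```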
## Prompt format
The model expects a chat-style prompt with:
- A `system` message specifying which side the model is playing as.
- A `user` message containing the raw replay log prefix up to the current turn marker.
Recommended system prompt:
```text
You are a Pokemon Showdown battle AI. You play as {side}. Given the battle log, output your next action. Format: move <name> OR switch <name>. Append terastallize if you terastallize this turn.
```
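One way to produce the user message is to cut a full replay log at the turn you want the model to act on. A small sketch, assuming the standard `|turn|N` protocol lines mark turn boundaries:
```python
def log_prefix_up_to_turn(raw_log: str, turn: int) -> str:
    """Keep everything up to and including the |turn|<turn> marker line."""
    lines = raw_log.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == f"|turn|{turn}":
            return "\n".join(lines[: i + 1])
    return raw_log  # marker not found: fall back to the full log
```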
## Quick start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "GoldenGrapeGentleman1/pokemon-showdown-agent-v6"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a Pokemon Showdown battle AI. You play as p2. "
            "Given the battle log, output your next action. "
            "Format: move <name> OR switch <name>. "
            "Append terastallize if you terastallize this turn."
        ),
    },
    {
        "role": "user",
        "content": (
            "|player|p1|Player1|266|1500\n"
            "|player|p2|Player2|1|1500\n"
            "|teamsize|p1|6\n"
            "|teamsize|p2|6\n"
            "|gen|9\n"
            "|tier|[Gen 9] OU\n"
            "|\n"
            "|start\n"
            "|switch|p1a: Garchomp|Garchomp, M|100/100\n"
            "|switch|p2a: Corviknight|Corviknight, M|100/100\n"
            "|turn|1\n"
            "|move|p1a: Garchomp|Earthquake|p2a: Corviknight\n"
            "|-immune|p2a: Corviknight\n"
            "|turn|2"
        ),
    },
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
        pad_token_id=tokenizer.eos_token_id,
    )

decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
response = decoded.split("<|im_start|>assistant\n")[-1].replace("<|im_end|>", "").strip()

# Qwen3 may emit a <think>...</think> block before the action; strip it.
if response.startswith("<think>"):
    response = response.split("</think>", 1)[-1].strip()
print(response)
```
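Since downstream pipelines depend on the strict action format, it is worth validating the decoded response before acting on it. A parsing sketch (the regex below is ours, inferred from the documented grammar, not shipped with the model):
```python
import re

# Mirrors the documented grammar: "move <name>" or "switch <name>",
# with an optional trailing "terastallize".
ACTION_RE = re.compile(r"^(move|switch)\s+(.+?)(\s+terastallize)?$", re.IGNORECASE)

def parse_action(text: str):
    """Return (kind, name, terastallize) or None if the output is malformed."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    kind, name, tera = match.groups()
    return kind.lower(), name.strip(), tera is not None

print(parse_action("move Earthquake"))               # ('move', 'Earthquake', False)
print(parse_action("move Tera Blast terastallize"))  # ('move', 'Tera Blast', True)
```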
## Training data
The full `v6` training set was produced by preprocessing the public dataset:
- Source dataset: [`milkkarten/pokemon-showdown-replays-merged`](https://huggingface.co/datasets/milkkarten/pokemon-showdown-replays-merged)
Project preprocessing summary:
- `100,000` train games
- `10,000` test games
- `2,330,115` train samples
- `236,349` test samples
- `min_rating = 1200`
- `max_chars = 12000`
The companion tutorial notebook uses a smaller streamed subset with a higher rating filter so readers can reproduce the workflow quickly without downloading the full corpus.
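For orientation, the filter step could look roughly like the sketch below. The column names `rating` and `log` are assumptions for illustration; check the dataset's actual schema (and the release notebook) before running:
```python
from datasets import load_dataset

ds = load_dataset(
    "milkkarten/pokemon-showdown-replays-merged",
    split="train",
    streaming=True,  # stream instead of downloading the full corpus
)

# Hypothetical field names: adjust to the dataset's real schema.
filtered = ds.filter(
    lambda ex: ex["rating"] is not None
    and ex["rating"] >= 1200     # min_rating
    and len(ex["log"]) <= 12000  # max_chars
)

for example in filtered.take(3):
    print(example["rating"], len(example["log"]))
```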
## Training recipe
- Base model: `Qwen/Qwen3-4B`
- Fine-tuning method: LoRA SFT with Unsloth
- LoRA rank / alpha: `64 / 128`
- Full training context length: up to `4096`
- Frameworks: Unsloth, TRL, Transformers, Datasets
- Deployment recommendation on AMD: prefer `bfloat16` inference for stability
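A configuration sketch matching the listed rank, alpha, and context length. The target modules and everything else here are assumptions; the release notebook holds the authoritative settings:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",
    max_seq_length=4096,  # full training context length
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,             # LoRA rank
    lora_alpha=128,   # LoRA alpha
    target_modules=[  # assumed: standard attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```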
## Limitations
- This is a research checkpoint, not a complete battle engine.
- The model can still produce illegal or strategically weak actions.
- Prompt wording matters; changing the system format can reduce output reliability.
- Included evaluation artifacts are sanity checks, not a full competitive benchmark.
## Acknowledgements
- [Unsloth](https://github.com/unslothai/unsloth)
- [TRL](https://github.com/huggingface/trl)
- [Qwen](https://huggingface.co/Qwen)
- [Pokemon Showdown](https://pokemonshowdown.com/)
## Citation
If you build on this work, please cite the upstream tooling as well:
```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```