---
license: apache-2.0
base_model: Qwen/Qwen3-4B
library_name: transformers
pipeline_tag: text-generation
tags:
- unsloth
- trl
- sft
- qwen3
- pokemon-showdown
- game-ai
- rocm
- amd
language:
- en
---

# Pokemon Showdown Agent v6

`Pokemon Showdown Agent v6` is a `Qwen/Qwen3-4B` fine-tune for **next-action prediction from raw Pokemon Showdown replay logs**. Given a battle-log prefix and the side it controls, the model is trained to emit a short action command such as `move Earthquake` or `switch Corviknight`.

This release is the merged checkpoint from the `v6` pipeline built with **Unsloth + TRL + AMD ROCm**. The tutorial version of the workflow uses a much smaller streamed subset for fast reproduction; this model is the larger production-oriented artifact.

## What makes v6 different

- It learns directly from messy raw replay logs instead of hand-written state summaries.
- It targets a strict action format suitable for agent pipelines: `move [move-name]` or `switch [pokemon-name]`.
- It was developed around AMD ROCm workflows, with `bfloat16` recommended for stable inference.

## Official notebook

Use the cleaned release notebook `pokemon_showdown_agent_v6_release.ipynb` for the reproducible tutorial flow.

## Intended use

Use this model when you want to:

- Predict the next action from a raw Pokemon Showdown log prefix.
- Build a text-only battle agent or evaluation harness.
- Study agent alignment from real replay trajectories.

This model is **not** a full simulator policy by itself. For ladder play or automated battle loops, you still need legality checks, environment wrappers, and battle-state management outside the model.

## Prompt format

The model expects a chat-style prompt with:

- A `system` message specifying which side the model is playing as.
- A `user` message containing the raw replay log prefix up to the current turn marker.

Recommended system prompt:

```text
You are a Pokemon Showdown battle AI. You play as {side}. Given the battle log, output your next action. Format: move <move-name> OR switch <pokemon-name>. Append terastallize if you terastallize this turn.
```

## Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "GoldenGrapeGentleman1/pokemon-showdown-agent-v6"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a Pokemon Showdown battle AI. You play as p2. "
            "Given the battle log, output your next action. "
            "Format: move <move-name> OR switch <pokemon-name>. "
            "Append terastallize if you terastallize this turn."
        ),
    },
    {
        "role": "user",
        "content": (
            "|player|p1|Player1|266|1500\n"
            "|player|p2|Player2|1|1500\n"
            "|teamsize|p1|6\n"
            "|teamsize|p2|6\n"
            "|gen|9\n"
            "|tier|[Gen 9] OU\n"
            "|\n"
            "|start\n"
            "|switch|p1a: Garchomp|Garchomp, M|100/100\n"
            "|switch|p2a: Corviknight|Corviknight, M|100/100\n"
            "|turn|1\n"
            "|move|p1a: Garchomp|Earthquake|p2a: Corviknight\n"
            "|-immune|p2a: Corviknight\n"
            "|turn|2"
        ),
    },
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        pad_token_id=tokenizer.eos_token_id,
    )

decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
response = decoded.split("<|im_start|>assistant\n")[-1].replace("<|im_end|>", "").strip()

# Qwen3 may emit a <think>...</think> reasoning block before the action; strip it.
if response.startswith("<think>"):
    response = response.split("</think>", 1)[-1].strip()

print(response)
```
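The model's reply is plain text, so anything downstream should verify that it actually matches the strict action grammar before executing it. Below is a minimal sketch of such a check, assuming the `move <move-name>` / `switch <pokemon-name>` format with an optional `terastallize` suffix described above; the `parse_action` helper and its regex are illustrative and not part of this release.

```python
import re

# Illustrative pattern for the strict action grammar:
# "move <move-name>" or "switch <pokemon-name>", optionally followed by
# "terastallize". Names may contain spaces, apostrophes, and hyphens.
ACTION_RE = re.compile(r"^(move|switch)\s+([A-Za-z0-9'.:\- ]+?)(\s+terastallize)?$")

def parse_action(response: str):
    """Return (kind, target, terastallize) or None if the reply is malformed."""
    match = ACTION_RE.match(response.strip())
    if match is None:
        return None
    kind, target, tera = match.groups()
    return kind, target.strip(), tera is not None

print(parse_action("move Earthquake"))               # ('move', 'Earthquake', False)
print(parse_action("switch Corviknight"))            # ('switch', 'Corviknight', False)
print(parse_action("move Tera Blast terastallize"))  # ('move', 'Tera Blast', True)
print(parse_action("use Earthquake"))                # None
```

In an agent loop, a `None` result is a natural trigger for a retry or a safe fallback action, and a parsed action should still go through the legality checks mentioned above before it reaches the simulator.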
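To poke at the same source data without downloading the full corpus (the full-scale preprocessing is summarized in the next section), the replay dataset can be streamed and filtered on the fly with `datasets`. This is a sketch only: the `rating` and `log` field names are assumptions about the dataset schema, so check the dataset card for the real column names.

```python
from datasets import load_dataset

# Stream the source corpus instead of materializing it on disk.
stream = load_dataset(
    "milkkarten/pokemon-showdown-replays-merged",
    split="train",
    streaming=True,
)

# Keep only higher-rated games, mirroring the tutorial's rating filter.
# NOTE: "rating" is an assumed column name, not a verified one.
filtered = stream.filter(lambda game: (game.get("rating") or 0) >= 1500)

# Take a small subset for quick experiments.
for game in filtered.take(100):
    ...  # feed game["log"] (assumed field) into your preprocessing step
```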
## Training data

The full `v6` preprocessing pipeline was built from the public dataset:

- Source dataset: [`milkkarten/pokemon-showdown-replays-merged`](https://huggingface.co/datasets/milkkarten/pokemon-showdown-replays-merged)

Project preprocessing summary:

- `100,000` train games
- `10,000` test games
- `2,330,115` train samples
- `236,349` test samples
- `min_rating = 1200`
- `max_chars = 12000`

The companion tutorial notebook uses a smaller streamed subset with a higher rating filter so readers can reproduce the workflow quickly without downloading the full corpus.

## Training recipe

- Base model: `Qwen/Qwen3-4B`
- Fine-tuning method: LoRA SFT with Unsloth
- LoRA rank / alpha: `64 / 128`
- Full training context length: up to `4096`
- Frameworks: Unsloth, TRL, Transformers, Datasets
- Deployment recommendation on AMD: prefer `bfloat16` inference for stability

## Limitations

- This is a research checkpoint, not a complete battle engine.
- The model can still produce illegal or strategically weak actions.
- Prompt wording matters; changing the system format can reduce output reliability.
- Included evaluation artifacts are sanity checks, not a full competitive benchmark.

## Acknowledgements

- [Unsloth](https://github.com/unslothai/unsloth)
- [TRL](https://github.com/huggingface/trl)
- [Qwen](https://huggingface.co/Qwen)
- [Pokemon Showdown](https://pokemonshowdown.com/)

## Citation

If you build on this work, please cite the upstream tooling as well:

```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```