---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B
datasets:
- Divij/qwen3-32b-mas-traces
language:
- en
library_name: transformers
tags:
- sft
- qwen3
- multi-agent
- distillation
- planner
---

# STEVENZHANG904/Qwen3-0.6B-planner-sft

**Qwen/Qwen3-0.6B**, fine-tuned with SFT on the **planner** subset of [Divij/qwen3-32b-mas-traces](https://huggingface.co/datasets/Divij/qwen3-32b-mas-traces). The dataset contains traces of **Qwen3-32B** acting as the `planner` agent in a multi-agent system. This model is the distilled student: it learns to play the role that Qwen3-32B plays in that pipeline.

## Branches

| Branch | Epochs trained | Notes |
|---|---|---|
| `epoch2` | 2 | intermediate |
| `epoch5` | 5 | intermediate |
| `main` | 10 | final |

## Training configuration

- **Base model:** `Qwen/Qwen3-0.6B`
- **Dataset:** `Divij/qwen3-32b-mas-traces` (config `planner`)
- **Loss:** assistant-only (system + user tokens masked)
- **Optimizer:** AdamW (β=(0.9, 0.95), wd=0.01, eps=1e-8)
- **Learning rate:** 1e-5, constant with 3% warmup
- **Sequence length:** 8192 (sequence packing on)
- **Precision:** bf16
- **Hardware:** 8× H100 80GB, DDP
- **Liger-Kernel:** on (chunked CE + fused RMSNorm)

## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "STEVENZHANG904/Qwen3-0.6B-planner-sft"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="cuda")

# The planner role expects a task-spec prompt; see the dataset card for the exact format.
messages = [
    {"role": "system", "content": "You are a helpful, creative, and smart assistant."},
    {"role": "user", "content": ""},  # put the task-spec prompt here
]

# return_dict=True yields input_ids plus attention_mask, which generate() accepts as keyword arguments.
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,  # Qwen3 thinking-mode defaults
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```

The model emits `<think>...</think>` reasoning blocks (inherited from the Qwen3-32B traces). **Use sampling**, not greedy decoding: small distilled models can loop inside `<think>` under greedy decoding.
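
Downstream agents usually need only the final plan, so it can help to separate the reasoning block from the answer after decoding. The following is a minimal sketch, not part of this repository: it assumes the reasoning is wrapped in Qwen3-style `<think>...</think>` tags that survive decoding, and `split_reasoning` is a hypothetical helper name.

```python
import re

# Assumption: reasoning is wrapped in Qwen3-style <think>...</think> tags.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded completion.

    Hypothetical helper for illustration; not part of transformers.
    """
    match = _THINK_RE.search(text)
    if match is None:
        if "<think>" in text:
            # Opening tag without a closing tag: generation was probably
            # cut off by max_new_tokens while the model was still reasoning.
            head, _, tail = text.partition("<think>")
            return tail.strip(), head.strip()
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is whatever surrounds the first reasoning block.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

An empty answer together with non-empty reasoning is a useful signal that the model ran out of tokens (or looped) before producing a plan, in which case resampling is a reasonable fallback.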