Qwen/Qwen3-0.6B fine-tuned with SFT on the planner subset of Divij/qwen3-32b-mas-traces,
which contains traces of Qwen3-32B acting as the planner agent in a multi-agent system. This model is the
distilled student: it learns to play the same planner role that Qwen3-32B plays in that pipeline.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    repo = "STEVENZHANG904/Qwen3-0.6B-planner-sft"
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="cuda")

    # Planner role expects a task-spec prompt; see the dataset card for the exact format.
    messages = [
        {"role": "system", "content": "You are a helpful, creative, and smart assistant."},
        {"role": "user", "content": "<your planner task spec here>"},
    ]
    inputs = tok.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to("cuda")

    out = model.generate(
        inputs,
        max_new_tokens=4096,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,  # Qwen3 thinking-mode defaults
    )
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

The model emits <think>...</think> reasoning blocks (inherited from Qwen3-32B traces).
Use sampling rather than greedy decoding: small distilled models can get stuck in repetitive loops inside <think> under greedy decoding.
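
Because the output mixes a <think>...</think> block with the plan itself, downstream code will usually want to separate the two. The following is a minimal sketch of one way to do that on the decoded string, assuming the tags survive decoding as literal text. The helper name `split_think` and its return shape are hypothetical and not part of this repo or of transformers. It also reports whether the reasoning block was closed, which is a cheap signal that generation ran out of tokens while still thinking.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, str, bool]:
    """Split decoded output into (reasoning, answer, closed).

    Hypothetical helper for illustration only.
    closed is False when <think> was opened but never closed, e.g. when
    generation hit max_new_tokens in the middle of the reasoning block.
    """
    match = THINK_RE.search(text)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
        return reasoning, answer, True
    if "<think>" in text:
        # Unclosed block: everything after the tag is reasoning, no answer yet.
        reasoning = text.split("<think>", 1)[1].strip()
        return reasoning, "", False
    # No reasoning block at all: treat the whole text as the answer.
    return "", text.strip(), True


if __name__ == "__main__":
    demo = "<think>\nBreak the task into steps.\n</think>\n\nPlan: step A, then step B."
    reasoning, answer, closed = split_think(demo)
    print("closed:", closed)
    print("answer:", answer)
```

If `closed` comes back False, one option is to resample (the run is stochastic under `do_sample=True`) rather than pass a truncated plan to the rest of the pipeline.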