---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B
datasets:
- Divij/qwen3-32b-mas-traces
language:
- en
library_name: transformers
tags:
- sft
- qwen3
- multi-agent
- distillation
- planner
---

# STEVENZHANG904/Qwen3-0.6B-planner-sft

A supervised fine-tune (SFT) of **Qwen/Qwen3-0.6B** on the **planner** subset of [Divij/qwen3-32b-mas-traces](https://huggingface.co/datasets/Divij/qwen3-32b-mas-traces),
which contains traces of **Qwen3-32B** acting as the `planner` agent in a multi-agent system. This model is the
distilled student: it learns to play the planner role that Qwen3-32B plays in that pipeline.

## Branches

| Branch | Epochs trained | Notes |
|---|---|---|
| `epoch2` | 2 | intermediate |
| `epoch5` | 5 | intermediate |
| `main` | 10 | final |

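Each branch in the table can be selected through the `revision` argument of `from_pretrained`. The following is a minimal sketch, not part of the original card: `BRANCH_EPOCHS`, `resolve_revision`, and `load_checkpoint` are hypothetical helper names introduced for illustration, and the branch names are copied from the table.

```python
# Branch names and epoch counts copied from the table above.
BRANCH_EPOCHS = {"epoch2": 2, "epoch5": 5, "main": 10}

REPO = "STEVENZHANG904/Qwen3-0.6B-planner-sft"


def resolve_revision(epochs: int) -> str:
    """Return the branch whose checkpoint was trained for `epochs` epochs."""
    for branch, trained in BRANCH_EPOCHS.items():
        if trained == epochs:
            return branch
    available = sorted(BRANCH_EPOCHS.values())
    raise ValueError(f"no branch trained for {epochs} epochs; available: {available}")


def load_checkpoint(epochs: int = 10):
    """Load tokenizer and model from the matching branch (downloads weights)."""
    # Imported lazily so resolve_revision works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    revision = resolve_revision(epochs)
    tok = AutoTokenizer.from_pretrained(REPO, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(REPO, revision=revision)
    return tok, model
```

For example, `load_checkpoint(5)` would fetch the `epoch5` intermediate checkpoint, which can be useful for comparing it against the final `main` weights.
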
## Training configuration

- **Base model:** `Qwen/Qwen3-0.6B`
- **Dataset:** `Divij/qwen3-32b-mas-traces` (config `planner`)
- **Loss:** assistant-only (system + user tokens masked)
- **Optimizer:** AdamW (β=(0.9, 0.95), wd=0.01, eps=1e-8)
- **Learning rate:** 1e-5, constant with 3% warmup
- **Sequence length:** 8192 (sequence packing on)
- **Precision:** bf16
- **Hardware:** 8× H100 80GB, DDP
- **Liger-Kernel:** on (chunked CE + fused RMSNorm)

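Two items in the list above, the learning-rate schedule and the assistant-only loss, can be illustrated in plain Python. This is a sketch under stated assumptions rather than the actual training code: the card does not say how warmup is shaped or rounded, so linear warmup and `round()` are assumed here, and `lr_at_step` and `mask_non_assistant` are hypothetical names. The `-100` label value is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_ratio: float = 0.03) -> float:
    """Assumed schedule: linear warmup over the first 3% of steps, then constant."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr


def mask_non_assistant(token_ids, assistant_mask):
    """Build labels: keep assistant token ids, replace all others with IGNORE_INDEX."""
    if len(token_ids) != len(assistant_mask):
        raise ValueError("token_ids and assistant_mask must have the same length")
    return [
        tid if is_assistant else IGNORE_INDEX
        for tid, is_assistant in zip(token_ids, assistant_mask)
    ]
```

With 1,000 total steps, this schedule ramps up over the first 30 steps and then holds the rate at 1e-5. The masking helper leaves system and user positions out of the loss, so only the planner's own outputs contribute gradients.
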
## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "STEVENZHANG904/Qwen3-0.6B-planner-sft"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="cuda")

# The planner role expects a task-spec prompt; see the dataset card for the exact format.
messages = [
    {"role": "system", "content": "You are a helpful, creative, and smart assistant."},
    {"role": "user", "content": "<your planner task spec here>"},
]
inputs = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
out = model.generate(
    inputs, max_new_tokens=4096,
    do_sample=True, temperature=0.6, top_p=0.95,  # Qwen3 thinking-mode defaults
)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The model emits `<think>...</think>` reasoning blocks (inherited from the Qwen3-32B traces).
**Use sampling**, not greedy decoding: small distilled models can get stuck in loops inside `<think>` under greedy decoding.
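
Downstream agents usually need only the plan, not the reasoning. The sketch below separates the two in the decoded text. It assumes the `<think>` and `</think>` tags survive decoding as literal strings; `split_reasoning` is a hypothetical helper, not part of the model's tooling.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Return (reasoning, answer) from decoded output with optional <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    if "<think>" in answer:
        # Unterminated block, e.g. generation stopped at max_new_tokens mid-thought.
        answer, _, tail = answer.partition("<think>")
        reasoning = (reasoning + "\n" + tail).strip()
        answer = answer.strip()
    return reasoning, answer
```

An empty answer paired with non-empty reasoning is a hint that generation was cut off inside `<think>`, in which case raising `max_new_tokens` or resampling may help.
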