---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
pipeline_tag: text-generation
library_name: transformers
---
# tft-benchmark-s2-tft-Qwen3-1.7B
A Qwen3-1.7B model fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark.
- Pipeline: TFT Pipeline
- Scenario: S2 (Noisy Labels)
- LLM-as-a-judge score: 0.844
- `staged_tool_call` score: 0.758
For full benchmark details, see our blog post: *Why Training on Production Traces Fails (and What to Do Instead)*
## Benchmark Overview
This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:
- TFT Pipeline: trace filtering + committee relabeling + synthetic data generation + finetuning
- Direct Training: train directly on raw/corrupted traces (no filtering, no relabeling, no synthetic data generation)
Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).
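The per-turn judge scoring can be sketched as a simple aggregation, assuming each assistant turn receives a 0-1 score from the judge. The function and data shapes below are illustrative only, not the benchmark's actual evaluation harness:

```python
def aggregate_judge_scores(per_turn_scores):
    """Average per-turn judge scores (each in [0, 1]) into one model-level score.

    `per_turn_scores` maps a conversation id to the list of 0-1 scores the
    judge assigned to that conversation's assistant turns.
    """
    all_scores = [s for turns in per_turn_scores.values() for s in turns]
    if not all_scores:
        raise ValueError("no evaluation pairs")
    return sum(all_scores) / len(all_scores)

# Toy example: two conversations, five per-turn evaluation pairs in total.
scores = {
    "conv-01": [1.0, 0.5, 1.0],
    "conv-02": [0.8, 0.7],
}
print(round(aggregate_judge_scores(scores), 3))
```

In the benchmark itself this average is taken over all ~359 per-turn evaluation pairs of the held-out test set.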
## Scenario: S2 (Noisy Labels)
327 Restaurants_1 traces with 50% of assistant tool calls corrupted. Corruption types target tool timing and selection: service-tool calls may be swapped, given wrong parameters, or replaced with `respond_to_user` calls (and vice versa). 52% of corruptions change the tool choice itself.
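A minimal sketch of this corruption scheme, assuming a simple dict representation of tool calls (the trace format, helper name, and corruption details are illustrative assumptions, not the benchmark's actual code). Exactly half of the calls are corrupted by sampling indices, and each victim is either tool-swapped, given wrong parameters, or replaced with a `respond_to_user` call:

```python
import random

CORRUPTIONS = ("swap_tool", "wrong_params", "to_respond_to_user")

def corrupt_tool_calls(tool_calls, seed=0):
    """Corrupt 50% of tool calls, mimicking the S2 noisy-labels scenario."""
    rng = random.Random(seed)
    n_corrupt = len(tool_calls) // 2
    victims = set(rng.sample(range(len(tool_calls)), n_corrupt))
    corrupted = []
    for i, call in enumerate(tool_calls):
        call = dict(call)  # leave the original trace untouched
        if i in victims:
            kind = rng.choice(CORRUPTIONS)
            if kind == "swap_tool":
                call["name"] = ("FindRestaurants"
                                if call["name"] != "FindRestaurants"
                                else "ReserveRestaurant")
            elif kind == "wrong_params":
                call["arguments"] = {k: "WRONG" for k in call["arguments"]}
            else:  # replace the service-tool call with a plain text reply
                call["name"], call["arguments"] = "respond_to_user", {"text": "..."}
            call["corrupted"] = True
        corrupted.append(call)
    return corrupted
```

Direct Training consumes the corrupted traces as-is; the TFT pipeline first tries to filter and relabel them.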
## Training Details
Trained using the TFT (Training from Traces) pipeline: production traces are filtered, committee-relabeled by multiple LLMs, then used as seeds for synthetic data generation. The student model is fine-tuned on the resulting synthetic dataset.
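The committee-relabeling step can be sketched as a majority vote, assuming each committee model proposes a tool-call label for a turn and a trace is kept only when a strict majority agrees. The function names and agreement rule here are illustrative; the actual pipeline lives in the benchmark repository:

```python
from collections import Counter

def committee_relabel(traces, committee):
    """Relabel each trace by committee majority vote.

    `committee` is a list of callables; each takes a trace and returns a
    proposed tool-call label. Traces with no strict-majority label are dropped.
    """
    kept = []
    for trace in traces:
        votes = Counter(model(trace) for model in committee)
        label, count = votes.most_common(1)[0]
        if count > len(committee) / 2:  # strict majority required
            kept.append({**trace, "label": label})
    return kept
```

Note that with a two-model committee, as used here, a strict majority means both models must agree; disputed traces are filtered out before synthetic data generation.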
### Configuration
- Base model: Qwen3-1.7B
- Task: multi-turn-tool-calling-closed-book
- Teacher / synthetic data generation model: zai.glm-5
- Judge model: openai.gpt-oss-120b
- Committee (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
- Training: LoRA fine-tuning, merged weights
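The card does not publish the LoRA hyperparameters; a typical PEFT setup for a Qwen3-class model might look like the following sketch. Every value here is an assumption for illustration, not the benchmark's actual configuration:

```python
from peft import LoraConfig

# Hypothetical hyperparameters -- the card does not publish the real ones.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Attach with peft.get_peft_model(base_model, lora_config), train, then call
# merge_and_unload() so the adapter weights are folded back into the base
# model -- which is why this repository ships merged weights.
```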
## Target Tools
Based on the Schema-Guided Dialogue (SGD) dataset — restaurant search and reservation:
- `respond_to_user`: send text messages to the user
- `FindRestaurants`: search restaurants by cuisine, city, price range, live music, alcohol
- `ReserveRestaurant`: reserve a table (restaurant name, city, time, date, party size)
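These three tools can be expressed as JSON-schema function definitions of the kind `transformers` chat templates accept. The parameter names below are assumptions loosely based on the SGD Restaurants_1 slots, not the benchmark's verbatim schemas:

```python
# Illustrative tool schemas; parameter names are assumed, not verbatim.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "respond_to_user",
            "description": "Send a text message to the user.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "FindRestaurants",
            "description": "Search restaurants by cuisine, city, price range, "
                           "live music, and alcohol availability.",
            "parameters": {
                "type": "object",
                "properties": {
                    "cuisine": {"type": "string"},
                    "city": {"type": "string"},
                    "price_range": {"type": "string"},
                    "has_live_music": {"type": "boolean"},
                    "serves_alcohol": {"type": "boolean"},
                },
                "required": ["cuisine", "city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ReserveRestaurant",
            "description": "Reserve a table at a restaurant.",
            "parameters": {
                "type": "object",
                "properties": {
                    "restaurant_name": {"type": "string"},
                    "city": {"type": "string"},
                    "time": {"type": "string"},
                    "date": {"type": "string"},
                    "party_size": {"type": "integer"},
                },
                "required": ["restaurant_name", "city", "time", "date",
                             "party_size"],
            },
        },
    },
]
```

Such a list would typically be passed via the `tools` argument of the tokenizer's `apply_chat_template` when prompting the model.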
## Full Benchmark Results
| Scenario | TFT | Direct | Delta |
|---|---|---|---|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | 0.844 | 0.721 | +12.3pp |
| S3 Schema Drift | 0.844 | 0.585 | +25.9pp |
| S4 Low Data | 0.852 | 0.649 | +20.3pp |
| S5 Trace Mixing | 0.858 | 0.694 | +16.4pp |
TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.
## Links
- Blog post: *Why Training on Production Traces Fails (and What to Do Instead)*
- Benchmark data & code: https://github.com/distil-labs/distil-tft-benchmarking
- Dataset: Schema-Guided Dialogue (SGD)