---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- tool-calling
- multi-turn
- fine-tuned
- tft-benchmark
datasets:
- google-research-datasets/dstc8-schema-guided-dialogue
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# tft-benchmark-s2-tft-Qwen3-1.7B

A **Qwen3-1.7B** model fine-tuned for multi-turn tool calling as part of the [TFT (Training from Traces) Benchmark](https://github.com/distil-labs/distil-tft-benchmarking).

- **Pipeline**: TFT Pipeline
- **Scenario**: S2 (Noisy Labels)
- **LLM-as-a-judge score**: **0.844**
- **staged_tool_call score**: **0.758**

For full benchmark details, see our blog post: [Why Training on Production Traces Fails (and What to Do Instead)](https://www.distillabs.ai/blog/traces-vs-synthetic-benchmark/)

## Benchmark Overview

This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:

- **TFT Pipeline**: trace filtering + committee relabeling + synthetic data generation + fine-tuning
- **Direct Training**: training directly on raw/corrupted traces (no filtering, no relabeling, no synthetic generation)

Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).

## Scenario: S2 (Noisy Labels)

327 Restaurants_1 traces with 50% of assistant tool calls corrupted. Corruptions attack tool timing: service tool calls may be swapped, given wrong parameters, or replaced with `respond_to_user` calls (and vice versa). 52% of corruptions change the tool choice itself.

## Training Details

Trained using the **TFT (Training from Traces) pipeline**: production traces are filtered, committee-relabeled by multiple LLMs, then used as seeds for synthetic data generation. The student model is fine-tuned on the resulting synthetic dataset.
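The committee-relabeling step can be sketched as a majority vote: each committee model proposes a label for a turn, and only turns with sufficient agreement are kept. The function and trace format below are illustrative assumptions, not the benchmark's actual implementation.

```python
from collections import Counter

def committee_relabel(trace_turns, committee_label_fns, min_agreement=2):
    """Illustrative sketch of committee relabeling (names are hypothetical).

    Each committee model proposes a tool-call label for a turn; the turn is
    kept with the majority label only if at least `min_agreement` models agree.
    """
    relabeled = []
    for turn in trace_turns:
        # Each labeler maps a turn to a proposed tool-call string,
        # e.g. "FindRestaurants" or "respond_to_user".
        proposals = [label_fn(turn) for label_fn in committee_label_fns]
        label, votes = Counter(proposals).most_common(1)[0]
        if votes >= min_agreement:
            relabeled.append({**turn, "tool_call": label})
        # Turns without sufficient committee agreement are filtered out.
    return relabeled
```

Raising `min_agreement` trades recall for label quality: disputed turns are dropped rather than relabeled.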
### Configuration

- **Base model**: Qwen3-1.7B
- **Task**: multi-turn-tool-calling-closed-book
- **Teacher / synth gen model**: zai.glm-5
- **Judge model**: openai.gpt-oss-120b
- **Committee** (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
- **Training**: LoRA fine-tuning, merged weights

### Target Tools

Based on the [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue) dataset — restaurant search and reservation:

- `respond_to_user` — send text messages to the user
- `FindRestaurants` — search restaurants by cuisine, city, price range, live music, alcohol
- `ReserveRestaurant` — reserve a table (restaurant name, city, time, date, party size)

## Full Benchmark Results

| Scenario | TFT | Direct | Delta |
|----------|-----|--------|-------|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | **0.844** | 0.721 | **+12.3pp** |
| S3 Schema Drift | **0.844** | 0.585 | **+25.9pp** |
| S4 Low Data | **0.852** | 0.649 | **+20.3pp** |
| S5 Trace Mixing | **0.858** | 0.694 | **+16.4pp** |

TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.

## Links

- **Blog post**: [Why Training on Production Traces Fails (and What to Do Instead)](https://www.distillabs.ai/blog/traces-vs-synthetic-benchmark/)
- **Benchmark data & code**: [https://github.com/distil-labs/distil-tft-benchmarking](https://github.com/distil-labs/distil-tft-benchmarking)
- **Dataset**: [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)
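## Usage (sketch)

A minimal sketch of calling the model with the target tools via the transformers chat template. The parameter names in the schemas below are inferred from the tool descriptions above and may differ from the schemas used in training; replace the `MODEL_ID` placeholder with this repository's Hugging Face id.

```python
MODEL_ID = "..."  # placeholder: replace with this repository's Hugging Face id

# OpenAI-style function schemas for the target tools.
# Assumption: parameter names are inferred from the tool descriptions above.
tools = [
    {"type": "function", "function": {
        "name": "respond_to_user",
        "description": "Send a text message to the user.",
        "parameters": {"type": "object",
                       "properties": {"text": {"type": "string"}},
                       "required": ["text"]}}},
    {"type": "function", "function": {
        "name": "FindRestaurants",
        "description": "Search restaurants by cuisine, city, price range, live music, alcohol.",
        "parameters": {"type": "object",
                       "properties": {"cuisine": {"type": "string"},
                                      "city": {"type": "string"},
                                      "price_range": {"type": "string"},
                                      "has_live_music": {"type": "boolean"},
                                      "serves_alcohol": {"type": "boolean"}},
                       "required": ["cuisine", "city"]}}},
    {"type": "function", "function": {
        "name": "ReserveRestaurant",
        "description": "Reserve a table at a restaurant.",
        "parameters": {"type": "object",
                       "properties": {"restaurant_name": {"type": "string"},
                                      "city": {"type": "string"},
                                      "time": {"type": "string"},
                                      "date": {"type": "string"},
                                      "party_size": {"type": "integer"}},
                       "required": ["restaurant_name", "city", "time"]}}},
]

messages = [{"role": "user", "content": "Find me a sushi restaurant in San Jose."}]

def generate_tool_call(model_id=MODEL_ID):
    # Imports kept inside the function so the schemas above are usable
    # without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```

The model emits either a service tool call (`FindRestaurants` / `ReserveRestaurant`) or a `respond_to_user` call, matching the closed-book multi-turn setup described above.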