---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- tool-calling
- multi-turn
- fine-tuned
- tft-benchmark
datasets:
- google-research-datasets/dstc8-schema-guided-dialogue
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# tft-benchmark-s1-tft-Qwen3-1.7B

A **Qwen3-1.7B** model fine-tuned for multi-turn tool calling as part of the [TFT (Training from Traces) Benchmark](https://github.com/distil-labs/distil-tft-benchmarking).

- **Pipeline**: TFT Pipeline
- **Scenario**: S1 Baseline
- **LLM-as-a-judge score**: **0.866**
- **staged_tool_call score**: **0.765**

For full benchmark details, see our blog post: [Why Training on Production Traces Fails (and What to Do Instead)](https://www.distillabs.ai/blog/traces-vs-synthetic-benchmark/).
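Since the LoRA adapters are merged into the base weights (see Training Details below), the checkpoint loads like any standard causal LM. A minimal loading sketch with `transformers`; the repo id below is inferred from this card's title and may need adjusting:

```python
# Minimal loading sketch. The repo id is an assumption based on this
# card's title; replace it with the actual Hub path if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distil-labs/tft-benchmark-s1-tft-Qwen3-1.7B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's dtype
    device_map="auto",   # requires the `accelerate` package
)
```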
## Benchmark Overview

This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:

- **TFT Pipeline**: trace filtering + committee relabeling + synthetic data generation + fine-tuning
- **Direct Training**: training directly on the raw/corrupted traces (no filtering, no relabeling, no synthetic data generation)

Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).
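To make the protocol concrete, here is an illustrative sketch of per-turn judge scoring. The `generate` and `judge` callables are hypothetical stand-ins for the student model and the judge LLM; this is not the benchmark's actual harness:

```python
# Illustrative per-turn LLM-as-a-judge scoring (hypothetical sketch).
# `generate` maps a conversation prefix to a model response;
# `judge` returns a score in [0, 1] for that response.

def per_turn_pairs(conversation):
    """For each assistant turn, yield (context, reference): the turns
    preceding it, and the turn itself as the reference answer."""
    for i, turn in enumerate(conversation):
        if turn["role"] == "assistant":
            yield conversation[:i], turn

def benchmark_score(conversations, generate, judge):
    """Mean judge score over all per-turn evaluation pairs."""
    scores = [
        judge(context, reference, generate(context))
        for conversation in conversations
        for context, reference in per_turn_pairs(conversation)
    ]
    return sum(scores) / len(scores)
```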
## Scenario: S1 Baseline

327 clean Restaurants_1 traces with no corruption. This scenario tests the quality ceiling: how well each pipeline performs with perfect data.
## Training Details

Trained using the **TFT (Training from Traces) pipeline**: production traces are filtered, committee-relabeled by multiple LLMs, then used as seeds for synthetic data generation. The student model is fine-tuned on the resulting synthetic dataset.
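A high-level sketch of those four stages; every object here is an illustrative stand-in, not the benchmark's actual code:

```python
# Hypothetical outline of the TFT stages described above. The filter,
# committee, teacher, and student are caller-supplied stand-ins.

def tft_pipeline(raw_traces, quality_filter, committee, teacher, student):
    clean = [t for t in raw_traces if quality_filter(t)]    # 1. trace filtering
    relabeled = [committee.relabel(t) for t in clean]       # 2. committee relabeling
    synthetic = teacher.generate(seed_traces=relabeled)     # 3. synthetic data generation
    return student.finetune(synthetic)                      # 4. fine-tune the student (LoRA)
```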
### Configuration

- **Base model**: Qwen3-1.7B
- **Task**: multi-turn-tool-calling-closed-book
- **Teacher / synthetic data generation model**: zai.glm-5
- **Judge model**: openai.gpt-oss-120b
- **Committee** (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
- **Training**: LoRA fine-tuning, merged weights
### Target Tools

Based on the [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue) dataset — restaurant search and reservation:

- `respond_to_user` — send text messages to the user
- `FindRestaurants` — search restaurants by cuisine, city, price range, live music, alcohol
- `ReserveRestaurant` — reserve a table (restaurant name, city, time, date, party size)
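For illustration, the two API tools can be handed to the tokenizer's chat template as JSON-schema function definitions. The schemas below are reconstructed from the slot names listed above (an approximation, not the benchmark's exact definitions), the repo id is again an assumption, and `respond_to_user` is assumed to be realized as a plain assistant text message rather than a schema entry:

```python
# Illustrative tool schemas reconstructed from the slots above; an
# approximation, not the benchmark's exact definitions.
from transformers import AutoTokenizer

tools = [
    {"type": "function", "function": {
        "name": "FindRestaurants",
        "description": "Search for restaurants matching the given criteria.",
        "parameters": {"type": "object", "properties": {
            "cuisine": {"type": "string"},
            "city": {"type": "string"},
            "price_range": {"type": "string"},
            "has_live_music": {"type": "boolean"},
            "serves_alcohol": {"type": "boolean"},
        }, "required": ["cuisine", "city"]},
    }},
    {"type": "function", "function": {
        "name": "ReserveRestaurant",
        "description": "Reserve a table at a restaurant.",
        "parameters": {"type": "object", "properties": {
            "restaurant_name": {"type": "string"},
            "city": {"type": "string"},
            "time": {"type": "string"},
            "date": {"type": "string"},
            "party_size": {"type": "integer"},
        }, "required": ["restaurant_name", "city", "time"]},
    }},
]

tokenizer = AutoTokenizer.from_pretrained(
    "distil-labs/tft-benchmark-s1-tft-Qwen3-1.7B"  # assumed repo id
)
messages = [{"role": "user", "content": "Find an Italian restaurant in San Jose."}]
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
```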
## Full Benchmark Results

LLM-as-a-judge scores (0-1 scale) for both pipelines across all five scenarios:

| Scenario | TFT | Direct | Delta |
|----------|-----|--------|-------|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | **0.844** | 0.721 | **+12.3pp** |
| S3 Schema Drift | **0.844** | 0.585 | **+25.9pp** |
| S4 Low Data | **0.852** | 0.649 | **+20.3pp** |
| S5 Trace Mixing | **0.858** | 0.694 | **+16.4pp** |
TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.
## Links

- **Blog post**: [Why Training on Production Traces Fails (and What to Do Instead)](https://www.distillabs.ai/blog/traces-vs-synthetic-benchmark/)
- **Benchmark data & code**: [distil-labs/distil-tft-benchmarking](https://github.com/distil-labs/distil-tft-benchmarking)
- **Dataset**: [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)