---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
pipeline_tag: text-generation
library_name: transformers
---
# tft-benchmark-s3-direct-Qwen3-1.7B
A Qwen3-1.7B model fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark.
- Pipeline: Direct Training
- Scenario: S3 (Schema Drift)
- LLM-as-a-judge score: 0.585
- staged_tool_call score: 0.499
For full benchmark details, see our blog post: Why Training on Production Traces Fails (and What to Do Instead)
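A minimal usage sketch with `transformers` (the helper name `chat` and the generation settings are illustrative, not part of the benchmark):

```python
MODEL_ID = "distillabs/tft-benchmark-s3-direct-Qwen3-1.7B"

def chat(messages, max_new_tokens=256):
    """Generate a reply for a list of chat messages with this checkpoint."""
    # transformers/torch are only needed once the model is actually called,
    # so they are imported lazily here.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example (downloads the weights on first call):
# print(chat([{"role": "user", "content": "Find an Italian restaurant in San Jose."}]))
```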
## Benchmark Overview
This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:
- TFT Pipeline: trace filtering + committee relabeling + synthetic data generation + finetuning
- Direct Training: train directly on raw/corrupted traces (no filtering, no relabeling, no synthetic data generation)
Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).
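The per-model benchmark score is presumably the mean of the per-turn judge scores; a sketch of that aggregation, with made-up scores chosen to show the arithmetic:

```python
# Each of the ~359 per-turn evaluation pairs receives an LLM-as-a-judge score
# in [0, 1]; a model's benchmark score is the mean over all pairs.
def aggregate_judge_scores(per_turn_scores):
    if not per_turn_scores:
        raise ValueError("no per-turn scores to aggregate")
    return sum(per_turn_scores) / len(per_turn_scores)

# Four illustrative per-turn scores:
print(round(aggregate_judge_scores([1.0, 0.5, 0.0, 0.84]), 3))  # 0.585
```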
## Scenario: S3 (Schema Drift)
50/50 mix of Restaurants_2 (146 traces) and Restaurants_1 (146 traces) with all function and parameter names randomly renamed. 0% of training data uses correct R1 function names — 21 unique function names and 47 unique parameter names across the training set.
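A sketch of what this corruption looks like in code, assuming the drift is applied as a consistent random renaming of every function and parameter name (the aliasing scheme below is invented for illustration):

```python
import random

def drift_schema(tool_calls, seed=0):
    """Rename every function and parameter name to a fixed random alias,
    so no resulting call uses the original Restaurants_1 names."""
    rng = random.Random(seed)
    aliases = {}

    def alias(name):
        # Each original name maps to exactly one alias across the whole set.
        if name not in aliases:
            aliases[name] = f"{name}_v{rng.randrange(100, 999)}"
        return aliases[name]

    return [
        {"name": alias(call["name"]),
         "arguments": {alias(k): v for k, v in call["arguments"].items()}}
        for call in tool_calls
    ]

calls = [{"name": "FindRestaurants",
          "arguments": {"cuisine": "Italian", "city": "San Jose"}}]
print(drift_schema(calls)[0]["name"])  # a renamed FindRestaurants alias
```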
## Training Details
Trained using Direct Training: the student model is fine-tuned directly on the raw production traces (expanded into per-turn training examples) with no filtering, relabeling, or synthetic data generation.
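The per-turn expansion can be sketched as follows (field names are illustrative, not the benchmark's exact training format):

```python
def expand_per_turn(trace):
    """Turn one multi-turn trace into per-turn training examples: each
    assistant turn becomes a target, with the preceding history as input."""
    return [
        {"context": trace[:i], "target": turn}
        for i, turn in enumerate(trace)
        if turn["role"] == "assistant"
    ]

trace = [
    {"role": "user", "content": "Book a table for two tonight."},
    {"role": "assistant", "content": "Which restaurant and what time?"},
    {"role": "user", "content": "Luigi's at 7pm."},
    {"role": "assistant", "content": "ReserveRestaurant(restaurant_name='Luigi's', ...)"},
]
print(len(expand_per_turn(trace)))  # 2
```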
### Configuration
- Base model: Qwen3-1.7B
- Task: multi-turn-tool-calling-closed-book
- Teacher / synthetic data generation model: zai.glm-5
- Judge model: openai.gpt-oss-120b
- Committee (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
- Training: LoRA fine-tuning, merged weights
## Target Tools
Based on the Schema-Guided Dialogue (SGD) dataset — restaurant search and reservation:
- `respond_to_user`: send text messages to the user
- `FindRestaurants`: search restaurants by cuisine, city, price range, live music, alcohol
- `ReserveRestaurant`: reserve a table (restaurant name, city, time, date, party size)
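In JSON-schema form, the three tools might be declared roughly like this (parameter names follow the card's description; the benchmark's exact schemas may differ):

```python
# Hypothetical tool declarations in the JSON-schema style commonly used for
# tool-calling chat templates.
TOOLS = [
    {
        "name": "respond_to_user",
        "description": "Send a text message to the user.",
        "parameters": {
            "type": "object",
            "properties": {"message": {"type": "string"}},
            "required": ["message"],
        },
    },
    {
        "name": "FindRestaurants",
        "description": "Search restaurants.",
        "parameters": {
            "type": "object",
            "properties": {
                "cuisine": {"type": "string"},
                "city": {"type": "string"},
                "price_range": {"type": "string"},
                "has_live_music": {"type": "boolean"},
                "serves_alcohol": {"type": "boolean"},
            },
            "required": ["cuisine", "city"],
        },
    },
    {
        "name": "ReserveRestaurant",
        "description": "Reserve a table.",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant_name": {"type": "string"},
                "city": {"type": "string"},
                "time": {"type": "string"},
                "date": {"type": "string"},
                "party_size": {"type": "integer"},
            },
            "required": ["restaurant_name", "city", "time"],
        },
    },
]
print([t["name"] for t in TOOLS])
```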
## Full Benchmark Results
| Scenario | TFT | Direct | Delta |
|---|---|---|---|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | 0.844 | 0.721 | +12.3pp |
| S3 Schema Drift | 0.844 | 0.585 | +25.9pp |
| S4 Low Data | 0.852 | 0.649 | +20.3pp |
| S5 Trace Mixing | 0.858 | 0.694 | +16.4pp |
TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.
## Links
- Blog post: Why Training on Production Traces Fails (and What to Do Instead)
- Benchmark data & code: https://github.com/distil-labs/distil-tft-benchmarking
- Dataset: Schema-Guided Dialogue (SGD)