license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags: tool-calling, multi-turn, fine-tuned, tft-benchmark
datasets: google-research-datasets/dstc8-schema-guided-dialogue
language: en
pipeline_tag: text-generation
library_name: transformers

tft-benchmark-s4-direct-Qwen3-1.7B

A Qwen3-1.7B model fine-tuned for multi-turn tool calling as part of the TFT (Training from Traces) Benchmark.

  • Pipeline: Direct Training
  • Scenario: S4 (Low Data)
  • LLM-as-a-judge score: 0.649
  • staged_tool_call score: 0.66

For full benchmark details, see our blog post: Why Training on Production Traces Fails (and What to Do Instead)

Benchmark Overview

This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:

  • TFT Pipeline: trace filtering + committee relabeling + synthetic data generation + finetuning
  • Direct Training: train directly on the raw/corrupted traces (no filtering, no relabeling, no synthetic data generation)

Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).
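The aggregation implied above can be sketched as follows. This is a hypothetical stand-in for the benchmark's actual harness: it simply averages per-turn judge scores (each on the 0-1 scale) into one benchmark score.

```python
from statistics import mean

def benchmark_score(judge_scores):
    """Average per-turn LLM-as-a-judge scores (each in [0, 1])
    into a single 0-1 benchmark score."""
    if not all(0.0 <= s <= 1.0 for s in judge_scores):
        raise ValueError("judge scores must lie in [0, 1]")
    return mean(judge_scores)

# Toy example: four per-turn evaluation pairs scored by the judge.
score = benchmark_score([1.0, 0.5, 0.75, 0.5])
```

In the real benchmark the list would hold ~359 per-turn scores, one per evaluation pair in the held-out test set.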

Scenario: S4 (Low Data)

Training uses only 5 clean Restaurants_1 traces (subsampled from the full set of 327). This tests extreme data scarcity: Direct Training has only ~55 per-turn examples after expansion, while TFT amplifies the 5 seed conversations via synthetic data generation.

Training Details

Trained using Direct Training: the student model is fine-tuned directly on the raw production traces (expanded into per-turn training examples) with no filtering, relabeling, or synthetic data generation.
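A minimal sketch of this per-turn expansion, under the assumption that each assistant turn becomes one training example conditioned on all preceding messages (the benchmark's exact splitting logic is not published here):

```python
def expand_per_turn(conversation):
    """Expand one multi-turn trace into per-turn training examples:
    each assistant turn becomes a target, with every earlier message
    as the input context."""
    examples = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant":
            examples.append({"context": conversation[:i], "target": msg})
    return examples

# Toy trace with two assistant turns -> two per-turn examples.
trace = [
    {"role": "user", "content": "Find me an Italian place in Oakland."},
    {"role": "assistant", "content": "FindRestaurants(cuisine='Italian', city='Oakland')"},
    {"role": "user", "content": "Book a table for two at 7pm."},
    {"role": "assistant", "content": "ReserveRestaurant(...)"},
]
examples = expand_per_turn(trace)
```

Under this scheme, 5 seed conversations averaging ~11 assistant turns each would yield the ~55 per-turn examples cited for S4.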

Configuration

  • Base model: Qwen3-1.7B
  • Task: multi-turn-tool-calling-closed-book
  • Teacher / synthetic data generation model: zai.glm-5
  • Judge model: openai.gpt-oss-120b
  • Committee (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
  • Training: LoRA fine-tuning, merged weights

Target Tools

Based on the Schema-Guided Dialogue (SGD) dataset — restaurant search and reservation:

  • respond_to_user — send text messages to the user
  • FindRestaurants — search restaurants by cuisine, city, price range, live music, alcohol
  • ReserveRestaurant — reserve a table (restaurant name, city, time, date, party size)
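In OpenAI-style function-calling JSON, the three tools might be declared roughly as below. The parameter names and required fields are inferred from the bullet descriptions above, not taken from the benchmark's actual schemas.

```python
# Hypothetical tool schemas inferred from the SGD Restaurants_1 domain;
# the benchmark's real schemas may differ in names and required fields.
TOOLS = [
    {"name": "respond_to_user",
     "description": "Send a text message to the user.",
     "parameters": {"type": "object",
                    "properties": {"message": {"type": "string"}},
                    "required": ["message"]}},
    {"name": "FindRestaurants",
     "description": "Search restaurants by cuisine, city, price range, live music, alcohol.",
     "parameters": {"type": "object",
                    "properties": {"cuisine": {"type": "string"},
                                   "city": {"type": "string"},
                                   "price_range": {"type": "string"},
                                   "has_live_music": {"type": "boolean"},
                                   "serves_alcohol": {"type": "boolean"}},
                    "required": ["cuisine", "city"]}},
    {"name": "ReserveRestaurant",
     "description": "Reserve a table at a restaurant.",
     "parameters": {"type": "object",
                    "properties": {"restaurant_name": {"type": "string"},
                                   "city": {"type": "string"},
                                   "time": {"type": "string"},
                                   "date": {"type": "string"},
                                   "party_size": {"type": "integer"}},
                    "required": ["restaurant_name", "city", "time",
                                 "date", "party_size"]}},
]

tool_names = [t["name"] for t in TOOLS]
```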

Full Benchmark Results

Scenario          TFT    Direct  Delta
S1 Baseline       0.866  0.864   +0.2pp
S2 Noisy Labels   0.844  0.721   +12.3pp
S3 Schema Drift   0.844  0.585   +25.9pp
S4 Low Data       0.852  0.649   +20.3pp
S5 Trace Mixing   0.858  0.694   +16.4pp

TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.