Initialize project; model provided by the ModelHub XC community
Model: distillabs/tft-benchmark-s4-tft-Qwen3-1.7B
Source: Original Platform

README.md (new file, 78 lines)

---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- tool-calling
- multi-turn
- fine-tuned
- tft-benchmark
datasets:
- google-research-datasets/dstc8-schema-guided-dialogue
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# tft-benchmark-s4-tft-Qwen3-1.7B

A **Qwen3-1.7B** model fine-tuned for multi-turn tool calling as part of the [TFT (Training from Traces) Benchmark](https://github.com/distil-labs/distil-tft-benchmarking).

- **Pipeline**: TFT Pipeline
- **Scenario**: S4 Low Data
- **LLM-as-a-judge score**: **0.852**
- **staged_tool_call score**: **0.74**

For full benchmark details, see our blog post: [Why Training on Production Traces Fails (and What to Do Instead)](https://www.distillabs.ai/blog/traces-vs-synthetic-benchmark/)
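
A minimal usage sketch is shown below. It assumes the merged weights load through the standard `transformers` chat-template API and that tool schemas can be passed via `tools=`; the exact prompt and tool-call format this checkpoint was trained on may differ.

```python
# Minimal usage sketch, not an official snippet: assumes the merged checkpoint works
# with the standard transformers chat-template API and that tool schemas are passed
# via `tools=` (the exact tool-call format this checkpoint expects may differ).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distillabs/tft-benchmark-s4-tft-Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# One illustrative tool; see "Target Tools" below for the full set.
tools = [{
    "type": "function",
    "function": {
        "name": "FindRestaurants",
        "description": "Search for restaurants by cuisine and city.",
        "parameters": {
            "type": "object",
            "properties": {"cuisine": {"type": "string"}, "city": {"type": "string"}},
            "required": ["cuisine", "city"],
        },
    },
}]

messages = [{"role": "user", "content": "Find me an Italian place in San Jose."}]
input_ids = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```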

## Benchmark Overview

This model is one of 10 models trained for the TFT benchmark, which compares two approaches to training Small Language Models (SLMs) from production traces:

- **TFT Pipeline**: trace filtering + committee relabeling + synthetic data generation + fine-tuning
- **Direct Training**: train directly on raw/corrupted traces (no filtering, no relabeling, no synthetic generation)

Both pipelines are evaluated on the same held-out test set of 34 multi-turn Restaurants_1 conversations (~359 per-turn evaluation pairs) using LLM-as-a-judge scoring (0-1 scale).
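
As a rough illustration of how such a per-turn, judge-scored evaluation could be computed (the actual harness lives in the benchmark repo; the prompt wording, parsing, and aggregation below are assumptions):

```python
# Illustrative sketch of per-turn LLM-as-a-judge scoring; the real harness is in the
# benchmark repository and may differ (prompting, parsing, aggregation are assumptions).
def judge_turn(judge_llm, context, expected_turn, predicted_turn) -> float:
    """Ask the judge model for a 0-1 score comparing the predicted vs. expected turn."""
    prompt = (
        "Score the assistant turn from 0 to 1 for correctness of the tool call or response.\n"
        f"Context: {context}\nExpected: {expected_turn}\nPredicted: {predicted_turn}\nScore:"
    )
    return float(judge_llm(prompt).strip())  # assumes the judge replies with a bare number

def benchmark_score(judge_llm, eval_pairs, model_predict) -> float:
    """Average judge scores over all per-turn evaluation pairs (~359 for this test set)."""
    scores = [
        judge_turn(judge_llm, pair["context"], pair["expected"], model_predict(pair["context"]))
        for pair in eval_pairs
    ]
    return sum(scores) / len(scores)
```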

## Scenario: S4 Low Data

Only 5 clean Restaurants_1 traces (subsampled from 327). This scenario tests extreme data scarcity: Direct Training has only ~55 per-turn examples after expansion, while TFT amplifies the 5 seed conversations via synthetic data generation.
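
A sketch of how one multi-turn trace expands into per-turn examples (which is how 5 conversations yield roughly 55 Direct Training examples); the exact expansion rules used by the benchmark are assumed here:

```python
# Sketch of expanding one multi-turn conversation into per-turn training examples:
# every assistant turn becomes one (context -> target) pair. The precise rules used
# by the benchmark (e.g., which turns count) are assumptions for illustration.
def expand_trace(conversation):
    """conversation: ordered list of {"role": ..., "content": ...} dicts."""
    examples = []
    for i, turn in enumerate(conversation):
        if turn["role"] == "assistant":
            examples.append({"context": conversation[:i], "target": turn})
    return examples

# 5 seed conversations with ~11 assistant turns each would give ~55 per-turn examples.
```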

## Training Details

Trained using the **TFT (Training from Traces) pipeline**: production traces are filtered, committee-relabeled by multiple LLMs, then used as seeds for synthetic data generation. The student model is fine-tuned on the resulting synthetic dataset.
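
A toy sketch of the committee-relabeling idea, assuming a simple majority vote over labels proposed by the committee models (the real pipeline's voting rule and interfaces may differ):

```python
# Illustrative sketch of committee relabeling: several LLMs independently propose a
# label for a trace turn, and agreement decides whether the turn gets relabeled.
# The majority-vote rule and callable interface are assumptions, not the actual pipeline.
from collections import Counter

def committee_relabel(turn_text, committee):
    """committee: list of callables, each mapping text -> a proposed label (e.g., a tool-call string)."""
    proposals = [model(turn_text) for model in committee]
    label, votes = Counter(proposals).most_common(1)[0]
    # Keep the majority label only if the committee actually agrees; otherwise drop the turn.
    return label if votes > len(committee) // 2 else None
```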

### Configuration

- **Base model**: Qwen3-1.7B
- **Task**: multi-turn-tool-calling-closed-book
- **Teacher / synthetic-data-generation model**: zai.glm-5
- **Judge model**: openai.gpt-oss-120b
- **Committee** (TFT relabeling): openai.gpt-oss-120b + zai.glm-5
- **Training**: LoRA fine-tuning, merged weights (see the sketch below)
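
A plausible shape for the LoRA setup and weight merge using `peft`; the rank, alpha, target modules, and training step below are assumptions, since the actual hyperparameters are not listed here:

```python
# Sketch of LoRA fine-tuning + merging with peft/transformers. Rank, alpha, target
# modules, and the training step are assumed values, not the benchmark's actual config.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", torch_dtype="auto")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# ... fine-tune on the synthetic dataset (e.g., with a standard SFT trainer) ...

merged = model.merge_and_unload()   # fold the LoRA deltas back into the base weights
merged.save_pretrained("tft-benchmark-s4-tft-Qwen3-1.7B")
```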

### Target Tools

Based on the [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue) dataset — restaurant search and reservation (illustrative schemas are sketched after the list):

- `respond_to_user` — send text messages to the user
- `FindRestaurants` — search restaurants by cuisine, city, price range, live music, alcohol
- `ReserveRestaurant` — reserve a table (restaurant name, city, time, date, party size)
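
Illustrative, hypothetical schemas for these tools; parameter names loosely follow the SGD Restaurants_1 slots and may not match the exact schemas used in training:

```python
# Hypothetical tool schemas for illustration only; parameter names loosely follow the
# SGD Restaurants_1 slots and may differ from the schemas this model was trained on.
TOOLS = [
    {
        "name": "respond_to_user",
        "parameters": {"message": "string"},  # free-text reply shown to the user
    },
    {
        "name": "FindRestaurants",
        "parameters": {
            "cuisine": "string", "city": "string", "price_range": "string",
            "has_live_music": "boolean", "serves_alcohol": "boolean",
        },
    },
    {
        "name": "ReserveRestaurant",
        "parameters": {
            "restaurant_name": "string", "city": "string",
            "time": "string", "date": "string", "party_size": "integer",
        },
    },
]
```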

## Full Benchmark Results

| Scenario | TFT | Direct | Delta |
|----------|-----|--------|-------|
| S1 Baseline | 0.866 | 0.864 | +0.2pp |
| S2 Noisy Labels | **0.844** | 0.721 | **+12.3pp** |
| S3 Schema Drift | **0.844** | 0.585 | **+25.9pp** |
| S4 Low Data | **0.852** | 0.649 | **+20.3pp** |
| S5 Trace Mixing | **0.858** | 0.694 | **+16.4pp** |

TFT matches Direct Training on clean data (S1) and outperforms it on every corrupted scenario by 12-26 percentage points.

## Links

- **Blog post**: [Why Training on Production Traces Fails (and What to Do Instead)](https://www.distillabs.ai/blog/traces-vs-synthetic-benchmark/)
- **Benchmark data & code**: [https://github.com/distil-labs/distil-tft-benchmarking](https://github.com/distil-labs/distil-tft-benchmarking)
- **Dataset**: [Schema-Guided Dialogue (SGD)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)