---
license: apache-2.0
language:
- en
- vi
base_model: Qwen/Qwen3-1.7B
tags:
- function-calling
- tool-use
- qwen3
- grpo
- rl-fine-tuned
datasets:
- Salesforce/xlam-function-calling-60k
- Team-ACE/ToolACE
- Agent-Ark/Toucan-1.5M
pipeline_tag: text-generation
library_name: transformers
---
# Qwen3-1.7B-FC: Function Calling Specialist
A function calling model based on Qwen3-1.7B, fine-tuned using **RLVR (Reinforcement Learning with Verifiable Rewards)** to improve tool-use capabilities on the BFCL V3 benchmark.
## 🏆 Performance Highlights
| Model | Size | BFCL Overall | Category Avg |
|-------|------|--------------|--------------|
| **Qwen3-1.7B-FC (Ours)** | **1.7B** | **54.2%** | **50.8%** |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |
### Response Efficiency
| Model | Avg Response Tokens | Efficiency vs Base |
|-------|--------------------|--------------------|
| Base Qwen3-1.7B | 35.6 tokens | - |
| **Qwen3-1.7B-FC (Ours)** | **22.7 tokens** | **-36%** |
The fine-tuned model generates **36% fewer tokens** while maintaining higher accuracy, thanks to:
- Direct tool calls without verbose preambles
- Concise refusal messages ("None of the provided tools can answer this question")
- Reduced `<think>` reasoning blocks
## 📊 Detailed Benchmark Results (BFCL V3)
### Core Function Calling
| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|---------------|-----------|----------|----------|
| simple | **81.0%** | 61.5% | 69.2% | 65.5% |
| multiple | **79.0%** | 55.5% | 66.0% | 57.0% |
| parallel | 78.0% | 68.0% | **78.0%** | 77.0% |
| parallel_multiple | 64.5% | 51.5% | **66.5%** | **66.5%** |
| irrelevance | 81.2% | 86.2% | 85.4% | **90.4%** |
### Executable Python
| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | 8B | 14B |
|----------|---------------|-----------|-----|-----|
| exec_simple | 84.0% | 82.0% | 84.0% | **87.0%** |
| exec_multiple | 70.0% | 70.0% | **78.0%** | **78.0%** |
| exec_parallel | 80.0% | 76.0% | **86.0%** | **90.0%** |
| exec_parallel_multiple | 60.0% | 60.0% | **67.5%** | 65.0% |
### Live API Categories
| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|---------------|-----------|----------|----------|
| live_simple | **63.6%** | 43.8% | 51.2% | 51.6% |
| live_multiple | **55.0%** | 36.8% | 43.7% | 42.5% |
| live_parallel | **50.0%** | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | **66.7%** | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | **80.3%** | 78.7% | **79.9%** |
## 📚 Training Data
### Data Sources
| Source | Samples | Type | Description |
|--------|---------|------|-------------|
| [**xLAM**](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| [**ToolACE**](https://huggingface.co/datasets/Team-ACE/ToolACE) | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| [**Toucan-1.5M**](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) | 40,000 | **Negative** | Irrelevant queries (Server Shuffle method) |
| **Synthetic Negatives** | 6,000 | **Negative** | Domain mismatch, partial fulfillment, permission errors |
### Negative Sample Types
The model is trained to **refuse appropriately** using diverse negative samples:
| Type | Description | Example |
|------|-------------|---------|
| **Toucan Irrelevant** | Query has no matching tool in available functions | "What's the weather?" when only `get_stock_price` is available |
| **Domain Mismatch** | Tools from wrong domain | Asking about finance when only cooking tools available |
| **Action Mismatch** | Similar name but wrong action | Asking to "delete" when only "get" function exists |
| **Partial Fulfillment** | Tools can't fully solve query | Need 2 steps but only 1 tool available |
| **Permission/Auth** | Missing required permissions | Admin action without credentials |
| **Format Mismatch** | Wrong data format requirements | Tool expects JSON but query provides CSV |
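Each negative sample pairs a query with tools that cannot satisfy it, so the only rewarded behavior is a refusal. The sketch below shows what one such record might look like; the field names are illustrative, not the actual schema of the datasets above:

```python
# Hypothetical structure of one negative training sample (illustrative
# field names; the real datasets may use a different schema).
negative_sample = {
    "query": "What's the weather in Hanoi?",
    # Only a stock-price tool is available: a domain mismatch.
    "tools": [
        {
            "name": "get_stock_price",
            "description": "Get the latest price for a stock ticker",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        }
    ],
    "ground_truth": [],  # empty list: the correct behavior is to refuse
}

def is_refusal_expected(sample: dict) -> bool:
    """A sample expects a refusal when its ground truth has no tool calls."""
    return len(sample["ground_truth"]) == 0

print(is_refusal_expected(negative_sample))  # True
```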
## 🔧 Training Methodology
### Two-Stage RLVR Fine-tuning
1. **Stage 1**: Accuracy-focused training (V3)
- Trained from Qwen3-1.7B base
- Dataset: ~40K samples (stage2.parquet)
- Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
- Config: max_steps=5000, LR=5e-7, temp=1.2
- **Best checkpoint: step 100** (early stopping, highest accuracy)
2. **Stage 2**: Efficiency optimization (V4)
- Loaded from Stage 1 checkpoint-100
- Focus: Reduce verbosity, discourage `<think>` tags
- Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
- Config: max_steps=3000, LR=2e-7
- **Selected checkpoint: step 1100**
- **Result**: 36% reduction in response tokens
### Reward Function Design
```python
# Combined reward formula
total_reward = (
    format_weight * format_reward          # valid <tool_call> JSON (0.0-1.0)
    + correct_weight * correctness_reward  # tool name + arguments match (0.0-1.0)
    + refusal_weight * refusal_reward      # +1.0 correct refusal, -1.0 hallucination
    + efficiency_weight * efficiency_reward  # penalty for verbose <think>
)

# Stage 1 weights (accuracy focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,  # main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 weights (efficiency focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,  # reduced - already accurate from Stage 1
    'efficiency': 1.0,   # main focus - penalize <think> tags
    'refusal': 0.3,
}
```
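To make the formula concrete, here is the Stage 2 score for a correct, well-formatted call with no `<think>` block. The component values are purely illustrative:

```python
STAGE2_WEIGHTS = {'format': 0.1, 'correctness': 0.5, 'efficiency': 1.0, 'refusal': 0.3}

# Illustrative component values for a correct, concise tool call:
# perfect format and correctness, small efficiency bonus, refusal unused.
components = {'format': 1.0, 'correctness': 1.0, 'efficiency': 0.1, 'refusal': 0.0}

total = sum(STAGE2_WEIGHTS[k] * components[k] for k in STAGE2_WEIGHTS)
print(round(total, 2))  # 0.7
```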
### Individual Reward Components
| Component | Description | Range |
|-----------|-------------|-------|
| **format_reward** | Valid `<tool_call>JSON</tool_call>` structure | 0.0 - 1.0 |
| **correctness_reward** | Tool name match + argument similarity | 0.0 - 1.0 |
| **refusal_reward** | +1.0 correct refusal, **-1.0 hallucination** | -1.0 to +1.0 |
| **efficiency_reward** | Stage 1: -0.3 for `<think>`, Stage 2: **-1.0** | -1.0 to +0.1 |
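The components in the table can be sketched as simple scoring functions. The snippet below is an illustrative reconstruction, not the actual training code; the helper names and the exact regex are assumptions:

```python
import json
import re

def format_reward(response: str) -> float:
    """1.0 if the response contains a well-formed <tool_call> JSON block."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
    if not m:
        return 0.0
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if "name" in call and "arguments" in call else 0.0

def refusal_reward(response: str, ground_truth: list) -> float:
    """+1.0 for refusing when no tool applies, -1.0 for hallucinating a call."""
    made_call = "<tool_call>" in response
    if not ground_truth:  # empty ground truth: nothing is answerable
        return -1.0 if made_call else 1.0
    return 0.0  # answerable queries are scored by correctness_reward instead

def efficiency_reward(response: str, think_penalty: float = -1.0) -> float:
    """Stage 2 penalty for verbose <think> blocks, small bonus otherwise."""
    return think_penalty if "<think>" in response else 0.1
```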
### Key Training Innovations
1. **Strong Refusal Penalty**: -1.0 for calling tools when `ground_truth = []`
2. **Toucan Irrelevant Data**: 40K high-quality "unanswerable" samples
3. **Efficiency Optimization**: Rewarding direct tool calls without preambles
4. **Discourage `<think>` Tags**: Strong penalty (-1.0) for verbose reasoning blocks
## 🚀 Usage
### With Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```
### Expected Output
```xml
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
```
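The output above can be parsed with a few lines of standard-library Python. This is a minimal sketch that assumes at most one `<tool_call>` block per response:

```python
import json
import re

def parse_tool_call(response: str):
    """Extract the first <tool_call> JSON payload, or None if absent."""
    m = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", response, re.DOTALL)
    if m is None:
        return None  # e.g. a refusal message or plain-text answer
    return json.loads(m.group(1))

response = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Tokyo"}}\n</tool_call>'
call = parse_tool_call(response)
print(call["name"], call["arguments"])  # get_weather {'location': 'Tokyo'}
```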
### Refusal Example
When asked "What is the meaning of life?" with only `get_weather` tool available:
```
None of the provided tools can answer this question.
```
### With vLLM (Recommended for Production)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Reuse the chat-template prompt built in the Transformers example above
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
## 💡 Key Features
### ✅ Strengths
- **Compact Size**: Only 1.7B parameters, runs on consumer GPUs
- **High Accuracy**: Outperforms larger models (8B, 14B) on function calling
- **Efficient Responses**: Direct tool calls without verbose preambles
- **Strong Refusal**: Trained on 46K negative samples to avoid hallucination
- **Multilingual**: Supports English and Vietnamese
- **Chat Compatible**: Maintains general chat ability (100% on chatable benchmark)
### ⚠️ Limitations
- **Irrelevance detection**: More eager to call tools on unanswerable queries than the base model (-5 pp on `irrelevance`, -14 pp on `live_irrelevance`)
## 📝 Use Cases
### 🎯 Ideal For
This model is optimized for **edge deployment** and **customer service automation** where a small, efficient model is needed:
| Use Case | Description |
|----------|-------------|
| **Edge Device Deployment** | Run locally on devices with limited GPU/RAM |
| **Customer Service Chatbot** | Automate order lookup, ticket creation, FAQ with tool calls |
| **Voice Agent / Call Center** | Real-time voice-to-action for phone support systems |
| **IoT/Smart Home** | Control devices via function calling on edge hardware |
| **Mobile AI Assistant** | On-device tool execution without cloud dependency |
| **Cost-Efficient API Gateway** | Route requests to appropriate backend services |
### 💼 Customer Service Examples
```python
# Example: customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User (Vietnamese): "Đơn hàng #12345 của tôi ở đâu rồi?"
# ("Where is my order #12345?")
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User (Vietnamese): "Tôi muốn đổi trả sản phẩm"
# ("I want to return a product")
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>
```
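Parsed tool calls can then be routed to backend services. Below is a minimal dispatcher sketch; the handler implementations are placeholders invented for illustration, not part of the model card:

```python
import json
import re

# Placeholder backends standing in for real order/ticket services.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "in_transit"}

def create_ticket(issue: str, priority: str) -> dict:
    return {"ticket_id": "T-001", "issue": issue, "priority": priority}

HANDLERS = {"lookup_order": lookup_order, "create_ticket": create_ticket}

def dispatch(model_output: str):
    """Parse the model's <tool_call> block and invoke the matching handler."""
    m = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", model_output, re.DOTALL)
    if m is None:
        return None  # refusal or plain-text answer: surface it to the user
    call = json.loads(m.group(1))
    handler = HANDLERS.get(call["name"])
    if handler is None:
        raise ValueError(f"Unknown tool: {call['name']}")
    return handler(**call["arguments"])

output = '<tool_call>\n{"name": "lookup_order", "arguments": {"order_id": "12345"}}\n</tool_call>'
print(dispatch(output))  # {'order_id': '12345', 'status': 'in_transit'}
```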
### ⚡ Why Small Model?
| Benefit | Description |
|---------|-------------|
| **Low Latency** | ~50ms inference on consumer GPU |
| **Low Cost** | 8x cheaper than 14B model to deploy |
| **Privacy** | Run entirely on-premise, no data leaves device |
| **Offline Capable** | Works without internet connection |
### 🧠 Reduced Catastrophic Forgetting
This model uses **RLVR (Reinforcement Learning with Verifiable Rewards)** instead of traditional SFT, which helps reduce capability loss:
- **Less forgetting than SFT**: RLVR fine-tunes through reward signals rather than directly overwriting weights
- **100% chatable score**: Model maintains normal conversation ability on BFCL benchmark
- **Multilingual preserved**: English and Vietnamese capabilities remain functional
- **Lower risk**: Compared to SFT, RLVR typically causes less regression on non-target tasks
## 🔬 Technical Details
| Attribute | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Training Steps | 100 (Stage 1/V3) + 3000 (Stage 2/V4) |
| Peak LR | 5e-7 (Stage 1) → 2e-7 (Stage 2) |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32768 tokens |
| Tool Format | XML-style (`<tool_call>...</tool_call>`) |
## 📚 Citation
If you use this model, please cite:
```bibtex
@misc{qwen3-fc,
title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
author={ContextboxAI},
year={2024},
howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}
```
## 🙏 Acknowledgments
- [Qwen Team](https://github.com/QwenLM/Qwen3) for the excellent base model
- [Jan-nano](https://arxiv.org/pdf/2506.22760) for training methodology inspiration
- [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) for the benchmark
- [xLAM (Salesforce)](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) for function calling data
- [ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) for multi-turn tool usage data
- [Toucan-1.5M (Agent-Ark)](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) for irrelevant/negative samples
- [TRL](https://github.com/huggingface/trl) for GRPO implementation
## 📄 License
Apache 2.0
---
**Model Card Contact**: ContextboxAI