---
license: apache-2.0
language:
- en
- vi
base_model: Qwen/Qwen3-1.7B
tags:
- function-calling
- tool-use
- qwen3
- grpo
- rl-fine-tuned
datasets:
- Salesforce/xlam-function-calling-60k
- Team-ACE/ToolACE
- Agent-Ark/Toucan-1.5M
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-1.7B-FC: Function Calling Specialist

A function calling model based on Qwen3-1.7B, fine-tuned with **RLVR (Reinforcement Learning with Verifiable Rewards)** to improve tool-use capabilities on the BFCL V3 benchmark.

## 🏆 Performance Highlights

| Model | Size | BFCL Overall | Category Avg |
|-------|------|--------------|--------------|
| **Qwen3-1.7B-FC (Ours)** | **1.7B** | **54.2%** | **50.8%** |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |

### Response Efficiency

| Model | Avg Response Tokens | Efficiency vs Base |
|-------|---------------------|--------------------|
| Base Qwen3-1.7B | 35.6 tokens | - |
| **Qwen3-1.7B-FC (Ours)** | **22.7 tokens** | **-36%** |

The fine-tuned model generates **36% fewer tokens** while maintaining higher accuracy, thanks to:

- Direct tool calls without verbose preambles
- Concise refusal messages ("None of the provided tools can answer this question")
- Reduced `<think>` reasoning blocks
## 📊 Detailed Benchmark Results (BFCL V3)

### Core Function Calling

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| simple | **81.0%** | 61.5% | 69.2% | 65.5% |
| multiple | **79.0%** | 55.5% | 66.0% | 57.0% |
| parallel | 78.0% | 68.0% | **78.0%** | 77.0% |
| parallel_multiple | 64.5% | 51.5% | **66.5%** | **66.5%** |
| irrelevance | 81.2% | 86.2% | 85.4% | **90.4%** |

### Executable Python

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| exec_simple | 84.0% | 82.0% | 84.0% | **87.0%** |
| exec_multiple | 70.0% | 70.0% | **78.0%** | **78.0%** |
| exec_parallel | 80.0% | 76.0% | **86.0%** | **90.0%** |
| exec_parallel_multiple | 60.0% | 60.0% | **67.5%** | 65.0% |

### Live API Categories

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| live_simple | **63.6%** | 43.8% | 51.2% | 51.6% |
| live_multiple | **55.0%** | 36.8% | 43.7% | 42.5% |
| live_parallel | **50.0%** | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | **66.7%** | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | **80.3%** | 78.7% | 79.9% |
## 📚 Training Data

### Data Sources

| Source | Samples | Type | Description |
|--------|---------|------|-------------|
| [**xLAM**](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| [**ToolACE**](https://huggingface.co/datasets/Team-ACE/ToolACE) | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| [**Toucan-1.5M**](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) | 40,000 | **Negative** | Irrelevant queries (Server Shuffle method) |
| **Synthetic Negatives** | 6,000 | **Negative** | Domain mismatch, partial fulfillment, permission errors |

### Negative Sample Types

The model is trained to **refuse appropriately** using diverse negative samples:

| Type | Description | Example |
|------|-------------|---------|
| **Toucan Irrelevant** | Query has no matching tool among the available functions | "What's the weather?" when only `get_stock_price` is available |
| **Domain Mismatch** | Tools from the wrong domain | Asking about finance when only cooking tools are available |
| **Action Mismatch** | Similar name but wrong action | Asking to "delete" when only a "get" function exists |
| **Partial Fulfillment** | Tools can't fully solve the query | The task needs two steps but only one tool is available |
| **Permission/Auth** | Missing required permissions | Admin action without credentials |
| **Format Mismatch** | Wrong data format requirements | Tool expects JSON but the query provides CSV |
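The card names the "Server Shuffle" method for the Toucan negatives but does not show its implementation. The core idea can be sketched in a few lines: pair each query with the tool list of a *different* sample, so no attached tool matches and the correct behavior is refusal (function and field names here are illustrative, not the actual pipeline):

```python
def make_shuffle_negatives(samples):
    """Pair each query with the next sample's tool list (a rotation),
    so none of the attached tools match the query and the target is refusal."""
    tool_lists = [s["tools"] for s in samples]
    rotated = tool_lists[1:] + tool_lists[:1]  # simple derangement
    return [
        {"query": s["query"], "tools": tools, "ground_truth": []}
        for s, tools in zip(samples, rotated)
    ]

samples = [
    {"query": "What's the weather in Tokyo?", "tools": ["get_weather"]},
    {"query": "What's the price of AAPL?", "tools": ["get_stock_price"]},
]
negatives = make_shuffle_negatives(samples)
print(negatives[0])  # {'query': "What's the weather in Tokyo?", 'tools': ['get_stock_price'], 'ground_truth': []}
```

The empty `ground_truth` marks the sample as unanswerable, which is what the refusal reward below keys on.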
## 🔧 Training Methodology

### Two-Stage RLVR Fine-tuning

1. **Stage 1**: Accuracy-focused training (V3)
   - Trained from the Qwen3-1.7B base model
   - Dataset: ~40K samples (stage2.parquet)
   - Reward weights: Correctness=1.0, Format=0.2, Efficiency=0.2, Refusal=0.3
   - Config: max_steps=5000, LR=5e-7, temp=1.2
   - **Best checkpoint: step 100** (early stopping, highest accuracy)

2. **Stage 2**: Efficiency optimization (V4)
   - Loaded from the Stage 1 checkpoint-100
   - Focus: reduce verbosity and discourage `<think>` tags
   - Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
   - Config: max_steps=3000, LR=2e-7
   - **Selected checkpoint: step 1100**
   - **Result**: 36% reduction in response tokens

### Reward Function Design
```python
# Combined reward formula
total_reward = (
    format_weight * format_reward +        # Valid <tool_call> JSON (0.0-1.0)
    correct_weight * correctness_reward +  # Tool name + arguments match (0.0-1.0)
    refusal_weight * refusal_reward +      # +1.0 correct refusal, -1.0 hallucination
    efficiency_weight * efficiency_reward  # Penalty for verbose <think>
)

# Stage 1 weights (accuracy focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,  # Main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 weights (efficiency focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,  # Reduced - already accurate from Stage 1
    'efficiency': 1.0,   # Main focus - penalize <think> tags
    'refusal': 0.3,
}
```
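The individual reward components are not shown in the card. As a rough illustration, here is a minimal self-contained sketch of what verifiable `format_reward` and `refusal_reward` functions could look like, consistent with the behaviors described above (the names and exact scoring rules are assumptions, not the training code):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_calls(completion):
    """Parse the JSON payload of every well-formed <tool_call> block."""
    calls = []
    for payload in TOOL_CALL_RE.findall(completion):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            pass
    return calls

def format_reward(completion):
    """1.0 when at least one <tool_call> block exists and every payload parses as JSON."""
    blocks = TOOL_CALL_RE.findall(completion)
    return 1.0 if blocks and len(extract_calls(completion)) == len(blocks) else 0.0

def refusal_reward(completion, ground_truth):
    """+1.0 for refusing when no tool applies (empty ground_truth),
    -1.0 for hallucinating a call on an unanswerable sample."""
    called = bool(extract_calls(completion))
    if not ground_truth:  # unanswerable sample
        return 1.0 if not called else -1.0
    return 0.0

ok = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Tokyo"}}\n</tool_call>'
print(format_reward(ok))                       # 1.0
print(refusal_reward("No tool applies.", []))  # 1.0
print(refusal_reward(ok, []))                  # -1.0
```

Because both signals are computed mechanically from the completion and the ground truth, they are verifiable in the RLVR sense: no learned reward model is involved.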
### Individual Reward Components

| Component | Description | Range |
|-----------|-------------|-------|
| **format_reward** | Valid `<tool_call>JSON</tool_call>` structure | 0.0 to +1.0 |
| **correctness_reward** | Tool name match + argument similarity | 0.0 to +1.0 |
| **refusal_reward** | +1.0 correct refusal, **-1.0 hallucination** | -1.0 to +1.0 |
| **efficiency_reward** | Stage 1: -0.3 for `<think>`, Stage 2: **-1.0** | -1.0 to +0.1 |

### Key Training Innovations

1. **Strong Refusal Penalty**: -1.0 for calling tools when `ground_truth = []`
2. **Toucan Irrelevant Data**: 40K high-quality "unanswerable" samples
3. **Efficiency Optimization**: rewarding direct tool calls without preambles
4. **Discouraging `<think>` Tags**: a strong penalty (-1.0) for verbose reasoning blocks
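The efficiency term is simple enough to sketch directly from its stated range (-1.0 to +0.1): a hard penalty when a `<think>` block appears, otherwise a small brevity bonus. This is a hypothetical reconstruction, not the training code; per the table, Stage 1 would use a milder penalty of -0.3:

```python
import re

def efficiency_reward(completion, think_penalty=-1.0):
    """Stage-2-style efficiency term: hard penalty for <think> blocks,
    small bonus (+0.1) for going straight to the tool call."""
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        return think_penalty
    return 0.1

print(efficiency_reward("<think>Let me reason...</think>answer"))  # -1.0
print(efficiency_reward('<tool_call>{"name": "f", "arguments": {}}</tool_call>'))  # 0.1
```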
## 🚀 Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
### Expected Output

```xml
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
```

### Refusal Example

When asked "What is the meaning of life?" with only the `get_weather` tool available:

```
None of the provided tools can answer this question.
```
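On the application side, both output forms above (a `<tool_call>` block or a refusal string) can be handled with a small standard-library dispatcher. This is a sketch; the `registry` and the weather function are illustrative stand-ins for your own tool implementations:

```python
import json
import re

def dispatch(response, registry):
    """Execute the first <tool_call> against a local function registry,
    or pass through refusals / plain-text answers unchanged."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
    if m is None:
        return response.strip()  # refusal or plain-text answer
    call = json.loads(m.group(1))
    return registry[call["name"]](**call["arguments"])

registry = {"get_weather": lambda location: f"Sunny in {location}"}
response = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Tokyo"}}\n</tool_call>'
print(dispatch(response, registry))  # Sunny in Tokyo
print(dispatch("None of the provided tools can answer this question.", registry))
```

A production loop would feed the tool result back as a `tool` message and generate again; this sketch only covers the single-turn case.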
### With vLLM (Recommended for Production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Generate with the same prompt format as above
outputs = llm.generate([prompt], sampling_params)
```
## 💡 Key Features

### ✅ Strengths

- **Compact Size**: only 1.7B parameters; runs on consumer GPUs
- **High Accuracy**: outperforms larger models (8B, 14B) on function calling
- **Efficient Responses**: direct tool calls without verbose preambles
- **Strong Refusal**: trained on 46K negative samples to avoid hallucination
- **Multilingual**: supports English and Vietnamese
- **Chat Compatible**: maintains general chat ability (100% on the chatable benchmark)

### ⚠️ Limitations

- **Irrelevance**: slightly more prone to calling a tool when none applies (irrelevance score ~5 points below base)
## 📝 Use Cases

### 🎯 Ideal For

This model is optimized for **edge deployment** and **customer service automation**, where a small, efficient model is needed:

| Use Case | Description |
|----------|-------------|
| **Edge Device Deployment** | Run locally on devices with limited GPU/RAM |
| **Customer Service Chatbot** | Automate order lookup, ticket creation, and FAQ with tool calls |
| **Voice Agent / Call Center** | Real-time voice-to-action for phone support systems |
| **IoT/Smart Home** | Control devices via function calling on edge hardware |
| **Mobile AI Assistant** | On-device tool execution without cloud dependency |
| **Cost-Efficient API Gateway** | Route requests to appropriate backend services |
### 💼 Customer Service Examples

```python
# Example: customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User: "Đơn hàng #12345 của tôi ở đâu rồi?" ("Where is my order #12345?")
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User: "Tôi muốn đổi trả sản phẩm" ("I want to return a product")
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>
```
### ⚡ Why a Small Model?

| Benefit | Description |
|---------|-------------|
| **Low Latency** | ~50ms inference on a consumer GPU |
| **Low Cost** | ~8x cheaper to deploy than a 14B model |
| **Privacy** | Runs entirely on-premise; no data leaves the device |
| **Offline Capable** | Works without an internet connection |

### 🧠 Reduced Catastrophic Forgetting

This model uses **RLVR (Reinforcement Learning with Verifiable Rewards)** instead of traditional SFT, which helps reduce capability loss:

- **Less forgetting than SFT**: RLVR fine-tunes through on-policy reward signals rather than forcing imitation of a fixed target distribution
- **100% chatable score**: the model maintains normal conversation ability on the BFCL benchmark
- **Multilingual preserved**: English and Vietnamese capabilities remain functional
- **Lower risk**: compared to SFT, RLVR typically causes less regression on non-target tasks
## 🔬 Technical Details

| Attribute | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Training Steps | 100 (V3) + 3000 (V4) |
| Peak LR | 1e-6 → 2e-7 |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32768 tokens |
| Tool Format | XML-style (`<tool_call>...</tool_call>`) |
## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{qwen3-fc,
  title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
  author={ContextboxAI},
  year={2024},
  howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}
```
## 🙏 Acknowledgments

- [Qwen Team](https://github.com/QwenLM/Qwen3) for the excellent base model
- [Jan-nano](https://arxiv.org/pdf/2506.22760) for training methodology inspiration
- [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) for the benchmark
- [xLAM (Salesforce)](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) for function calling data
- [ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) for multi-turn tool usage data
- [Toucan-1.5M (Agent-Ark)](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) for irrelevant/negative samples
- [TRL](https://github.com/huggingface/trl) for the GRPO implementation

## 📄 License

Apache 2.0

---

**Model Card Contact**: ContextboxAI