---
license: apache-2.0
language:
- en
- vi
base_model: Qwen/Qwen3-1.7B
tags:
- function-calling
- tool-use
- qwen3
- grpo
- rl-fine-tuned
datasets:
- Salesforce/xlam-function-calling-60k
- Team-ACE/ToolACE
- Agent-Ark/Toucan-1.5M
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-1.7B-FC: Function Calling Specialist

A function calling model based on Qwen3-1.7B, fine-tuned using **RLVR (Reinforcement Learning with Verifiable Rewards)** to improve tool-use capabilities on the BFCL V3 benchmark.

## 🏆 Performance Highlights

| Model | Size | BFCL Overall | Category Avg |
|-------|------|--------------|--------------|
| **Qwen3-1.7B-FC (Ours)** | **1.7B** | **54.2%** | **50.8%** |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |

### Response Efficiency

| Model | Avg Response Tokens | Tokens vs Base |
|-------|---------------------|----------------|
| Base Qwen3-1.7B | 35.6 tokens | - |
| **Qwen3-1.7B-FC (Ours)** | **22.7 tokens** | **-36%** |

The fine-tuned model generates **36% fewer tokens** while maintaining higher accuracy, thanks to:
- Direct tool calls without verbose preambles
- Concise refusal messages ("None of the provided tools can answer this question")
- Fewer `<think>` reasoning blocks

## 📊 Detailed Benchmark Results (BFCL V3)

### Core Function Calling

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| simple | **81.0%** | 61.5% | 69.2% | 65.5% |
| multiple | **79.0%** | 55.5% | 66.0% | 57.0% |
| parallel | **78.0%** | 68.0% | **78.0%** | 77.0% |
| parallel_multiple | 64.5% | 51.5% | **66.5%** | **66.5%** |
| irrelevance | 81.2% | 86.2% | 85.4% | **90.4%** |

### Executable Python

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | 8B | 14B |
|----------|----------------------|-----------|-----|-----|
| exec_simple | 84.0% | 82.0% | 84.0% | **87.0%** |
| exec_multiple | 70.0% | 70.0% | **78.0%** | **78.0%** |
| exec_parallel | 80.0% | 76.0% | 86.0% | **90.0%** |
| exec_parallel_multiple | 60.0% | 60.0% | **67.5%** | 65.0% |

### Live API Categories

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| live_simple | **63.6%** | 43.8% | 51.2% | 51.6% |
| live_multiple | **55.0%** | 36.8% | 43.7% | 42.5% |
| live_parallel | **50.0%** | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | **66.7%** | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | **80.3%** | 78.7% | 79.9% |

## 📚 Training Data

### Data Sources

| Source | Samples | Type | Description |
|--------|---------|------|-------------|
| [**xLAM**](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| [**ToolACE**](https://huggingface.co/datasets/Team-ACE/ToolACE) | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| [**Toucan-1.5M**](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) | 40,000 | **Negative** | Irrelevant queries (Server Shuffle method) |
| **Synthetic Negatives** | 6,000 | **Negative** | Domain mismatch, partial fulfillment, permission errors |

### Negative Sample Types

The model is trained to **refuse appropriately** using diverse negative samples:

| Type | Description | Example |
|------|-------------|---------|
| **Toucan Irrelevant** | Query has no matching tool in available functions | "What's the weather?" when only `get_stock_price` is available |
| **Domain Mismatch** | Tools from the wrong domain | Asking about finance when only cooking tools are available |
| **Action Mismatch** | Similar name but wrong action | Asking to "delete" when only a "get" function exists |
| **Partial Fulfillment** | Tools can't fully solve the query | Task needs 2 steps but only 1 tool is available |
| **Permission/Auth** | Missing required permissions | Admin action without credentials |
| **Format Mismatch** | Wrong data format requirements | Tool expects JSON but the query provides CSV |
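To make the Server Shuffle idea concrete, here is a minimal sketch of how such irrelevant samples could be constructed by re-pairing queries with tool sets from unrelated records. This is an illustration only; the field names (`query`, `tools`, `answers`) are assumptions, not the actual Toucan schema or pipeline.

```python
import random

def make_irrelevant_samples(records, seed=42):
    """Server-Shuffle-style negatives (sketch): pair each query with the
    tool set of a *different* record, so no available tool can answer it.
    `records` is assumed to be a list of {"query", "tools", "answers"} dicts."""
    rng = random.Random(seed)
    shuffled_tools = [r["tools"] for r in records]
    rng.shuffle(shuffled_tools)

    negatives = []
    for record, wrong_tools in zip(records, shuffled_tools):
        if wrong_tools == record["tools"]:
            continue  # skip the rare case where a record keeps its own tools
        negatives.append({
            "query": record["query"],
            "tools": wrong_tools,  # tools borrowed from an unrelated record
            "answers": [],         # empty ground truth => correct behavior is refusal
        })
    return negatives
```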
## 🔧 Training Methodology

### Two-Stage RLVR Fine-tuning

1. **Stage 1**: Accuracy-focused training (V3)
   - Trained from the Qwen3-1.7B base
   - Dataset: ~40K samples (stage2.parquet)
   - Reward weights: Correctness=1.0, Format=0.2, Efficiency=0.2, Refusal=0.3 (see code below)
   - Config: max_steps=5000, LR=5e-7, temp=1.2
   - **Best checkpoint: step 100** (early stopping, highest accuracy)

2. **Stage 2**: Efficiency optimization (V4)
   - Loaded from Stage 1 checkpoint-100
   - Focus: reduce verbosity, discourage `<think>` tags
   - Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
   - Config: max_steps=3000, LR=2e-7
   - **Selected checkpoint: step 1100**
   - **Result**: 36% reduction in response tokens

### Reward Function Design

```python
# Combined Reward Formula
total_reward = (
    format_weight * format_reward +        # Valid JSON (0.0-1.0)
    correct_weight * correctness_reward +  # Tool name + arguments match (0.0-1.0)
    refusal_weight * refusal_reward +      # +1.0 correct refusal, -1.0 hallucination
    efficiency_weight * efficiency_reward  # Penalty for verbose output
)

# Stage 1 Weights (Accuracy Focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,  # Main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 Weights (Efficiency Focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,  # Reduced - already accurate from Stage 1
    'efficiency': 1.0,   # Main focus - penalize <think> tags
    'refusal': 0.3,
}
```

### Individual Reward Components

| Component | Description | Range |
|-----------|-------------|-------|
| **format_reward** | Valid JSON structure | 0.0 to 1.0 |
| **correctness_reward** | Tool name match + argument similarity | 0.0 to 1.0 |
| **refusal_reward** | +1.0 correct refusal, **-1.0 hallucination** | -1.0 to +1.0 |
| **efficiency_reward** | `<think>` penalty: -0.3 in Stage 1, **-1.0** in Stage 2 | -1.0 to +0.1 |

### Key Training Innovations

1. **Strong Refusal Penalty**: -1.0 for calling tools when `ground_truth = []`
2. **Toucan Irrelevant Data**: 40K high-quality "unanswerable" samples
3. **Efficiency Optimization**: rewarding direct tool calls without preambles
4. **Discouraged `<think>` Tags**: strong penalty (-1.0) for verbose reasoning blocks (see the sketch below)
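For illustration, the refusal and efficiency components could be implemented as follows. This is a sketch consistent with the ranges in the table above, not the exact training code; in particular the 256-token normalizer for the brevity bonus is an assumption.

```python
def refusal_reward(response: str, ground_truth: list) -> float:
    """+1.0 for correctly refusing when no tool applies,
    -1.0 for hallucinating a call against an empty ground truth."""
    called_tool = "<tool_call>" in response
    if not ground_truth:  # no valid tool call exists for this query
        return 1.0 if not called_tool else -1.0
    return 0.0  # expected calls are scored by correctness_reward instead


def efficiency_reward(response: str, num_tokens: int,
                      think_penalty: float = -1.0) -> float:
    """Penalize <think> blocks (-0.3 in Stage 1, -1.0 in Stage 2) and give
    a small brevity bonus capped at +0.1 (256-token normalizer is assumed)."""
    if "<think>" in response:
        return think_penalty
    return max(0.0, 0.1 * (1.0 - num_tokens / 256.0))
```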
## 🚀 Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Expected Output

```xml
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
```

### Refusal Example

When asked "What is the meaning of life?" with only the `get_weather` tool available:

```
None of the provided tools can answer this question.
```

### With vLLM (Recommended for Production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Generate with the same prompt format as above
outputs = llm.generate([prompt], sampling_params)
```

## 💡 Key Features

### ✅ Strengths

- **Compact Size**: Only 1.7B parameters, runs on consumer GPUs
- **High Accuracy**: Outperforms larger models (8B, 14B) on function calling
- **Efficient Responses**: Direct tool calls without verbose preambles
- **Strong Refusal**: Trained on 46K negative samples to avoid hallucination
- **Multilingual**: Supports English and Vietnamese
- **Chat Compatible**: Maintains general chat ability (100% on BFCL's chatable category)

### ⚠️ Limitations

- **Irrelevance Detection**: More aggressive at calling tools than the base model (irrelevance: -5.0 points; live_irrelevance: -14.2 points)

## 📝 Use Cases

### 🎯 Ideal For

This model is optimized for **edge deployment** and **customer service automation** where a small, efficient model is needed:

| Use Case | Description |
|----------|-------------|
| **Edge Device Deployment** | Run locally on devices with limited GPU/RAM |
| **Customer Service Chatbot** | Automate order lookup, ticket creation, FAQ with tool calls |
| **Voice Agent / Call Center** | Real-time voice-to-action for phone support systems |
| **IoT/Smart Home** | Control devices via function calling on edge hardware |
| **Mobile AI Assistant** | On-device tool execution without cloud dependency |
| **Cost-Efficient API Gateway** | Route requests to appropriate backend services |

### 💼 Customer Service Examples

```python
# Example: Customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User: "Đơn hàng #12345 của tôi ở đâu rồi?" (Where is my order #12345?)
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User: "Tôi muốn đổi trả sản phẩm" (I want to return a product)
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>
```
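On the application side, the XML-style output can be parsed and dispatched with a few lines of Python. This sketch reuses the `response` string produced in the Transformers example above; the regex and the `BACKENDS` dispatch table are illustrative, not part of the model's API.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(response: str) -> list:
    """Extract {"name", "arguments"} dicts from <tool_call> blocks;
    returns [] when the model refuses (no block emitted)."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed blocks rather than crash
    return calls

# Hypothetical backend handlers keyed by tool name
BACKENDS = {
    "lookup_order": lambda order_id: f"Order {order_id}: shipped",
    "create_ticket": lambda issue, priority: f"Ticket created ({issue}, {priority})",
}

for call in parse_tool_calls(response):
    handler = BACKENDS.get(call["name"])
    if handler:
        print(handler(**call["arguments"]))
```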
### ⚡ Why a Small Model?

| Benefit | Description |
|---------|-------------|
| **Low Latency** | ~50 ms inference on a consumer GPU |
| **Low Cost** | ~8× cheaper to deploy than a 14B model |
| **Privacy** | Runs entirely on-premise; no data leaves the device |
| **Offline Capable** | Works without an internet connection |

### 🧠 Reduced Catastrophic Forgetting

This model uses **RLVR (Reinforcement Learning with Verifiable Rewards)** instead of traditional SFT, which helps reduce capability loss:

- **Less forgetting than SFT**: RLVR fine-tunes through reward signals rather than directly overwriting weights
- **100% chatable score**: The model maintains normal conversation ability on the BFCL benchmark
- **Multilingual preserved**: English and Vietnamese capabilities remain functional
- **Lower risk**: Compared to SFT, RLVR typically causes less regression on non-target tasks

## 🔬 Technical Details

| Attribute | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Selected Checkpoints | Step 100 (Stage 1) → step 1100 (Stage 2) |
| Learning Rate | 5e-7 (Stage 1) → 2e-7 (Stage 2) |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32,768 tokens |
| Tool Format | XML-style (`<tool_call>...</tool_call>`) |

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{qwen3-fc,
  title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
  author={ContextboxAI},
  year={2024},
  howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}
```

## 🙏 Acknowledgments

- [Qwen Team](https://github.com/QwenLM/Qwen3) for the excellent base model
- [Jan-nano](https://arxiv.org/pdf/2506.22760) for training methodology inspiration
- [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) for the benchmark
- [xLAM (Salesforce)](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) for function calling data
- [ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) for multi-turn tool usage data
- [Toucan-1.5M (Agent-Ark)](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) for irrelevant/negative samples
- [TRL](https://github.com/huggingface/trl) for the GRPO implementation

## 📄 License

Apache 2.0

---

**Model Card Contact**: ContextboxAI