---
license: apache-2.0
language:
- en
- vi
base_model: Qwen/Qwen3-1.7B
tags:
- function-calling
- tool-use
- qwen3
- grpo
- rl-fine-tuned
datasets:
- Salesforce/xlam-function-calling-60k
- Team-ACE/ToolACE
- Agent-Ark/Toucan-1.5M
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-1.7B-FC: Function Calling Specialist

A function calling model based on Qwen3-1.7B, fine-tuned using **RLVR (Reinforcement Learning with Verifiable Rewards)** to improve tool-use capabilities on the BFCL V3 benchmark.

## 🏆 Performance Highlights

| Model | Size | BFCL Overall | Category Avg |
|-------|------|--------------|--------------|
| **Qwen3-1.7B-FC (Ours)** | **1.7B** | **54.2%** | **50.8%** |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |

### Response Efficiency

| Model | Avg Response Tokens | Tokens vs Base |
|-------|---------------------|----------------|
| Base Qwen3-1.7B | 35.6 tokens | - |
| **Qwen3-1.7B-FC (Ours)** | **22.7 tokens** | **-36%** |

The fine-tuned model generates **36% fewer tokens** while maintaining higher accuracy, thanks to:
- Direct tool calls without verbose preambles
- Concise refusal messages ("None of the provided tools can answer this question")
- Fewer `<think>` reasoning blocks

## 📊 Detailed Benchmark Results (BFCL V3)

### Core Function Calling

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| simple | **81.0%** | 61.5% | 69.2% | 65.5% |
| multiple | **79.0%** | 55.5% | 66.0% | 57.0% |
| parallel | **78.0%** | 68.0% | **78.0%** | 77.0% |
| parallel_multiple | 64.5% | 51.5% | **66.5%** | **66.5%** |
| irrelevance | 81.2% | 86.2% | 85.4% | **90.4%** |

### Executable Python

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | 8B | 14B |
|----------|----------------------|-----------|-----|-----|
| exec_simple | 84.0% | 82.0% | 84.0% | **87.0%** |
| exec_multiple | 70.0% | 70.0% | **78.0%** | **78.0%** |
| exec_parallel | 80.0% | 76.0% | 86.0% | **90.0%** |
| exec_parallel_multiple | 60.0% | 60.0% | **67.5%** | 65.0% |

### Live API Categories

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|----------|----------------------|-----------|----------|-----------|
| live_simple | **63.6%** | 43.8% | 51.2% | 51.6% |
| live_multiple | **55.0%** | 36.8% | 43.7% | 42.5% |
| live_parallel | **50.0%** | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | **66.7%** | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | **80.3%** | 78.7% | 79.9% |

## 📚 Training Data

### Data Sources

| Source | Samples | Type | Description |
|--------|---------|------|-------------|
| [**xLAM**](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| [**ToolACE**](https://huggingface.co/datasets/Team-ACE/ToolACE) | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| [**Toucan-1.5M**](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) | 40,000 | **Negative** | Irrelevant queries (Server Shuffle method) |
| **Synthetic Negatives** | 6,000 | **Negative** | Domain mismatch, partial fulfillment, permission errors |

### Negative Sample Types

The model is trained to **refuse appropriately** using diverse negative samples:

| Type | Description | Example |
|------|-------------|---------|
| **Toucan Irrelevant** | Query has no matching tool in available functions | "What's the weather?" when only `get_stock_price` is available |
| **Domain Mismatch** | Tools from the wrong domain | Asking about finance when only cooking tools are available |
| **Action Mismatch** | Similar name but wrong action | Asking to "delete" when only a "get" function exists |
| **Partial Fulfillment** | Tools can't fully solve the query | Task needs 2 steps but only 1 tool is available |
| **Permission/Auth** | Missing required permissions | Admin action without credentials |
| **Format Mismatch** | Wrong data format requirements | Tool expects JSON but the query provides CSV |
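To make the Server Shuffle idea concrete, here is a minimal sketch of how such irrelevant samples could be constructed by re-pairing queries with tool sets from unrelated records. This is an illustration only; the field names (`query`, `tools`, `answers`) are assumptions, not the actual Toucan schema or pipeline.

```python
import random

def make_irrelevant_samples(records, seed=42):
    """Server-Shuffle-style negatives (sketch): pair each query with the
    tool set of a *different* record, so no available tool can answer it.
    `records` is assumed to be a list of {"query", "tools", "answers"} dicts."""
    rng = random.Random(seed)
    shuffled_tools = [r["tools"] for r in records]
    rng.shuffle(shuffled_tools)

    negatives = []
    for record, wrong_tools in zip(records, shuffled_tools):
        if wrong_tools == record["tools"]:
            continue  # skip the rare case where a record keeps its own tools
        negatives.append({
            "query": record["query"],
            "tools": wrong_tools,  # tools borrowed from an unrelated record
            "answers": [],         # empty ground truth => correct behavior is refusal
        })
    return negatives
```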
## 🔧 Training Methodology

### Two-Stage RLVR Fine-tuning

1. **Stage 1**: Accuracy-focused training (V3)
   - Trained from the Qwen3-1.7B base
   - Dataset: ~40K samples (stage2.parquet)
   - Reward weights: Correctness=1.0, Format=0.2, Efficiency=0.2, Refusal=0.3 (see code below)
   - Config: max_steps=5000, LR=5e-7, temp=1.2
   - **Best checkpoint: step 100** (early stopping, highest accuracy)

2. **Stage 2**: Efficiency optimization (V4)
   - Loaded from Stage 1 checkpoint-100
   - Focus: reduce verbosity, discourage `<think>` tags
   - Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
   - Config: max_steps=3000, LR=2e-7
   - **Selected checkpoint: step 1100**
   - **Result**: 36% reduction in response tokens

### Reward Function Design

```python
# Combined Reward Formula
total_reward = (
    format_weight * format_reward +        # Valid JSON (0.0-1.0)
    correct_weight * correctness_reward +  # Tool name + arguments match (0.0-1.0)
    refusal_weight * refusal_reward +      # +1.0 correct refusal, -1.0 hallucination
    efficiency_weight * efficiency_reward  # Penalty for verbose output
)

# Stage 1 Weights (Accuracy Focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,  # Main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 Weights (Efficiency Focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,  # Reduced - already accurate from Stage 1
    'efficiency': 1.0,   # Main focus - penalize <think> tags
    'refusal': 0.3,
}
```

### Individual Reward Components

| Component | Description | Range |
|-----------|-------------|-------|
| **format_reward** | Valid JSON structure | 0.0 to 1.0 |
| **correctness_reward** | Tool name match + argument similarity | 0.0 to 1.0 |
| **refusal_reward** | +1.0 correct refusal, **-1.0 hallucination** | -1.0 to +1.0 |
| **efficiency_reward** | `<think>` penalty: -0.3 in Stage 1, **-1.0** in Stage 2 | -1.0 to +0.1 |

### Key Training Innovations

1. **Strong Refusal Penalty**: -1.0 for calling tools when `ground_truth = []`
2. **Toucan Irrelevant Data**: 40K high-quality "unanswerable" samples
3. **Efficiency Optimization**: rewarding direct tool calls without preambles
4. **Discouraged `<think>` Tags**: strong penalty (-1.0) for verbose reasoning blocks (see the sketch below)
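For illustration, the refusal and efficiency components could be implemented as follows. This is a sketch consistent with the ranges in the table above, not the exact training code; in particular the 256-token normalizer for the brevity bonus is an assumption.

```python
def refusal_reward(response: str, ground_truth: list) -> float:
    """+1.0 for correctly refusing when no tool applies,
    -1.0 for hallucinating a call against an empty ground truth."""
    called_tool = "<tool_call>" in response
    if not ground_truth:  # no valid tool call exists for this query
        return 1.0 if not called_tool else -1.0
    return 0.0  # expected calls are scored by correctness_reward instead


def efficiency_reward(response: str, num_tokens: int,
                      think_penalty: float = -1.0) -> float:
    """Penalize <think> blocks (-0.3 in Stage 1, -1.0 in Stage 2) and give
    a small brevity bonus capped at +0.1 (256-token normalizer is assumed)."""
    if "<think>" in response:
        return think_penalty
    return max(0.0, 0.1 * (1.0 - num_tokens / 256.0))
```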
## 🚀 Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Expected Output

```xml
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
```

### Refusal Example

When asked "What is the meaning of life?" with only the `get_weather` tool available:

```
None of the provided tools can answer this question.
```

### With vLLM (Recommended for Production)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Generate with the same prompt format as above
outputs = llm.generate([prompt], sampling_params)
```

## 💡 Key Features

### ✅ Strengths

- **Compact Size**: Only 1.7B parameters, runs on consumer GPUs
- **High Accuracy**: Outperforms larger models (8B, 14B) on function calling
- **Efficient Responses**: Direct tool calls without verbose preambles
- **Strong Refusal**: Trained on 46K negative samples to avoid hallucination
- **Multilingual**: Supports English and Vietnamese
- **Chat Compatible**: Maintains general chat ability (100% on BFCL's chatable category)

### ⚠️ Limitations

- **Irrelevance Detection**: More aggressive at calling tools than the base model (irrelevance: -5.0 points; live_irrelevance: -14.2 points)

## 📝 Use Cases

### 🎯 Ideal For

This model is optimized for **edge deployment** and **customer service automation** where a small, efficient model is needed:

| Use Case | Description |
|----------|-------------|
| **Edge Device Deployment** | Run locally on devices with limited GPU/RAM |
| **Customer Service Chatbot** | Automate order lookup, ticket creation, FAQ with tool calls |
| **Voice Agent / Call Center** | Real-time voice-to-action for phone support systems |
| **IoT/Smart Home** | Control devices via function calling on edge hardware |
| **Mobile AI Assistant** | On-device tool execution without cloud dependency |
| **Cost-Efficient API Gateway** | Route requests to appropriate backend services |

### 💼 Customer Service Examples

```python
# Example: Customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User: "Đơn hàng #12345 của tôi ở đâu rồi?" (Where is my order #12345?)
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User: "Tôi muốn đổi trả sản phẩm" (I want to return a product)
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>
```
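On the application side, the XML-style output can be parsed and dispatched with a few lines of Python. This sketch reuses the `response` string produced in the Transformers example above; the regex and the `BACKENDS` dispatch table are illustrative, not part of the model's API.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(response: str) -> list:
    """Extract {"name", "arguments"} dicts from <tool_call> blocks;
    returns [] when the model refuses (no block emitted)."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed blocks rather than crash
    return calls

# Hypothetical backend handlers keyed by tool name
BACKENDS = {
    "lookup_order": lambda order_id: f"Order {order_id}: shipped",
    "create_ticket": lambda issue, priority: f"Ticket created ({issue}, {priority})",
}

for call in parse_tool_calls(response):
    handler = BACKENDS.get(call["name"])
    if handler:
        print(handler(**call["arguments"]))
```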
### ⚡ Why a Small Model?

| Benefit | Description |
|---------|-------------|
| **Low Latency** | ~50 ms inference on a consumer GPU |
| **Low Cost** | ~8× cheaper to deploy than a 14B model |
| **Privacy** | Runs entirely on-premise; no data leaves the device |
| **Offline Capable** | Works without an internet connection |

### 🧠 Reduced Catastrophic Forgetting

This model uses **RLVR (Reinforcement Learning with Verifiable Rewards)** instead of traditional SFT, which helps reduce capability loss:

- **Less forgetting than SFT**: RLVR fine-tunes through reward signals rather than directly overwriting weights
- **100% chatable score**: The model maintains normal conversation ability on the BFCL benchmark
- **Multilingual preserved**: English and Vietnamese capabilities remain functional
- **Lower risk**: Compared to SFT, RLVR typically causes less regression on non-target tasks

## 🔬 Technical Details

| Attribute | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Selected Checkpoints | Step 100 (Stage 1) → step 1100 (Stage 2) |
| Learning Rate | 5e-7 (Stage 1) → 2e-7 (Stage 2) |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32,768 tokens |
| Tool Format | XML-style (`<tool_call>...</tool_call>`) |

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{qwen3-fc,
  title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
  author={ContextboxAI},
  year={2024},
  howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}
```

## 🙏 Acknowledgments

- [Qwen Team](https://github.com/QwenLM/Qwen3) for the excellent base model
- [Jan-nano](https://arxiv.org/pdf/2506.22760) for training methodology inspiration
- [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) for the benchmark
- [xLAM (Salesforce)](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) for function calling data
- [ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) for multi-turn tool usage data
- [Toucan-1.5M (Agent-Ark)](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) for irrelevant/negative samples
- [TRL](https://github.com/huggingface/trl) for the GRPO implementation

## 📄 License

Apache 2.0

---

**Model Card Contact**: ContextboxAI