---
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
datasets:
- quotientai/limbic-eval-tool-use-mcp
---

# Limbic-Tool-Use MCP Function Call Evaluator

This model is a fine-tuned version of Qwen2.5-0.5B-Instruct specifically designed for evaluating function calls in the context of Model Context Protocol (MCP) tools. It can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

## Model Details

- **Base Model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Task**: Function Call Evaluation for MCP (Model Context Protocol)
- **Training Data**: MCP Server Tools data from public MCP servers, with augmentation / synthetic data generation
- **Model Size**: ~40 MB (LoRA adapters only)
- **Context Length**: 32,768 tokens

# Model Usage

## Model Prompts

The prompt for the model takes two inputs:
- `available_tools` - a list of the tool schemas
- `message_history` - the user request and the model's tool call response, as a list of JSON objects

```python
EVALUATOR_PROMPT = """\
# TOOL CALL EVALUATION RUBRIC

## EVALUATION CRITERIA

### 1. TOOL SELECTION
- [ ] Function name exists in available tools
- [ ] Function purpose matches user intent

### 2. PARAMETER STRUCTURE
- [ ] All required and relevant parameters are present
- [ ] No hallucinated parameter names
- [ ] Parameter names match tool schema exactly

### 3. PARAMETER VALUES
- [ ] Data types match expected types
- [ ] Values align with user request
- [ ] No fabricated or incorrect values

## CLASSIFICATION RULES
- All criteria passed → `correct`
- Failed criteria 1 → `incorrect_tool`
- Failed criteria 2 → `incorrect_parameter_names`
- Failed criteria 3 → `incorrect_parameter_values`

---
### AVAILABLE TOOLS
{available_tools}

---
### MESSAGE HISTORY
{message_history}

---
## OUTPUT REQUIREMENT
{{
"score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
"reason": < [if incorrect, provide a brief list of reasons] >
}}

### EVALUATION:
"""
```

```python
SYSTEM_PROMPT = "You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score."
```

### Example Inputs
```python
available_tools = [
    {
        "name": "google-play-developer",
        "description": "Get apps by a developer on Google Play",
        "input_schema": {
            "type": "object",
            "properties": {
                "devId": {"type": "string", "description": "Developer ID"},
                "num": {"type": "number", "default": 60, "description": "Number of results"},
                "lang": {"type": "string", "default": "en", "description": "Language code"},
                "country": {"type": "string", "default": "us", "description": "Country code"}
            },
            "required": ["devId"]
        }
    }
]

message_history = [
    {"role": "user", "content": "I'm looking to evaluate the performance of all the apps developed by 'Example Developer' on the Google Play Store. Could you provide me with a list of their recent applications, specifically in English and focused on the US market? Please limit the results to 50 apps for a quicker review."},
    {"role": "assistant", "content": {"function": {"name": "google-play-developer", "arguments": {"devId": "com.example.developer", "num": 50, "lang": "en", "country": "us"}}}}
]
```
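
The `{available_tools}` and `{message_history}` placeholders, together with the escaped `{{ }}` braces, suggest that the template is meant to be filled with `str.format`. The sketch below illustrates that step under stated assumptions: the helper name `build_user_prompt` is hypothetical, the template is a shortened stand-in so the snippet runs on its own, and serializing the inputs with `json.dumps` is a guess at the expected text format rather than the authors' confirmed preprocessing.

```python
import json

# Shortened stand-in for EVALUATOR_PROMPT so this sketch runs on its own;
# in practice, pass the full template instead.
TEMPLATE_STAND_IN = """\
### AVAILABLE TOOLS
{available_tools}

### MESSAGE HISTORY
{message_history}

## OUTPUT REQUIREMENT
{{
"score": < ... >
}}
"""


def build_user_prompt(template, available_tools, message_history):
    """Fill the template placeholders (hypothetical helper, not part of the model repo)."""
    return template.format(
        # Assumption: tools and messages are serialized as JSON text.
        available_tools=json.dumps(available_tools, indent=2),
        message_history=json.dumps(message_history, indent=2),
    )


tools = [{"name": "google-play-developer", "input_schema": {"required": ["devId"]}}]
history = [{"role": "user", "content": "List apps by 'Example Developer'."}]

user_prompt = build_user_prompt(TEMPLATE_STAND_IN, tools, history)
print(user_prompt)
```

The resulting string is what would go into the `user` message of the chat; the doubled braces in the template come out as single braces after formatting.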

## Output Format
The model outputs evaluations in JSON format:

```json
{
  "score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
  "reason": ["reasons for failure if incorrect"]
}
```
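
A decoded generation is plain text, so it is worth validating before it is used downstream. The sketch below shows one defensive way to do that; the helper name `parse_evaluation` and the choice to raise `ValueError` on anything unexpected are illustrative assumptions, not part of the model's tooling.

```python
import json

VALID_SCORES = {
    "correct",
    "incorrect_tool",
    "incorrect_parameter_names",
    "incorrect_parameter_values",
}


def parse_evaluation(raw):
    """Extract and validate the evaluation object from decoded model text.

    Hypothetical helper: raises ValueError when no valid evaluation is found.
    """
    # Take the outermost {...} span in case extra text surrounds the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    score = data.get("score")
    if not isinstance(score, str) or score not in VALID_SCORES:
        raise ValueError(f"unexpected score: {score!r}")

    # Normalize "reason" to a list; it may be absent for a correct call.
    reason = data.get("reason") or []
    if isinstance(reason, str):
        reason = [reason]
    return {"score": score, "reason": list(reason)}


example = 'Sure. {"score": "incorrect_parameter_names", "reason": ["unknown parameter devID"]}'
print(parse_evaluation(example))
```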

### Score Categories

- **correct**: Function call matches the available tools and parameters exactly
- **incorrect_tool**: Function name doesn't exist in the available tools, or the function's purpose doesn't match the user's intent
- **incorrect_parameter_names**: Function exists, but parameter names are hallucinated, missing, or don't match the tool schema
- **incorrect_parameter_values**: Function and parameter names are valid, but values have the wrong type or don't align with the user request
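
The `incorrect_tool` and `incorrect_parameter_names` categories rest partly on structural checks that can be written directly in code, which may help when interpreting the model's verdicts. The sketch below only illustrates those definitions, assuming the tool schema shape shown in Example Inputs. It is not how the model reaches its decision, the function name `structural_precheck` is hypothetical, and it does not attempt the intent-matching or value checks that require judgment.

```python
def structural_precheck(available_tools, function_name, arguments):
    """Apply the name-based parts of the rubric (hypothetical helper).

    Returns "incorrect_tool", "incorrect_parameter_names", or None when
    neither structural check fails (values and intent are not checked here).
    """
    tools_by_name = {tool["name"]: tool for tool in available_tools}
    tool = tools_by_name.get(function_name)
    if tool is None:
        return "incorrect_tool"

    schema = tool.get("input_schema", {})
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    supplied = set(arguments)
    # Unknown names or missing required parameters both fail criterion 2.
    if supplied - allowed or required - supplied:
        return "incorrect_parameter_names"
    return None


tools = [{
    "name": "google-play-developer",
    "input_schema": {
        "type": "object",
        "properties": {"devId": {"type": "string"}, "num": {"type": "number"}},
        "required": ["devId"],
    },
}]

print(structural_precheck(tools, "google-play-search", {"devId": "x"}))
print(structural_precheck(tools, "google-play-developer", {"developer": "x"}))
print(structural_precheck(tools, "google-play-developer", {"devId": "x", "num": 50}))
```

A `None` result means only that the call is structurally plausible; whether the values fit the user's request is still left to the evaluator model.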

## Load the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
```

## Generate a Prediction
To make a prediction, wrap the formatted user prompt in a list of chat messages and apply the tokenizer's chat template.
```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "<your-formatted-user-prompt>"}
]
# Apply the chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize with truncation, placing the tensors on the same device as the model
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

# Generate your prediction
result = model.generate(**inputs, max_new_tokens=128, use_cache=True)

# Decode only the newly generated tokens (everything after the prompt)
prompt_length = inputs["input_ids"].shape[1]
prediction = tokenizer.decode(result[0][prompt_length:], skip_special_tokens=True)
print(prediction)
```

## Citation
```bibtex
@model{limbic-tool-use-0.5B-32K,
  title={Limbic Tool Use Evaluator},
  author={QuotientAI},
  year={2025},
  url={https://huggingface.co/quotientai/limbic-tool-use-0.5B-32K}
}
```