---
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
datasets:
- quotientai/limbic-eval-tool-use-mcp
---

# Limbic-Tool-Use MCP Function Call Evaluator

This model is a fine-tuned version of Qwen2.5-0.5B-Instruct designed to evaluate function calls made against Model Context Protocol (MCP) tools. It classifies a function call as correct, as using the wrong tool, as having incorrect parameter names, or as having incorrect parameter values.

## Model Details

- **Base Model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Task**: Function Call Evaluation for MCP (Model Context Protocol)
- **Training Data**: MCP Server Tools data from public MCP servers, with augmentation / synthetic data generation
- **Model Size**: ~40MB (LoRA adapters only)
- **Context Length**: 32,768 tokens

# Model Usage

## Model Prompts

The prompt for the model takes two inputs:

- `available_tools` - a list of the tool schemas
- `message_history` - the user request and the model's tool call response, as a list of JSON objects

```
EVALUATOR_PROMPT = """\
# TOOL CALL EVALUATION RUBRIC

## EVALUATION CRITERIA

### 1. TOOL SELECTION
- [ ] Function name exists in available tools
- [ ] Function purpose matches user intent

### 2. PARAMETER STRUCTURE
- [ ] All required and relevant parameters are present
- [ ] No hallucinated parameter names
- [ ] Parameter names match tool schema exactly

### 3. PARAMETER VALUES
- [ ] Data types match expected types
- [ ] Values align with user request
- [ ] No fabricated or incorrect values

## CLASSIFICATION RULES
- All criteria passed → `correct`
- Failed criteria 1 → `incorrect_tool`
- Failed criteria 2 → `incorrect_parameter_names`
- Failed criteria 3 → `incorrect_parameter_values`

---

### AVAILABLE TOOLS
{available_tools}

---

### MESSAGE HISTORY
{message_history}

---

## OUTPUT REQUIREMENT
{{
  "score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
  "reason": < [if incorrect, provide a brief list of reasons] >
}}

### EVALUATION:
"""
```

```
SYSTEM_PROMPT = "You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score."
```

### Example Inputs

```
available_tools = [
    {
        "name": "google-play-developer",
        "description": "Get apps by a developer on Google Play",
        "input_schema": {
            "type": "object",
            "properties": {
                "devId": {"type": "string", "description": "Developer ID"},
                "num": {"type": "number", "default": 60, "description": "Number of results"},
                "lang": {"type": "string", "default": "en", "description": "Language code"},
                "country": {"type": "string", "default": "us", "description": "Country code"}
            },
            "required": ["devId"]
        }
    }
]

message_history = [
    {"role": "user", "content": "I'm looking to evaluate the performance of all the apps developed by 'Example Developer' on the Google Play Store. Could you provide me with a list of their recent applications, specifically in English and focused on the US market? Please limit the results to 50 apps for a quicker review."},
    # The tool call is nested under "function" so that the dict literal is valid Python
    {"role": "assistant", "content": {"function": {"name": "google-play-developer", "arguments": {"devId": "com.example.developer", "num": 50, "lang": "en", "country": "us"}}}}
]
```

## Output Format

The model outputs evaluations in JSON format:

```json
{
  "score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
  "reason": ["reasons for failure if incorrect"]
}
```

#### Score Categories

- **correct**: Function call matches available tools and parameters exactly
- **incorrect_tool**: Function name doesn't exist in the available tools, or the tool's purpose doesn't match the user's intent
- **incorrect_parameter_names**: Function exists but parameter names are missing, hallucinated, or don't match the tool schema
- **incorrect_parameter_values**: Function and parameters exist but values have the wrong type, are fabricated, or don't align with the user request

## Load the Model

```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
```

## Generate a Prediction

To make a prediction, fill in the evaluator prompt with your inputs and wrap it in the chat format the model expects.

```
# Fill in the prompt. Passing the Python lists directly relies on str.format's
# default string conversion; the exact serialization is an assumption here.
prompt = EVALUATOR_PROMPT.format(
    available_tools=available_tools,
    message_history=message_history,
)

chat_template = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": prompt}
]

# Apply the chat template
text = tokenizer.apply_chat_template(chat_template, tokenize=False, add_generation_prompt=True)

# Tokenize with truncation and place the tensors on the same device as the model
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

# Generate your prediction
result = model.generate(**inputs, max_new_tokens=128, use_cache=True)
```

## Citation

```bibtex
@model{limbic-tool-use-0.5B-32K,
  title={Limbic Tool Use Evaluator},
  author={QuotientAI},
  year={2025},
  url={https://huggingface.co/quotientai/limbic-tool-use-0.5B-32K}
}
```
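
## Parse the Output

The JSON output format described earlier still has to be pulled out of the decoded generation, which may carry extra text or special tokens around it. The following is a minimal sketch of one way to do that. `parse_evaluation` and `VALID_SCORES` are hypothetical names introduced for illustration and are not part of the model's published interface.

```python
import json

# The four labels listed under "Score Categories".
VALID_SCORES = {
    "correct",
    "incorrect_tool",
    "incorrect_parameter_names",
    "incorrect_parameter_values",
}


def parse_evaluation(raw: str) -> dict:
    """Return the first valid evaluation object found in decoded model text.

    Hypothetical helper: scans for '{', tries to parse a JSON object from
    that position, and accepts it only if "score" is a known label.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and tolerates trailing text.
            obj, _ = decoder.raw_decode(raw[start:])
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and obj.get("score") in VALID_SCORES:
            reason = obj.get("reason", [])
            # Normalize a bare string (or null) reason into a list.
            if isinstance(reason, str):
                reason = [reason] if reason else []
            elif reason is None:
                reason = []
            return {"score": obj["score"], "reason": reason}
        start = raw.find("{", start + 1)
    raise ValueError("no valid evaluation object found in model output")


# Example with a hand-written string standing in for decoded model output:
example = 'Evaluation: {"score": "incorrect_tool", "reason": ["unknown function"]}'
evaluation = parse_evaluation(example)
```

With the generation code from "Generate a Prediction", the input to this helper would typically be the newly generated tokens only, for example `tokenizer.decode(result[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)`. Rejecting unknown labels instead of passing them through keeps downstream metrics from silently counting malformed generations.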