feat: support pythonic tool call and index in tool call streaming (#5725)
@@ -503,6 +503,173 @@
"llm.shutdown()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pythonic Tool Call Format (Llama-3.2 / Llama-3.3 / Llama-4)\n",
"\n",
"Some Llama models (such as Llama-3.2-1B, Llama-3.2-3B, Llama-3.3-70B, and Llama-4) support a \"pythonic\" tool call format, where the model outputs function calls as Python code, e.g.:\n",
"\n",
"```python\n",
"[get_current_weather(city=\"San Francisco\", state=\"CA\", unit=\"celsius\")]\n",
"```\n",
"\n",
"- The output is a Python list of function calls, with arguments as Python literals (not JSON).\n",
"- Multiple tool calls can be returned in the same list:\n",
"```python\n",
"[get_current_weather(city=\"San Francisco\", state=\"CA\", unit=\"celsius\"),\n",
" get_current_weather(city=\"New York\", state=\"NY\", unit=\"fahrenheit\")]\n",
"```\n",
"\n",
"For more information, refer to Meta’s documentation on [Zero shot function calling](https://github.com/meta-llama/llama-models/blob/main/models/llama4/prompt_format.md#zero-shot-function-calling---system-message).\n",
"\n",
"### How to enable\n",
"- Launch the server with `--tool-call-parser pythonic`\n",
"- You may also specify `--chat-template` with the improved template for the model (e.g., `--chat-template=examples/chat_template/tool_chat_template_llama4_pythonic.jinja`).\n",
"This is recommended because the model expects a special prompt format to reliably produce valid pythonic tool call outputs. The template ensures that the prompt structure (e.g., special tokens, message boundaries like `<|eom|>`, and function call delimiters) matches what the model was trained or fine-tuned on. If you do not use the correct chat template, tool calling may fail or produce inconsistent results.\n",
"\n",
"#### Forcing Pythonic Tool Call Output Without a Chat Template\n",
"If you don't want to specify a chat template, you must give the model extremely explicit instructions in your messages to enforce pythonic output. For example, for `Llama-3.2-1B-Instruct`, you need:"
]
},
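Since the pythonic format is plain Python syntax, it can be inspected safely with Python's `ast` module rather than `eval`. The sketch below is illustrative only: the helper name `parse_pythonic_tool_calls` is ours, not part of SGLang (the server-side parser handles this automatically when `--tool-call-parser pythonic` is set).

```python
import ast


def parse_pythonic_tool_calls(text):
    """Parse a pythonic tool-call string such as
    '[get_weather(location="Paris")]' into (name, kwargs) pairs
    without executing any model output."""
    tree = ast.parse(text.strip(), mode="eval")
    calls = []
    for node in tree.body.elts:  # the top-level Python list of Call nodes
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


calls = parse_pythonic_tool_calls(
    '[get_current_weather(city="San Francisco", state="CA", unit="celsius"), '
    'get_current_weather(city="New York", state="NY", unit="fahrenheit")]'
)
print(calls)
```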
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"\n",
"if is_in_ci():\n",
"    from patch import launch_server_cmd\n",
"else:\n",
"    from sglang.utils import launch_server_cmd\n",
"\n",
"server_process, port = launch_server_cmd(\n",
"    \"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1\"  # llama-3.2-1b-instruct\n",
")\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"\n",
"tools = [\n",
"    {\n",
"        \"type\": \"function\",\n",
"        \"function\": {\n",
"            \"name\": \"get_weather\",\n",
"            \"description\": \"Get the current weather for a given location.\",\n",
"            \"parameters\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"location\": {\n",
"                        \"type\": \"string\",\n",
"                        \"description\": \"The name of the city or location.\",\n",
"                    }\n",
"                },\n",
"                \"required\": [\"location\"],\n",
"            },\n",
"        },\n",
"    },\n",
"    {\n",
"        \"type\": \"function\",\n",
"        \"function\": {\n",
"            \"name\": \"get_tourist_attractions\",\n",
"            \"description\": \"Get a list of top tourist attractions for a given city.\",\n",
"            \"parameters\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"city\": {\n",
"                        \"type\": \"string\",\n",
"                        \"description\": \"The name of the city to find attractions for.\",\n",
"                    }\n",
"                },\n",
"                \"required\": [\"city\"],\n",
"            },\n",
"        },\n",
"    },\n",
"]\n",
"\n",
"\n",
"def get_messages():\n",
"    return [\n",
"        {\n",
"            \"role\": \"system\",\n",
"            \"content\": (\n",
"                \"You are a travel assistant. \"\n",
"                \"When asked to call functions, ALWAYS respond ONLY with a python list of function calls, \"\n",
"                \"using this format: [func_name1(param1=value1, param2=value2), func_name2(param=value)]. \"\n",
"                \"Do NOT use JSON, do NOT use variables, do NOT use any other format. \"\n",
"                \"Here is an example:\\n\"\n",
"                '[get_weather(location=\"Paris\"), get_tourist_attractions(city=\"Paris\")]'\n",
"            ),\n",
"        },\n",
"        {\n",
"            \"role\": \"user\",\n",
"            \"content\": (\n",
"                \"I'm planning a trip to Tokyo next week. What's the weather like and what are some top tourist attractions? \"\n",
"                \"Propose parallel tool calls at once, using the python list of function calls format as shown above.\"\n",
"            ),\n",
"        },\n",
"    ]\n",
"\n",
"\n",
"messages = get_messages()\n",
"\n",
"client = openai.Client(base_url=f\"http://localhost:{port}/v1\", api_key=\"xxxxxx\")\n",
"model_name = client.models.list().data[0].id\n",
"\n",
"\n",
"response_non_stream = client.chat.completions.create(\n",
"    model=model_name,\n",
"    messages=messages,\n",
"    temperature=0.8,\n",
"    top_p=0.8,\n",
"    stream=False,  # Non-streaming\n",
"    tools=tools,\n",
")\n",
"print_highlight(\"Non-stream response:\")\n",
"print(response_non_stream)\n",
"\n",
"response_stream = client.chat.completions.create(\n",
"    model=model_name,\n",
"    messages=messages,\n",
"    temperature=0.8,\n",
"    top_p=0.8,\n",
"    stream=True,\n",
"    tools=tools,\n",
")\n",
"texts = \"\"\n",
"tool_calls = []\n",
"\n",
"for chunk in response_stream:\n",
"    if chunk.choices[0].delta.content:\n",
"        texts += chunk.choices[0].delta.content\n",
"    if chunk.choices[0].delta.tool_calls:\n",
"        tool_calls.append(chunk.choices[0].delta.tool_calls[0])\n",
"\n",
"print_highlight(\"Streaming Response:\")\n",
"print_highlight(\"==== Text ====\")\n",
"print(texts)\n",
"\n",
"print_highlight(\"==== Tool Call ====\")\n",
"for tool_call in tool_calls:\n",
"    print(tool_call)\n",
"\n",
"terminate_process(server_process)"
]
},
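The streaming loop above only collects raw delta fragments. Each tool-call delta carries an `index` field identifying which call it belongs to, so fragments with the same index can be merged to reconstruct complete calls. A minimal sketch, using plain dicts as stand-ins for the OpenAI `delta.tool_calls` objects (the field names `index`, `function.name`, and `function.arguments` follow the OpenAI chat-completions schema; the helper itself is illustrative, not part of SGLang):

```python
def merge_tool_call_deltas(deltas):
    """Merge streamed tool-call fragments into complete calls.
    Each fragment carries an `index` identifying which call it
    extends; `arguments` strings are concatenated per index."""
    merged = {}
    for d in deltas:
        entry = merged.setdefault(d["index"], {"name": "", "arguments": ""})
        fn = d.get("function") or {}
        if fn.get("name"):
            entry["name"] = fn["name"]
        if fn.get("arguments"):
            entry["arguments"] += fn["arguments"]
    return [merged[i] for i in sorted(merged)]


# Fragments as they might arrive over the stream (two parallel calls)
deltas = [
    {"index": 0, "function": {"name": "get_weather", "arguments": ""}},
    {"index": 1, "function": {"name": "get_tourist_attractions", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"location": "Tokyo"}'}},
    {"index": 1, "function": {"arguments": '{"city": "Tokyo"}'}},
]
complete_calls = merge_tool_call_deltas(deltas)
print(complete_calls)
```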
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note:**\n",
"> The model may still default to JSON if it was heavily fine-tuned on that format. If you are not using a chat template, prompt engineering (including explicit examples) is the main way to increase the chance of pythonic output."
]
},
{
"cell_type": "markdown",
"metadata": {},