A fine-tuned FunctionGemma 270M model that converts natural language into structured robot action and emotion function calls. It is designed for real-time inference on edge devices such as the NVIDIA Jetson AGX Thor.
Overview
This model takes a user's voice or text input and outputs two function calls:
robot_action — a physical action for the robot to perform
show_emotion — an emotion to display on the robot's avatar screen (Rive animations)
General conversation defaults to stand_still with a contextually appropriate emotion.
Example
Input: "Can you shake hands with me?"
Output: robot_action(action_name="shake_hand") + show_emotion(emotion="happy")
Input: "What is that?"
Output: robot_action(action_name="stand_still") + show_emotion(emotion="confused")
Input: "I feel sad"
Output: robot_action(action_name="stand_still") + show_emotion(emotion="sad")
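A consumer of this output has to split it into the two calls before dispatching them. The following is a minimal sketch, assuming the output is available as text in exactly the `name(arg="value") + name(arg="value")` form shown in the examples above; the raw format the model emits may differ. `RobotCommand` and `parse_output` are hypothetical names used for illustration, not part of the model's tooling.

```python
import re
from typing import NamedTuple


class RobotCommand(NamedTuple):
    """Hypothetical container for the two values carried by the model output."""
    action_name: str
    emotion: str


# Matches calls of the form name(arg="value"); captures the function name and the value.
_CALL_RE = re.compile(r'(\w+)\(\s*\w+\s*=\s*"([^"]*)"\s*\)')


def parse_output(text: str) -> RobotCommand:
    """Parse the display form used in the examples into a RobotCommand.

    Raises ValueError if either call is missing, rather than guessing a default.
    """
    calls = dict(_CALL_RE.findall(text))
    missing = [name for name in ("robot_action", "show_emotion") if name not in calls]
    if missing:
        raise ValueError(f"missing function call(s): {', '.join(missing)}")
    return RobotCommand(action_name=calls["robot_action"], emotion=calls["show_emotion"])


if __name__ == "__main__":
    sample = 'robot_action(action_name="shake_hand") + show_emotion(emotion="happy")'
    print(parse_output(sample))
```

Raising on a missing call keeps the parser from silently inventing an action or emotion; a caller that prefers a fallback can catch the error and choose one explicitly.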
Supported Actions
| Action | Description |
|---|---|
| shake_hand | Handshake gesture |
| face_wave | Wave hello |
| hands_up | Raise both hands |
| stand_still | Stay idle (default for general conversation) |
| show_hand | Show open hand |
Supported Emotions
| Emotion | Animation |
|---|---|
| happy | Happy.riv |
| sad | Sad.riv |
| excited | Excited.riv |
| confused | Confused.riv |
| curious | Curious.riv |
| think | Think.riv |
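Because both the action and emotion vocabularies are closed sets, a caller can validate the model's output before sending anything to the robot or the avatar screen. The sketch below restates the two lists above as Python constants; `SUPPORTED_ACTIONS`, `EMOTION_ANIMATIONS`, and `resolve` are hypothetical names, and how the `.riv` files are actually loaded is outside its scope.

```python
from typing import Tuple

# Action names from the Supported Actions list.
SUPPORTED_ACTIONS = frozenset(
    {"shake_hand", "face_wave", "hands_up", "stand_still", "show_hand"}
)

# Emotion name -> Rive animation file, from the Supported Emotions list.
EMOTION_ANIMATIONS = {
    "happy": "Happy.riv",
    "sad": "Sad.riv",
    "excited": "Excited.riv",
    "confused": "Confused.riv",
    "curious": "Curious.riv",
    "think": "Think.riv",
}


def resolve(action_name: str, emotion: str) -> Tuple[str, str]:
    """Validate a (action, emotion) pair and return (action, animation file).

    Raises ValueError for any value outside the supported sets.
    """
    if action_name not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {action_name!r}")
    if emotion not in EMOTION_ANIMATIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    return action_name, EMOTION_ANIMATIONS[emotion]


if __name__ == "__main__":
    print(resolve("stand_still", "confused"))
```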
Performance on NVIDIA Jetson AGX Thor
Benchmarked with constrained decoding (2 forward passes instead of 33 autoregressive steps):