--- license: apache-2.0 language: - en library_name: transformers tags: - turn-taking - voice-ai - conversational-ai - dialogue - qwen2 - onnx base_model: Qwen/Qwen2.5-0.5B-Instruct pipeline_tag: text-generation --- # Semantic Turn-Taking Model A fine-tuned [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model that predicts turn-taking actions in conversations. Given a conversation context, the model predicts what action a voice AI agent should take next. Unlike acoustic-based approaches (VAD, silence detection), this model uses the **semantic content** of the conversation to make turn-taking decisions. ## Action Classes The model predicts one of 4 actions: | Action | Token | Description | |--------|-------|-------------| | `start_speaking` | `<\|start_speaking\|>` | User finished their turn, agent should respond | | `continue_listening` | `<\|continue_listening\|>` | User is mid-utterance, keep listening | | `start_listening` | `<\|start_listening\|>` | User interrupted the agent, stop talking | | `continue_speaking` | `<\|continue_speaking\|>` | User gave a backchannel, agent keeps talking | ## Usage ### PyTorch ```python import torch from transformers import AutoTokenizer, AutoModelForCausalLM model_name = "anyreach-ai/semantic-turn-taking" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda() model.eval() # Format conversation as ChatML with <|predict|> trigger conversation = """<|im_start|>user I need help with my bill<|im_end|> <|im_start|>assistant Sure I can help with that what seems to be the issue<|im_end|> <|im_start|>user I was charged twice for the same order<|im_end|> <|predict|>""" inputs = tokenizer(conversation, return_tensors="pt").to("cuda") with torch.no_grad(): logits = model(**inputs).logits[:, -1, :] # Get action probabilities action_tokens = { "start_speaking": tokenizer.convert_tokens_to_ids("<|start_speaking|>"), "continue_listening": tokenizer.convert_tokens_to_ids("<|continue_listening|>"), "start_listening": tokenizer.convert_tokens_to_ids("<|start_listening|>"), "continue_speaking": tokenizer.convert_tokens_to_ids("<|continue_speaking|>"), } action_logits = {name: logits[0, tid].item() for name, tid in action_tokens.items()} probs = torch.softmax(torch.tensor(list(action_logits.values())), dim=0) for (name, _), p in zip(action_logits.items(), probs): print(f" {name}: {p:.4f}") # → start_speaking: 0.95+ (user is done, agent should respond) ``` ### ONNX (CPU) ```python import numpy as np import onnxruntime as ort from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("anyreach-ai/semantic-turn-taking") sess_options = ort.SessionOptions() sess_options.intra_op_num_threads = 4 session = ort.InferenceSession( "onnx/model_q8.onnx", # download from this repo providers=["CPUExecutionProvider"], sess_options=sess_options, ) # Tokenize conversation = "..." # ChatML format as above inputs = tokenizer(conversation, return_tensors="np") input_ids = inputs["input_ids"].astype("int64") seq_len = input_ids.shape[1] # Build feed (empty KV cache for single forward pass) feed = { "input_ids": input_ids, "attention_mask": inputs["attention_mask"].astype("int64"), "position_ids": np.arange(seq_len, dtype="int64").reshape(1, -1), } for i in range(24): feed[f"past_key_values.{i}.key"] = np.zeros((1, 2, 0, 64), dtype="float32") feed[f"past_key_values.{i}.value"] = np.zeros((1, 2, 0, 64), dtype="float32") # Run inference logits = session.run(None, feed)[0] # [1, seq_len, vocab_size] last_logits = logits[0, -1, :] # Extract action probabilities ACTION_IDS = [151666, 151665, 151667, 151668] # SS, CL, SLi, CS action_logits = last_logits[ACTION_IDS] probs = np.exp(action_logits) / np.sum(np.exp(action_logits)) ``` ## Benchmark Results Evaluated on [anyreach-ai/semantic-turn-taking-benchmark](https://huggingface.co/datasets/anyreach-ai/semantic-turn-taking-benchmark). ### Binary (EOU vs Not-EOU) Only `start_speaking` and `continue_listening` examples. Predictions mapped: SS/CS → EOU, CL/SLi → Not-EOU. | Subset | N | Accuracy | F1 (macro) | |--------|--:|--:|--:| | TEN | 428 | 91.82% | 91.80% | | SwDA | 2,688 | 65.96% | 51.46% | | Synthetic | 36 | 86.11% | 85.57% | ### Multi-class | Subset | N | Classes | Accuracy | F1 (macro) | |--------|--:|--------:|--:|--:| | TEN | 428 | 2 | 91.82% | 91.80% | | SwDA | 3,523 | 3 | 68.98% | 46.92% | | Synthetic | 60 | 4 | 76.67% | 72.07% | ## Latency Measured on single examples, CPU (4 threads) and GPU (NVIDIA T4). | Format | Size | Short (8 tok) | Medium (28 tok) | Long (54 tok) | |--------|-----:|--:|--:|--:| | PyTorch GPU (fp16) | 942 MB | 26 ms | 30 ms | 34 ms | | PyTorch CPU (fp32) | 942 MB | 165 ms | 247 ms | 289 ms | | ONNX CPU (q8) | 473 MB | 128 ms | 151 ms | 191 ms | ## Model Details - **Base model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (494M parameters) - **Training**: Full fine-tuning on ~154K synthetic conversation examples - **Input format**: Qwen ChatML with `<|predict|>` trigger token - **Max sequence length**: 1024 tokens (left truncation) - **Special tokens**: 5 added (`<|predict|>`, 4 action tokens) ## Files | File | Description | |------|-------------| | `model.safetensors` | PyTorch model weights (fp32) | | `onnx/model_q8.onnx` | ONNX INT8 quantized (dynamic quantization) | | `config.json` | Model configuration | | `tokenizer.json` | Tokenizer | ## Citation ```bibtex @misc{semantic-turn-taking-2026, title={Semantic Turn-Taking Model}, author={Shangeth Rajaa}, year={2026}, publisher={Hugging Face}, url={https://huggingface.co/anyreach-ai/semantic-turn-taking} } ``` ## Authors - [**Shangeth Rajaa**](https://github.com/shangeth) ## License Apache 2.0