---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
language:
  - ms
  - en
  - zh
  - ta
tags:
  - turn-detection
  - call-center
  - code-switching
  - multilingual
pipeline_tag: text-generation
---

# Turn Detector Qwen3-1.7B

Fine-tuned [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for **real-time turn-end detection** in multilingual call center conversations.

The model predicts `P(<|im_end|>)`, the probability that a speaker has finished their turn. It is designed for low-latency voice agent pipelines (e.g. LiveKit), where this probability is used to decide when the agent should respond.

## How It Works

Given the conversation so far, the model outputs the probability of `<|im_end|>` as the next token:

- **P(im_end) > 0.5** → speaker is done talking (turn complete)
- **P(im_end) < 0.5** → speaker is still talking (turn incomplete)
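The decision rule above can be sketched as a small helper. The name `is_turn_complete` is hypothetical, and treating a probability exactly equal to the threshold as "complete" is an assumption, since only the strict cases are defined here.

```python
def is_turn_complete(p_im_end: float, threshold: float = 0.5) -> bool:
    """Return True when P(<|im_end|>) reaches the threshold.

    Assumption: a probability exactly equal to the threshold counts as
    complete; only the strict > and < cases are documented.
    """
    if not 0.0 <= p_im_end <= 1.0:
        raise ValueError("p_im_end must be a probability in [0, 1]")
    return p_im_end >= threshold


print(is_turn_complete(0.97))  # True: the turn is likely complete
print(is_turn_complete(0.02))  # False: the speaker is probably still talking
```

Keeping the threshold as a parameter makes it easy to trade recall for precision without touching the model call.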

## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Scicom-intl/Malaysian-Turn-Detector-Qwen3-1.7B"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device).eval()

# Token ID whose next-token probability is used as the turn-end score.
IM_END_ID = tokenizer.convert_tokens_to_ids("<|im_end|>")

def get_turn_end_prob(text):
    """Return P(<|im_end|>) as the next token after `text`."""
    # Drop a trailing <|im_end|> so the model predicts it instead of reading it.
    if text.endswith("<|im_end|>"):
        text = text[:-len("<|im_end|>")]
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the vocabulary at the last position, computed in float32.
    prob = F.softmax(logits[0, -1].float(), dim=-1)[IM_END_ID].item()
    return prob
```
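The expected input format is not spelled out here. As a sketch only, the helper below assumes the model takes Qwen-style ChatML text (`<|im_start|>role\n...<|im_end|>`) and leaves the last turn open so that `get_turn_end_prob` can score it. The function name `build_chatml_prefix` and the role names are hypothetical.

```python
from typing import List, Tuple


def build_chatml_prefix(turns: List[Tuple[str, str]]) -> str:
    """Render (role, content) turns as ChatML, leaving the final turn open.

    Assumption: the model was trained on Qwen-style ChatML conversations.
    Earlier turns are closed with <|im_end|>; the last one is not, so the
    model can be asked how likely <|im_end|> is as the next token.
    """
    if not turns:
        raise ValueError("at least one turn is required")
    parts = []
    for i, (role, content) in enumerate(turns):
        segment = f"<|im_start|>{role}\n{content}"
        if i < len(turns) - 1:
            segment += "<|im_end|>\n"
        parts.append(segment)
    return "".join(parts)


conversation = [
    ("user", "Hello"),
    ("assistant", "Hi, how can I help you?"),
    ("user", "I want to check my"),  # partial transcript of the current turn
]
text = build_chatml_prefix(conversation)
print(text)
# prob = get_turn_end_prob(text)  # function defined in the Usage block
```

In a streaming setup the last tuple would be replaced with the latest partial transcript each time the speech recognizer emits an update.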

## Eval Results

**Test set:** 1,200 samples (600 positive + 600 negative) covering 12 language pairs, with 50 conversations per language pair.

### Overall (threshold = 0.5)

| Metric    | Score  |
| --------- | ------ |
| Accuracy  | 96.67% |
| Precision | 99.82% |
| Recall    | 93.50% |
| F1        | 96.56% |

### Per Language

| Language Pair   | Overall Acc. | Positive Acc. | Negative Acc. |
| --------------- | ------------ | ------------- | ------------- |
| chinese-english | 95.00%  | 90.00%   | 100.00%  |
| chinese-malay   | 97.00%  | 94.00%   | 100.00%  |
| chinese-tamil   | 97.00%  | 94.00%   | 100.00%  |
| english-chinese | 97.00%  | 96.00%   | 98.00%   |
| english-malay   | 94.00%  | 88.00%   | 100.00%  |
| english-tamil   | 95.00%  | 90.00%   | 100.00%  |
| malay-chinese   | 97.00%  | 94.00%   | 100.00%  |
| malay-english   | 96.00%  | 92.00%   | 100.00%  |
| malay-tamil     | 97.00%  | 94.00%   | 100.00%  |
| tamil-chinese   | 100.00% | 100.00%  | 100.00%  |
| tamil-english   | 97.00%  | 94.00%   | 100.00%  |
| tamil-malay     | 98.00%  | 96.00%   | 100.00%  |

### Threshold Sweep

| Threshold | Accuracy   | Precision  | Recall     | F1         |
| --------- | ---------- | ---------- | ---------- | ---------- |
| 0.1       | 99.00%     | 99.66%     | 98.33%     | 98.99%     |
| 0.2       | 98.67%     | 99.66%     | 97.67%     | 98.65%     |
| 0.3       | 98.00%     | 99.66%     | 96.33%     | 97.97%     |
| 0.4       | 97.58%     | 99.65%     | 95.50%     | 97.53%     |
| **0.5**   | **96.67%** | **99.82%** | **93.50%** | **96.56%** |
| 0.6       | 95.50%     | 99.82%     | 91.17%     | 95.30%     |
| 0.7       | 93.67%     | 99.81%     | 87.50%     | 93.25%     |
| 0.8       | 91.17%     | 100.00%    | 82.33%     | 90.31%     |
| 0.9       | 83.83%     | 100.00%    | 67.67%     | 80.72%     |
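A sweep like the one in the table can be reproduced from per-sample probabilities and labels. The sketch below is a generic implementation, not the evaluation script used for this model; the function name `sweep_thresholds` and the toy data are made up for illustration.

```python
from typing import Dict, List, Sequence


def sweep_thresholds(
    probs: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Dict[str, float]]:
    """Compute accuracy, precision, recall and F1 at each threshold.

    labels: 1 = turn complete (positive), 0 = turn incomplete (negative).
    A sample is predicted positive when its probability exceeds the threshold.
    """
    rows = []
    for t in thresholds:
        pairs = [(p > t, y == 1) for p, y in zip(probs, labels)]
        tp = sum(1 for pred, actual in pairs if pred and actual)
        fp = sum(1 for pred, actual in pairs if pred and not actual)
        fn = sum(1 for pred, actual in pairs if not pred and actual)
        tn = sum(1 for pred, actual in pairs if not pred and not actual)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({
            "threshold": t,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return rows


# Toy data for illustration only; these are not samples from the test set.
toy_probs = [0.95, 0.60, 0.30, 0.05, 0.01, 0.70]
toy_labels = [1, 1, 1, 0, 0, 0]
for row in sweep_thresholds(toy_probs, toy_labels, [0.2, 0.5, 0.8]):
    print(row)
```

In the table above, lower thresholds raise recall substantially while precision stays above 99.6%, so a voice agent that tolerates an occasional early response may prefer a threshold below 0.5.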

### Confusion Matrix (threshold = 0.5)

|            | Pred Pos | Pred Neg |
| ---------- | -------- | -------- |
| Actual Pos | 561      | 39       |
| Actual Neg | 1        | 599      |
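The overall metrics at threshold 0.5 follow directly from these four counts. The short check below recomputes them; the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the confusion matrix above.
m = metrics_from_confusion(tp=561, fn=39, fp=1, tn=599)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
# accuracy: 96.67%
# precision: 99.82%
# recall: 93.50%
# f1: 96.56%
```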

### Probability Distribution

| Class                      | Mean   | Median | Min    | Max    |
| -------------------------- | ------ | ------ | ------ | ------ |
| Positive (turn complete)   | 0.8813 | 0.9673 | 0.0063 | 1.0000 |
| Negative (turn incomplete) | 0.0020 | 0.0000 | 0.0000 | 0.7022 |

## Dataset

Tokenized parquet datasets (chinidataset format) are available at [Scicom-intl/turn-detector-Qwen3-0.6B-dataset](https://huggingface.co/datasets/Scicom-intl/turn-detector-Qwen3-0.6B-dataset).

```
turn-detector-Qwen3-0.6B-dataset/
├── train-merged/
├── train/
└── test/
```

## Training

* **Base model:** Qwen/Qwen3-1.7B
* **Training data:** Positive samples only (complete conversations ending with `<|im_end|>`)
* **Loss:** Liger Fused Linear Cross Entropy
* **Attention:** Flash Attention 3
* **Precision:** bfloat16
* **Block size:** 8192 (multipacked)
* **Batch size:** 2 × 16 gradient accumulation steps
* **Learning rate:** 2e-5 (constant)
* **Epochs:** 1

### Training Data Sources

| Dataset                            | Source                                                                                                                                                         |
| ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Call Center Language Switching     | [https://huggingface.co/datasets/Scicom-intl/Call-Center-Language-Switching](https://huggingface.co/datasets/Scicom-intl/Call-Center-Language-Switching)       |
| Function Call                      | [https://huggingface.co/datasets/Scicom-intl/Function-Call](https://huggingface.co/datasets/Scicom-intl/Function-Call)                                         |
| Malaysian Multiturn Chat Assistant | [https://huggingface.co/datasets/mesolitica/Malaysian-Multiturn-Chat-Assistant](https://huggingface.co/datasets/mesolitica/Malaysian-Multiturn-Chat-Assistant) |
| Malaysian Speech Instructions      | [https://huggingface.co/datasets/mesolitica/Malaysian-Speech-Instructions](https://huggingface.co/datasets/mesolitica/Malaysian-Speech-Instructions)           |
