---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
language:
- ms
- en
- zh
- ta
tags:
- turn-detection
- call-center
- code-switching
- multilingual
pipeline_tag: text-generation
---

# Turn Detector Qwen3-1.7B

A fine-tune of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for **real-time turn-end detection** in multilingual call center conversations.

The model predicts `P(<|im_end|>)`, the probability that a speaker has finished their turn. It is designed for low-latency voice agent pipelines (e.g. LiveKit), where it decides when the agent should respond.

## How It Works

Given the conversation so far, the model outputs the probability of `<|im_end|>` as the next token:

- **P(im_end) > 0.5** → speaker is done talking (turn complete)
- **P(im_end) < 0.5** → speaker is still talking (turn incomplete)

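As a rough illustration of the decision rule above, the sketch below turns a vector of next-token logits into P(im_end) with a softmax and then applies the threshold. The function names, the toy logits, and the token index are assumptions made for this example; they are not part of the model's code.

```python
import math

def next_token_prob(logits, token_id):
    """Softmax probability of `token_id` given raw next-token logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[token_id] / sum(exps)

def is_turn_complete(p_im_end, threshold=0.5):
    """Decision rule: a probability above the threshold means the turn is done."""
    return p_im_end > threshold

# Toy 4-token vocabulary; index 3 stands in for <|im_end|> (hypothetical).
toy_logits = [0.1, -1.2, 0.3, 2.5]
p = next_token_prob(toy_logits, token_id=3)
print(round(p, 3), is_turn_complete(p))
```

In the real model the logits span the full vocabulary and the index comes from the tokenizer, as in the Usage section.
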
## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Scicom-intl/Malaysian-Turn-Detector-Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()

# Vocabulary index of the end-of-turn token.
IM_END_ID = tokenizer.convert_tokens_to_ids("<|im_end|>")

def get_turn_end_prob(text):
    # Strip a trailing <|im_end|> so the model predicts it instead of reading it.
    if text.endswith("<|im_end|>"):
        text = text[:-len("<|im_end|>")]
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of <|im_end|> as the next token after the last input position.
    prob = F.softmax(logits[0, -1], dim=-1)[IM_END_ID].item()
    return prob
```

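This card does not state the exact input format that `get_turn_end_prob` expects. The sketch below assumes the ChatML layout used by Qwen chat models, with the final turn left open so that the model can predict `<|im_end|>`. The helper `format_chatml` and the sample conversation are hypothetical.

```python
def format_chatml(turns):
    """Render (role, text) pairs as ChatML, leaving the last turn open.

    Assumption: the model was trained on Qwen-style ChatML text.
    """
    parts = []
    for i, (role, text) in enumerate(turns):
        is_last = i == len(turns) - 1
        closing = "" if is_last else "<|im_end|>\n"
        parts.append(f"<|im_start|>{role}\n{text}{closing}")
    return "".join(parts)

# Illustrative conversation, not taken from the dataset.
conversation = [
    ("assistant", "Hello, how can I help you today?"),
    ("user", "Saya nak tanya pasal bil"),
]
prompt = format_chatml(conversation)
print(prompt)
# prob = get_turn_end_prob(prompt)  # requires the model loaded as shown above
```

If the training data used a different template, the prompt should be built with that template instead.
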
## Eval Results

**Test set:** 1200 samples (600 positive + 600 negative), 50 conversations per language pair.

### Overall (threshold = 0.5)

| Metric    | Score  |
| --------- | ------ |
| Accuracy  | 96.67% |
| Precision | 99.82% |
| Recall    | 93.50% |
| F1        | 96.56% |

### Per Language

| Language Pair   | Overall Accuracy | Positive Accuracy | Negative Accuracy |
| --------------- | ---------------- | ----------------- | ----------------- |
| chinese-english | 95.00%           | 90.00%            | 100.00%           |
| chinese-malay   | 97.00%           | 94.00%            | 100.00%           |
| chinese-tamil   | 97.00%           | 94.00%            | 100.00%           |
| english-chinese | 97.00%           | 96.00%            | 98.00%            |
| english-malay   | 94.00%           | 88.00%            | 100.00%           |
| english-tamil   | 95.00%           | 90.00%            | 100.00%           |
| malay-chinese   | 97.00%           | 94.00%            | 100.00%           |
| malay-english   | 96.00%           | 92.00%            | 100.00%           |
| malay-tamil     | 97.00%           | 94.00%            | 100.00%           |
| tamil-chinese   | 100.00%          | 100.00%           | 100.00%           |
| tamil-english   | 97.00%           | 94.00%            | 100.00%           |
| tamil-malay     | 98.00%           | 96.00%            | 100.00%           |

### Threshold Sweep

| Threshold | Accuracy   | Precision  | Recall     | F1         |
| --------- | ---------- | ---------- | ---------- | ---------- |
| 0.1       | 99.00%     | 99.66%     | 98.33%     | 98.99%     |
| 0.2       | 98.67%     | 99.66%     | 97.67%     | 98.65%     |
| 0.3       | 98.00%     | 99.66%     | 96.33%     | 97.97%     |
| 0.4       | 97.58%     | 99.65%     | 95.50%     | 97.53%     |
| **0.5**   | **96.67%** | **99.82%** | **93.50%** | **96.56%** |
| 0.6       | 95.50%     | 99.82%     | 91.17%     | 95.30%     |
| 0.7       | 93.67%     | 99.81%     | 87.50%     | 93.25%     |
| 0.8       | 91.17%     | 100.00%    | 82.33%     | 90.31%     |
| 0.9       | 83.83%     | 100.00%    | 67.67%     | 80.72%     |

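A sweep like the one in the table above could be produced along these lines. This is a minimal sketch, not the evaluation script behind the reported numbers: the `sweep` helper is hypothetical, the scores and labels are made up, and a sample is treated as positive when its probability is strictly above the threshold.

```python
def sweep(probs, labels, thresholds):
    """Return (threshold, accuracy, precision, recall, f1) for each threshold."""
    rows = []
    for t in thresholds:
        preds = [p > t for p in probs]
        tp = sum(1 for y, h in zip(labels, preds) if y and h)
        fp = sum(1 for y, h in zip(labels, preds) if not y and h)
        fn = sum(1 for y, h in zip(labels, preds) if y and not h)
        tn = len(labels) - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((t, (tp + tn) / len(labels), precision, recall, f1))
    return rows

# Made-up scores for illustration only (not the model's outputs).
toy_probs = [0.97, 0.88, 0.42, 0.05, 0.01, 0.70]
toy_labels = [True, True, True, False, False, False]
for row in sweep(toy_probs, toy_labels, [0.1, 0.5, 0.9]):
    print("t=%.1f acc=%.2f prec=%.2f rec=%.2f f1=%.2f" % row)
```

Raising the threshold trades recall for precision, which matches the trend in the table: fewer completed turns are detected, but false triggers become rarer.
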
### Confusion Matrix (threshold = 0.5)

|            | Pred Pos | Pred Neg |
| ---------- | -------- | -------- |
| Actual Pos | 561      | 39       |
| Actual Neg | 1        | 599      |

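The overall metrics at threshold 0.5 follow directly from these four counts. The short check below recomputes them with the standard definitions; the variable names are chosen for this example.

```python
tp, fn = 561, 39   # actual positives: predicted positive / predicted negative
fp, tn = 1, 599    # actual negatives: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1)]:
    print(f"{name}: {value:.2%}")
```

The printed values agree with the Overall table: 96.67%, 99.82%, 93.50%, and 96.56%.
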
### Probability Distribution

| Class                      | Mean   | Median | Min    | Max    |
| -------------------------- | ------ | ------ | ------ | ------ |
| Positive (turn complete)   | 0.8813 | 0.9673 | 0.0063 | 1.0000 |
| Negative (turn incomplete) | 0.0020 | 0.0000 | 0.0000 | 0.7022 |

## Dataset

Tokenized parquet datasets (chinidataset format) are available at [Scicom-intl/turn-detector-Qwen3-0.6B-dataset](https://huggingface.co/datasets/Scicom-intl/turn-detector-Qwen3-0.6B-dataset).

```
turn-detector-Qwen3-0.6B-dataset/
├── train-merged/
├── train/
└── test/
```

## Training

* **Base model:** Qwen/Qwen3-1.7B
* **Training data:** Positive samples only (complete conversations ending with `<|im_end|>`)
* **Loss:** Liger Fused Linear Cross Entropy
* **Attention:** Flash Attention 3
* **Precision:** bfloat16
* **Block size:** 8192 (multipacked)
* **Batch size:** 2 x 16 gradient accumulation
* **Learning rate:** 2e-5 (constant)
* **Epochs:** 1

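To put the batch settings above in concrete terms, the arithmetic below reads "2 x 16" as a micro-batch of 2 packed sequences with 16 gradient accumulation steps. That reading is an assumption, as is treating every block as fully packed. The number of devices is not stated, so the figures are per data-parallel worker.

```python
block_size = 8192        # tokens per packed sequence (from the list above)
micro_batch = 2          # sequences per forward/backward pass (assumed reading)
grad_accum_steps = 16    # passes accumulated before each optimizer step

sequences_per_step = micro_batch * grad_accum_steps
max_tokens_per_step = sequences_per_step * block_size
print(sequences_per_step, max_tokens_per_step)
```

Under these assumptions each optimizer step sees 32 packed sequences, or at most 262,144 tokens, per worker.
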
### Training Data Sources

| Dataset                            | Source |
| ---------------------------------- | ------ |
| Call Center Language Switching     | [Scicom-intl/Call-Center-Language-Switching](https://huggingface.co/datasets/Scicom-intl/Call-Center-Language-Switching) |
| Function Call                      | [Scicom-intl/Function-Call](https://huggingface.co/datasets/Scicom-intl/Function-Call) |
| Malaysian Multiturn Chat Assistant | [mesolitica/Malaysian-Multiturn-Chat-Assistant](https://huggingface.co/datasets/mesolitica/Malaysian-Multiturn-Chat-Assistant) |
| Malaysian Speech Instructions      | [mesolitica/Malaysian-Speech-Instructions](https://huggingface.co/datasets/mesolitica/Malaysian-Speech-Instructions) |