Initialize project; model provided by the ModelHub XC community

Model: livekit/turn-detector
Source: Original Platform
ModelHub XC
2026-06-03 17:07:13 +08:00
commit 252e9f870f
15 changed files with 294496 additions and 0 deletions

37
.gitattributes vendored Normal file

@@ -0,0 +1,37 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text
model_quantized.onnx filter=lfs diff=lfs merge=lfs -text

115
LICENSE Normal file

@@ -0,0 +1,115 @@
LIVEKIT MODEL LICENSE AGREEMENT
1. Introduction
LiveKit Incorporated ("LiveKit") is making available its proprietary models for
use pursuant to the terms and conditions of this Agreement. As further
described below, you may use these LiveKit models freely but can only use them
together with the LiveKit Agents framework. You cannot use the LiveKit models
on a standalone basis or with any other frameworks.
BY CLICKING "I ACCEPT," OR BY DOWNLOADING, INSTALLING, OR OTHERWISE ACCESSING
OR USING THE LIVEKIT MATERIALS, YOU AGREE THAT YOU HAVE READ AND UNDERSTOOD,
AND, AS A CONDITION TO YOUR USE OF THE LIVEKIT MATERIALS, YOU AGREE TO BE
BOUND BY, THE FOLLOWING TERMS AND CONDITIONS.
2. Definitions
"Agreement" means this LiveKit Model License Agreement.
"Documentation" means the specifications, manuals, and documentation
accompanying any LiveKit Model and distributed by LiveKit.
"Licensee" or "you" means the individual or entity agreeing to be bound by
this Agreement.
"LiveKit Agents" means the proprietary LiveKit software framework for building
real-time multimodal AI applications with programmable backend participants.
"LiveKit Materials" means, collectively, the LiveKit Models and Documentation.
"LiveKit Model" means any of LiveKit's proprietary software models or
algorithms, including machine-learning software code, model weights,
inference-enabling software code, training-enabling software code, and
fine-tuning enabling software code. Any derivative works of a LiveKit Model,
whether developed by LiveKit, you, or any third party, will be deemed the
"LiveKit Model" for the purposes of this Agreement.
3. License Rights
Right to Use LiveKit Materials. Subject to the terms and conditions of this
Agreement, including the requirements of Section 3.b, LiveKit grants you a
nonexclusive, nontransferable, worldwide, royalty-free license under LiveKit's
intellectual property rights to use, reproduce, distribute, copy, and create
derivative works of the LiveKit Materials.
Limitation on Use. As a condition to your use of the LiveKit Materials, you
agree: (i) not to use any LiveKit Models on a standalone basis or with any
frameworks other than LiveKit Agents; (ii) not to use any LiveKit Materials or
any output from, or results of using, LiveKit Models (including any derivative
works thereof) to improve or otherwise develop any other models that are not
LiveKit Models; or (iii) distribute or otherwise make available the LiveKit
Materials (including any derivative works thereof) except (x) pursuant to the
terms of this Agreement, and (y) you reproduce the above copyright notice.
4. Intellectual Property
The LiveKit Materials are owned by LiveKit and its licensors. Except for the
rights granted to you under this Agreement, all rights are reserved and no
other express or implied rights are granted.
You will own any derivative works that you created from the LiveKit Materials,
subject to the terms of this Agreement.
5. Disclaimer
UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, LIVEKIT PROVIDES
THE LIVEKIT MATERIALS, AND ANY OUTPUT OR RESULTS THEREFROM, ON AN "AS IS"
BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE,
NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU
ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR
REDISTRIBUTING THE LIVEKIT MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
USE OF THE LIVEKIT MATERIALS AND ANY OUTPUT AND RESULTS.
6. Limitation of Liability
IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE),
CONTRACT, OR OTHERWISE, UNLESS REQUIRED BY APPLICABLE LAW (SUCH AS DELIBERATE
AND GROSSLY NEGLIGENT ACTS) OR AGREED TO IN WRITING, WILL LIVEKIT BE LIABLE TO
YOU FOR INDIRECT DAMAGES, INCLUDING ANY SPECIAL, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES OF ANY CHARACTER ARISING AS A RESULT OF THIS AGREEMENT OR OUT OF THE
USE OR INABILITY TO USE THE LIVEKIT MATERIALS OR ANY OUTPUT OR RESULTS
THEREFROM (INCLUDING BUT NOT LIMITED TO DAMAGES FOR LOSS OF GOODWILL, WORK
STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER COMMERCIAL
DAMAGES OR LOSSES), EVEN IF LIVEKIT HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
7. Trademarks
This Agreement does not grant permission to use the trade names, trademarks,
service marks, or product names of LiveKit, except as required for reasonable
and customary use in describing the origin of the LiveKit Materials.
8. Term and Termination
The term of this Agreement commences upon your acceptance of this Agreement
and continues in effect until you cease using the LiveKit Materials or it is
terminated by either party (on immediate written notice to the other party).
This Agreement will automatically terminate if you breach any of its terms.
Upon termination, you must immediately cease all use of the LiveKit Materials.
Sections 4, 5, 6, and 9 will survive termination.
9. Governing Law and Venue
This Agreement is subject to the laws of the State of California, without
regard to its conflict of laws principles. The UN Convention on Contracts for
the International Sale of Goods does not apply to this Agreement. The courts
located in San Francisco, California, have exclusive jurisdiction for any
dispute arising out of this Agreement.
+ + + +
Last Updated: November 25, 2024

185
README.md Normal file

@@ -0,0 +1,185 @@
---
language:
- en
- es
- fr
- de
- it
- pt
- nl
- zh
- ja
- ko
- id
- tr
- ru
- hi
license: other
license_name: livekit-model-license
license_link: LICENSE
library_name: transformers
pipeline_tag: text-classification
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- voice-ai
- turn-detection
- end-of-utterance
- end-of-turn
- conversational-ai
- livekit
- onnx
- quantized
- knowledge-distillation
---
# LiveKit Turn Detector
An open-weights language model for context-aware end-of-utterance (EOU) detection in voice AI applications. The model predicts whether a user has finished speaking based on the semantic content of their transcribed speech, providing a critical complement to voice activity detection (VAD) systems.
> **📖 For installation, usage examples, and integration guides, see the [LiveKit documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/).**
## Table of Contents
- [Overview](#overview)
- [Model Variants](#model-variants)
- [How It Works](#how-it-works)
- [Architecture and Training](#architecture-and-training)
- [Supported Languages](#supported-languages)
- [Benchmarks](#benchmarks)
- [Usage](#usage)
- [Deployment Requirements](#deployment-requirements)
- [Limitations](#limitations)
- [License](#license)
- [Resources](#resources)
## Overview
Traditional voice agents rely on voice activity detection (VAD) to determine when a user has finished speaking. VAD works by detecting the presence or absence of speech in an audio signal and applying a silence timer. While effective for detecting pauses, VAD lacks language understanding and frequently causes false positives. For example, a user who says *"I need to think about that for a moment..."* and then pauses will be interrupted by a VAD-only system, even though they clearly intend to continue.
This model adds semantic understanding to the turn detection process. It analyzes the transcribed text of a conversation in real time and predicts the probability that the user has completed their turn. When integrated into a voice pipeline alongside VAD, it substantially reduces unwanted interruptions while maintaining responsiveness.
The model is particularly effective in scenarios involving structured data input — such as dictating addresses, phone numbers, email addresses, and credit card numbers — where natural pauses between segments do not indicate completion.
## Model Variants
Two variants are available: **Multilingual** (recommended) and **English-only** (deprecated). Both are distributed as INT8-quantized ONNX models (`model_q8.onnx`) optimized for CPU inference.
> **⚠️ The English-only model (`EnglishModel`) is deprecated.** Use the **multilingual model (`MultilingualModel`)** for all new projects, including English-only applications. The multilingual model provides better accuracy across all languages — including English — thanks to knowledge distillation from a larger teacher model and an expanded training dataset. The English-only variant will not receive further updates.
## How It Works
The model operates on transcribed text from a speech-to-text (STT) system, not raw audio.
1. **Input**: The recent conversation history (up to **6 turns**, truncated to **128 tokens**) is formatted using the [Qwen chat template](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) with `<|im_start|>` / `<|im_end|>` delimiters. The final user message is left *without* the closing `<|im_end|>` token.
2. **Prediction**: The model predicts the probability of the `<|im_end|>` token appearing next. A **high probability** indicates the user has likely finished their utterance. A **low probability** indicates they are likely to continue.
3. **Thresholding**: Per-language thresholds (stored in `languages.json`) convert the raw probability into a binary decision. These thresholds are tuned to balance responsiveness and accuracy for each supported language.
4. **Integration with VAD**: In the LiveKit Agents framework, the model works alongside the [Silero VAD](https://docs.livekit.io/agents/logic/turns/vad/) plugin. VAD handles speech presence detection and interruption triggering, while this model provides the semantic signal for when to commit a turn.
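The first three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the plugin's implementation: the function names are hypothetical, the prompt layout assumes the usual Qwen-style `role` plus newline format, the `>=` comparison is an assumption, and the 128-token truncation is omitted because it needs the real tokenizer.

```python
import math

IM_START, IM_END = "<|im_start|>", "<|im_end|>"
MAX_TURNS = 6  # the README states up to 6 turns of history are used


def format_open_turn(messages):
    """Render recent messages, leaving the final one without <|im_end|>."""
    recent = messages[-MAX_TURNS:]
    rendered = "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in recent
    )
    # Cut at the last <|im_end|> so the model has to predict it.
    return rendered[: rendered.rindex(IM_END)]


def token_probability(logits, token_id):
    """Softmax probability of one token id, given next-token logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[token_id] / sum(exps)


def is_end_of_turn(probability, threshold):
    """Binary decision; using >= here is an assumption of this sketch."""
    return probability >= threshold
```

In a real pipeline the logits would come from the ONNX model's output at the last position, and `token_id` would be the tokenizer's id for `<|im_end|>`.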
### Text Preprocessing
The **multilingual** variant applies the following normalization before inference:
- NFKC unicode normalization
- Lowercasing
- Punctuation removal (preserving apostrophes and hyphens)
- Whitespace collapsing
The **English-only** variant passes raw transcribed text without normalization.
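The multilingual normalization steps can be approximated with the standard library alone. The sketch below is an assumption about how those four steps compose (the function name `normalize_text` is hypothetical, and the plugin's exact punctuation rules may differ); it treats every character in a Unicode punctuation category as removable, except the ASCII apostrophe and hyphen.

```python
import re
import unicodedata

KEEP = {"'", "-"}  # punctuation the README says is preserved


def normalize_text(text: str) -> str:
    # 1. NFKC normalization, 2. lowercasing
    text = unicodedata.normalize("NFKC", text).lower()
    # 3. remove punctuation (Unicode categories starting with "P"),
    #    keeping apostrophes and hyphens
    text = "".join(
        ch
        for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )
    # 4. collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("Hello,   World! It's  state-of-the-art.")` yields `hello world it's state-of-the-art`.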
## Architecture and Training
### Base Model
Both variants are fine-tuned from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), selected for its strong performance on this task while enabling low-latency CPU inference.
### Knowledge Distillation
A **Qwen2.5-7B-Instruct** model was first fine-tuned as a teacher on end-of-turn prediction, and its knowledge was then distilled into the 0.5B student model. The distilled student approaches teacher-level accuracy while keeping the efficiency of the smaller architecture, and its training converged after approximately 1,500 steps.
### Training Data
The training dataset is a mix of:
- **Real call center transcripts** covering diverse conversational patterns
- **Synthetic dialogues** emphasizing structured data input — addresses, email addresses, phone numbers, and credit card numbers
- **Multi-format STT outputs** to handle provider variation (e.g., "forty two" vs. "42"), ensuring consistent predictions across different STT engines without runtime overhead
Although structured data enhancements were added only to the English training set, performance improvements generalized across languages due to the multilingual knowledge encoded in the Qwen2.5 base model.
### Quantization
The trained model is exported to ONNX format and quantized to INT8 (`model_q8.onnx`), enabling efficient CPU-only inference with ONNX Runtime.
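To make the INT8 step concrete, here is a small sketch of symmetric 8-bit weight quantization, the scheme suggested by `weights_dtype: QInt8` and `weights_symmetric: true` in this commit's `ort_config.json`. It is an illustration of the arithmetic only, not ONNX Runtime's implementation; the function names and the per-tensor scale are assumptions of the sketch.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]
```

Each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), which is the precision traded for a model roughly a quarter the size of its float32 form.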
## Supported Languages
The multilingual model supports 14 languages. The model relies on the STT provider to report the detected language, which is then used to select the appropriate per-language threshold.
English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, Russian, Hindi
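Threshold selection by reported language might look like the following sketch. The schema of `languages.json` is not shown in this document, so the flat mapping, the placeholder threshold values, and the `threshold_for` helper are all hypothetical; the fallback from a regional tag such as `en-US` to its base language is likewise an assumption.

```python
# Placeholder values for illustration only; real thresholds live in languages.json.
EXAMPLE_THRESHOLDS = {"en": 0.5, "es": 0.6, "hi": 0.7}


def threshold_for(language, thresholds):
    """Return the threshold for an STT-reported language tag, or None."""
    if not language:
        return None  # STT did not report a language
    code = language.lower().replace("_", "-")
    if code in thresholds:
        return thresholds[code]
    # Fall back from a regional tag ("en-us") to its base language ("en").
    return thresholds.get(code.split("-")[0])
```

Returning `None` for an unsupported language lets the caller fall back to VAD-only behavior instead of applying a threshold tuned for a different language.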
## Benchmarks
### Detection Accuracy (Multilingual Variant)
- **True positive** — the model correctly identifies the user has finished speaking.
- **True negative** — the model correctly identifies the user will continue speaking.
| Language | True Positive Rate | True Negative Rate |
|---|---|---|
| Hindi | 99.4% | 96.3% |
| Korean | 99.3% | 94.5% |
| French | 99.3% | 88.9% |
| Indonesian | 99.3% | 89.4% |
| Japanese | 99.3% | 88.8% |
| Dutch | 99.3% | 88.1% |
| Russian | 99.3% | 88.0% |
| German | 99.3% | 87.8% |
| Portuguese | 99.4% | 87.4% |
| Turkish | 99.3% | 87.3% |
| English | 99.3% | 87.0% |
| Chinese | 99.3% | 86.6% |
| Spanish | 99.3% | 86.0% |
| Italian | 99.3% | 85.1% |
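For readers reproducing these metrics on their own data, the two rates follow the standard definitions: true positive rate is TP / (TP + FN) and true negative rate is TN / (TN + FP). The helper below is a generic sketch with a hypothetical name; it does not reproduce the evaluation set or procedure behind the table.

```python
def detection_rates(labels, predictions):
    """Return (true positive rate, true negative rate).

    labels[i] is True when the user had actually finished speaking;
    predictions[i] is True when the model predicted end of turn.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr
```

A low true negative rate corresponds to premature interruptions, while a low true positive rate corresponds to the agent waiting too long to respond.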
### Improvement Over Prior Version
The multilingual v0.4.1 release achieved a **39.23% relative improvement** in handling structured inputs (emails, addresses, phone numbers, credit card numbers) compared to the prior version, reducing premature interruptions during data collection scenarios.
## Usage
The model is designed for use as a turn detection plugin within the [LiveKit Agents](https://github.com/livekit/agents) framework.
For complete installation instructions, code examples (Python and Node.js), and configuration options, see the **[LiveKit turn detector plugin documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**.
For broader context on how turn detection fits into the voice pipeline — including VAD configuration, interruption handling, and manual turn control — see the **[Turns overview](https://docs.livekit.io/agents/logic/turns/)**.
## Deployment Requirements
- **Runtime**: CPU-only (no GPU required). Uses [ONNX Runtime](https://onnxruntime.ai/) with the `CPUExecutionProvider`.
- **RAM**: <500 MB for the multilingual model.
- **Instance type**: Use compute-optimized instances (e.g., AWS c6i, c7i). Avoid burstable instances (e.g., AWS t3, t4g) to prevent inference timeouts from CPU credit exhaustion.
- **LiveKit Cloud**: The model is deployed globally on LiveKit Cloud. Agents running there automatically use the optimized remote inference service with no local resource requirements.
## Limitations
- **Text-only input**: The model operates on STT-transcribed text and cannot incorporate prosodic cues such as pauses, intonation, or emphasis. Future versions may integrate multimodal audio features.
- **STT dependency**: Prediction quality depends on the accuracy and output format of the upstream STT provider. Mismatches between training and deployment STT formats may degrade performance.
- **Context window**: Limited to 128 tokens across a maximum of 6 conversation turns.
- **Language coverage**: Currently supports 14 languages. Performance on unsupported languages is undefined.
- **Realtime model compatibility**: Cannot be used with audio-native realtime models (e.g., OpenAI Realtime API) without adding a separate STT service, which incurs additional cost and latency.
## License
This model is released under the [LiveKit Model License](./LICENSE).
## Resources
- **[Documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**: Full plugin documentation, installation, and integration guide.
- **[Turns Overview](https://docs.livekit.io/agents/logic/turns/)**: How turn detection fits into the LiveKit Agents voice pipeline.
- **[Blog: Improved End-of-Turn Model](https://blog.livekit.io/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/)**: Technical deep dive on the multilingual distillation approach and benchmarks.
- **[Blog: Using a Transformer for Turn Detection](https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/)**: Original blog post introducing the concept and architecture.
- **[Video: LiveKit Turn Detector](https://youtu.be/OZG0oZKctgw)**: Overview video demonstrating the plugin.
- **[GitHub: Plugin Source](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector)**: Source code for the `livekit-plugins-turn-detector` package.
- **[PyPI](https://pypi.org/project/livekit-plugins-turn-detector/)** | **[npm](https://www.npmjs.com/package/@livekit/agents-plugin-livekit)**: Package registries.

4
added_tokens.json Normal file

@@ -0,0 +1,4 @@
{
"<|assistant|>": 49153,
"<|user|>": 49152
}

32
config.json Normal file

@@ -0,0 +1,32 @@
{
"_attn_implementation_autoset": true,
"_name_or_path": "/tmp/tmpdhw4_gdh",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 0,
"eos_token_id": 0,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 576,
"initializer_range": 0.041666666666666664,
"intermediate_size": 1536,
"is_llama_config": true,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 9,
"num_hidden_layers": 30,
"num_key_value_heads": 3,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_interleaved": false,
"rope_scaling": null,
"rope_theta": 100000,
"tie_word_embeddings": true,
"transformers_version": "4.46.3",
"use_cache": true,
"vocab_size": 49154
}

1
configuration.json Normal file

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-generation", "allow_remote": true}

6
generation_config.json Normal file

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 0,
"eos_token_id": 0,
"transformers_version": "4.46.3"
}

48901
merges.txt Normal file

File diff suppressed because it is too large

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2f7b4c93c1cdb6d1e858b01f63e5f0f15bdb979dd0d177e8996a244c80e03925
size 538095016

3
model_quantized.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4e685767c3643b0363c9f826a98325683f29e9c7d550162c8e8740ba33aa31aa
size 165035487

33
ort_config.json Normal file

@@ -0,0 +1,33 @@
{
"one_external_file": true,
"opset": null,
"optimization": {},
"quantization": {
"activations_dtype": "QUInt8",
"activations_symmetric": false,
"format": "QOperator",
"is_static": false,
"mode": "IntegerOps",
"nodes_to_exclude": [],
"nodes_to_quantize": [],
"operators_to_quantize": [
"Conv",
"MatMul",
"Attention",
"LSTM",
"Gather",
"Transpose",
"EmbedLayerNormalization"
],
"per_channel": false,
"qdq_add_pair_to_weight": false,
"qdq_dedicated_pair": false,
"qdq_op_type_per_channel_support_to_axis": {
"MatMul": 1
},
"reduce_range": false,
"weights_dtype": "QInt8",
"weights_symmetric": true
},
"use_external_data_format": false
}

36
special_tokens_map.json Normal file

@@ -0,0 +1,36 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|user|>",
"<|assistant|>"
],
"bos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

244967
tokenizer.json Normal file

File diff suppressed because it is too large

172
tokenizer_config.json Normal file

@@ -0,0 +1,172 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<repo_name>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<reponame>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<file_sep>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<filename>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<gh_stars>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"8": {
"content": "<issue_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"9": {
"content": "<issue_comment>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"10": {
"content": "<issue_closed>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<jupyter_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<jupyter_text>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<jupyter_code>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"14": {
"content": "<jupyter_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"15": {
"content": "<jupyter_script>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"16": {
"content": "<empty_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"49152": {
"content": "<|user|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"49153": {
"content": "<|assistant|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|user|>",
"<|assistant|>"
],
"bos_token": "<|endoftext|>",
"chat_template": "{% for message in messages %}{{'<|im_start|>' + '<|' + message['role'] + '|>' + message['content'] + '<|im_end|>'}}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"model_max_length": 8192,
"pad_token": "<|endoftext|>",
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<|endoftext|>",
"vocab_size": 49152
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long