Initialize project; model provided by the ModelHub XC community

Model: newmindai/Mecellem-Qwen3-1.7B-TR
Source: Original Platform
ModelHub XC
2026-04-12 07:45:01 +08:00
commit 215278806c
16 changed files with 152171 additions and 0 deletions

40
.gitattributes vendored Normal file

@@ -0,0 +1,40 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_7b_qwen_armo.png filter=lfs diff=lfs merge=lfs -text
comparison_rewards_by_token_length-filtered.png filter=lfs diff=lfs merge=lfs -text
qwen3-1.7_dataset.png filter=lfs diff=lfs merge=lfs -text
qwen3-1.7b_loss.png filter=lfs diff=lfs merge=lfs -text

3
1_7b_qwen_armo.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bc162d2207f95e7b68810665651af040c5510de50d1d375c05b16e8b87a30b62
size 107771

267
README.md Normal file

@@ -0,0 +1,267 @@
---
base_model: Qwen/Qwen3-1.7B
language:
- tr
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- turkish
- legal
- turkish-legal
- mecellem
- qwen
- decoder-only
- continual-pretraining
- TRUBA
- MN5
---
# Mecellem-Qwen3-1.7B-TR
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
Mecellem-Qwen3-1.7B-TR is a Turkish legal language model presented in [Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain](https://huggingface.co/papers/2601.16018).
**Resources:**
- **Code:** [GitHub Repository](https://github.com/newmindai/mecellem-models)
- **Paper:** [arXiv:2601.16018](https://arxiv.org/abs/2601.16018)
## Model Description
Mecellem-Qwen3-1.7B-TR is a Turkish legal language model adapted through Continual Pre-training (CPT) on Turkish legal and official texts. The model is based on the Qwen3-1.7B decoder architecture (1.7B parameters) and was trained with a four-phase curriculum learning strategy designed specifically for Turkish linguistic complexity. The CPT process progressively shifts from general-purpose texts to domain-specific legal content, achieving a 36.2% perplexity reduction on Turkish legal text compared to the base Qwen3-1.7B model.
**Key Features:**
- Continual pre-training on approximately 250 billion tokens across four phases
- Four-phase curriculum learning:
  - Phase 1: ~3.7B tokens
  - Phase 2: ~57B tokens
  - Phase 3: ~165B tokens
  - Phase 4: ~24.9B tokens
- Dataset includes Turkish legal sources (Yargıtay, Danıştay, YÖKTEZ) and general Turkish web data (FineWeb2, CulturaX)
- Preserves general language capabilities while injecting domain-specific legal knowledge
**Model Type:** Decoder-only Language Model
**Parameters:** 1.7B
**Base Model:** Qwen/Qwen3-1.7B
**Architecture:** Qwen3 decoder with grouped query attention (GQA)
### Architecture Details
- **Max Position Embeddings:** 40,960 tokens
- **Number of Layers:** 28 transformer layers
- **Hidden Size:** 2,048
- **FFN Hidden Size:** 6,144
- **Number of Heads:** 16
- **Number of KV Heads (GQA):** 8
- **Activation Function:** SwiGLU
- **Position Encodings:** RoPE (Rotary Position Embeddings)
- **Layer Norm:** RMSNorm
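As a quick cross-check, these values can be read back from the config.json shipped in this repository; a minimal sketch (requires access to the model hub):
```python
from transformers import AutoConfig

# Read the shipped config.json and confirm the numbers listed above.
cfg = AutoConfig.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")
assert cfg.num_hidden_layers == 28
assert cfg.hidden_size == 2048
assert cfg.intermediate_size == 6144
assert cfg.num_attention_heads == 16
assert cfg.num_key_value_heads == 8      # GQA: 2 query heads per KV head
assert cfg.max_position_embeddings == 40960
print(cfg.model_type)  # qwen3
```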
### Training Details
**Continual Pre-training (CPT):**
- **Total Training Tokens:** ~250 billion tokens (250,739,476,454 tokens across four phases)
- **Training Method:** Four-phase curriculum learning
- **Framework:** NVIDIA NeMo with Megatron-Core
- **Hardware:** MareNostrum 5 supercomputer (BSC), H100 GPUs
- **Precision:** BF16
**Dataset Composition:**
- **Legal Sources:**
- Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens
- Council of State (Danıştay): 151K sequences, ~0.11B tokens
- Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing)
- **General Turkish Sources:**
- FineWeb2: General Turkish web data
- CulturaX: Multilingual corpus (Turkish subset)
- Total general Turkish: 212M sequences, ~96.17B tokens
- **Additional Categories:** English, Mathematics, Python code, multilingual content (Spanish, Arabic, Russian, Chinese)
**Phase 1 (~3.7B tokens):**
- Focus: Short, general-purpose Turkish texts
- Purpose: Adapt model to Turkish language patterns while maintaining stability
- Learning Rate: Higher with extended warmup
- Dataset: Academic-focused data with semantic deduplication and FineWeb quality filtering
**Phase 2 (~57B tokens):**
- Focus: Legal content with domain-specific terminology
- Includes: Court decisions, legal articles, regulatory documents
- Data Replay: YÖKTEZ academic legal data from Phase 1
- Dataset: Lighter pipeline with FineWeb quality filtering, preserving topical diversity
**Phase 3 (~165B tokens):**
- Focus: Long, structurally complex normative texts
- Includes: Full court decisions, legislative documents, academic legal theses
- Purpose: Refine the model's understanding of legal reasoning patterns
- Dataset: Long-form documents with merged consecutive pages
**Phase 4 (~24.9B tokens):**
- Focus: Extended domain-specific refinement
- Includes: Mixed complexity documents
- Purpose: Consolidate knowledge and improve generalization
**Training Hyperparameters:**
- Sequence Length: 4,096 tokens
- Optimizer: Adam with cosine learning rate schedule (sketched after this list)
- Max Learning Rate: 5×10⁻⁵
- Min Learning Rate: 5×10⁻⁶
- Weight Decay: 0.01
- Warmup Steps: Phase-dependent (200-2,340 steps)
- Precision: BF16 mixed precision
- Framework: NVIDIA NeMo with Megatron-Core
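The warmup-then-cosine schedule above can be sketched as follows; a minimal illustration, where the exact NeMo/Megatron scheduler internals and per-phase step counts are assumptions, not published values:
```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int = 200,
               max_lr: float = 5e-5, min_lr: float = 5e-6) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr.

    warmup_steps is phase-dependent (200-2,340 per the card);
    total_steps here is illustrative, not a published value.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))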
**Hardware Infrastructure:**
- **System:** MareNostrum 5 ACC partition at Barcelona Supercomputing Center (BSC)
- **Node Configuration:** Each node equipped with 4× NVIDIA Hopper H100 64GB GPUs (SXM), 80 CPU cores, 512GB DDR5 memory
- **Interconnect:** 800 Gb/s InfiniBand for distributed training
- **GPU Interconnect:** NVLink for intra-node GPU communication (4 GPUs per node connected via NVLink)
- **Distributed Training:** Data-parallel multi-node and multi-GPU distributed architecture with 4 GPUs per node
- **InfiniBand Network:** Enabled efficient handling of the large-scale token flow and kept the long-running CPT training scalable and stable
- **Phase-Specific Hardware:**
- **Phase 1:** 50 nodes, 200 GPUs, ~3.7B tokens, 3.77M tokens/sec throughput, 20.7% median MFU
- **Phase 2:** 50 nodes, 200 GPUs, ~57B tokens, 3.59M tokens/sec throughput, 20.7% median MFU
- **Phase 3:** 100 nodes, 400 GPUs, ~165B tokens, 7.35M tokens/sec throughput, 20.3% median MFU
- **Phase 4:** 50 nodes, 200 GPUs, ~24.9B tokens, 3.25M tokens/sec throughput, 20.6% median MFU
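As a rough sanity check, the throughput and MFU figures above are mutually consistent under the standard ~6N FLOPs-per-token training estimate and an assumed H100 dense BF16 peak of ~989 TFLOP/s; neither constant comes from the model card:
```python
# Rough MFU sanity check. Assumptions (not from the card): training cost
# ~6*N FLOPs per token, H100 SXM dense BF16 peak ~989e12 FLOP/s.
N = 1.7e9
flops_per_token = 6 * N
peak_per_gpu = 989e12

for phase, tokens_per_sec, gpus in [("1", 3.77e6, 200), ("2", 3.59e6, 200),
                                    ("3", 7.35e6, 400), ("4", 3.25e6, 200)]:
    mfu = tokens_per_sec * flops_per_token / (gpus * peak_per_gpu)
    print(f"Phase {phase}: estimated MFU ~ {mfu:.1%}")
# Prints roughly 17-19% per phase, broadly consistent with the ~20%
# median MFU reported (the simple 6N estimate ignores attention FLOPs).
```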
**Catastrophic Forgetting Mitigation:**
- Curriculum learning: Progressive transition from general to specialized knowledge
- Replay buffer: YÖKTEZ data from Phase 1 included in Phase 2
- Conservative learning rates and extended warmup periods
**Performance:** Achieved a 36.2% perplexity reduction on Turkish legal text compared to the base Qwen3-1.7B model.
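Perplexity here is the usual exponential of the mean token-level cross-entropy. A minimal sketch of how such a comparison can be run; the evaluation text and protocol below are placeholders, not the paper's setup:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def perplexity(model_id: str, text: str) -> float:
    # exp(mean cross-entropy) of the text under the model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

sample = "..."  # placeholder for a Turkish legal evaluation text
base = perplexity("Qwen/Qwen3-1.7B", sample)
cpt = perplexity("newmindai/Mecellem-Qwen3-1.7B-TR", sample)
print(f"perplexity reduction: {1 - cpt / base:.1%}")
```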
### Training Visualization
The following visualizations show the model's training progress and dataset distribution:
![Dataset Distribution](qwen3-1.7_dataset.png)
*Qwen3-1.7B CPT Dataset Distribution across Four Phases. The curriculum learning strategy progressively introduces more complex legal content.*
![Training Loss](qwen3-1.7b_loss.png)
*Qwen3-1.7B CPT Training and Validation Loss Across Four Phases. The model shows consistent improvement throughout all training phases.*
### Benchmark Performance
The model was evaluated using the Muhakim reward model on Turkish legal tasks:
![Benchmark Performance](1_7b_qwen_armo.png)
*Benchmark Performance of 1.7B Decoder-Only Models Across Context Lengths Using the Muhakim Reward Model. Mecellem-Qwen3-1.7B-TR consistently outperforms the base Qwen3-1.7B model across all five legal quality objectives, with particularly pronounced gains for depth of coverage, statute reference usage, and legal accuracy.*
### Rewards Comparison Analysis
The following visualization compares rewards across token lengths for the base and CPT models:
![Rewards Comparison](comparison_rewards_by_token_length-filtered.png)
*Rewards Comparison: Base vs CPT Models Across Token Lengths. Mecellem-Qwen3-1.7B-TR shows consistent improvements over the base model across all context length settings, demonstrating the effectiveness of Turkish legal domain adaptation.*
## Usage
### Installation
```bash
pip install transformers torch
```
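Note that the shipped config.json records `transformers_version` 4.53.0, so pinning at least that release (e.g. `pip install "transformers>=4.53.0"`) is a reasonable assumption for full compatibility.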
### Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")
model = AutoModelForCausalLM.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")

# Example prompt
prompt = "Türk hukuk sisteminde sözleşme feshi"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Chat Format
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")
model = AutoModelForCausalLM.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")

messages = [
    {"role": "user", "content": "Türk hukuk sisteminde sözleşme feshi nasıl yapılır?"}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

# Generate response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
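The bundled chat_template.jinja also honors an `enable_thinking` flag (see the template file in this repository). Continuing the snippet above, passing `enable_thinking=False` through `apply_chat_template` makes the template emit an empty `<think>` block:
```python
# Continues the Chat Format example above. enable_thinking is a variable
# consumed by the bundled chat_template.jinja.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # template emits an empty <think>...</think>
)
```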
## Use Cases
- Turkish legal text generation
- Legal document summarization
- Legal question answering
- Legal text completion
- Domain-specific language modeling for Turkish legal domain
- Retrieval-Augmented Generation (RAG) applications (see the sketch below)
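For the RAG use case, a minimal prompt-stuffing sketch, reusing the tokenizer and model loaded in the Usage section; the retrieved passages are placeholders and the retriever itself is out of scope:
```python
# Minimal RAG prompt-stuffing sketch. `retrieved` stands in for whatever
# retriever is used (vector store, BM25, ...); passages are placeholders.
retrieved = ["<ilgili mevzuat pasajı>", "<ilgili içtihat pasajı>"]
question = "Sözleşmenin feshi hangi koşullarda mümkündür?"
user_msg = ("Aşağıdaki kaynaklara dayanarak yanıtla:\n\n"
            + "\n\n".join(retrieved) + f"\n\nSoru: {question}")
messages = [{"role": "user", "content": user_msg}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                          skip_special_tokens=True)
print(answer)
```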
## Acknowledgments
This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.
The numerical calculations reported in this work were performed in part at the TÜBİTAK ULAKBİM High Performance and Grid Computing Center (TRUBA resources). The authors also gratefully acknowledge MINERVA Support for expert guidance and collaboration opportunities in HPC-AI integration.
## References
If you use this model, please cite our paper:
```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
### Base Model References
```bibtex
@article{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2505.09388},
  year={2025}
}
```

28
added_tokens.json Normal file

@@ -0,0 +1,28 @@
{
"</think>": 151668,
"</tool_call>": 151658,
"</tool_response>": 151666,
"<think>": 151667,
"<tool_call>": 151657,
"<tool_response>": 151665,
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|file_sep|>": 151664,
"<|fim_middle|>": 151660,
"<|fim_pad|>": 151662,
"<|fim_prefix|>": 151659,
"<|fim_suffix|>": 151661,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|repo_name|>": 151663,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}

85
chat_template.jinja Normal file

@@ -0,0 +1,85 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set content = message.content %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is defined and message.reasoning_content is not none %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in message.content %}
{%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
{%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}

3
comparison_rewards_by_token_length-filtered.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9f877585b9786dc0db2a661c5b7711eb3db83b0d222b13b39cc21f8eca0944da
size 265593

60
config.json Normal file

@@ -0,0 +1,60 @@
{
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 6144,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 40960,
"max_window_layers": 28,
"model_type": "qwen3",
"num_attention_heads": 16,
"num_hidden_layers": 28,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.53.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

6
generation_config.json Normal file

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 151643,
"eos_token_id": 151645,
"transformers_version": "4.53.0"
}

151388
merges.txt Normal file

File diff suppressed because it is too large

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b2b2cfc1af4638fd11b9a727315771cc0265679e2043bbffcf1abd049068928
size 4063515640

3
qwen3-1.7_dataset.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b941c5c30943bb8fbe14ab853ee3a4146601ea5bd6ba33f509459665c34596c4
size 188822

3
qwen3-1.7b_loss.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8bfd8c7edd496e564464198d7f06ad20e75a9770b8d1f47724044460f0fc6f7e
size 90818

38
special_tokens_map.json Normal file

@@ -0,0 +1,38 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"sep_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654

240
tokenizer_config.json Normal file

@@ -0,0 +1,240 @@
{
"add_bos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151657": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151658": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151659": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151660": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151661": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151662": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151663": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151664": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151665": {
"content": "<tool_response>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151666": {
"content": "</tool_response>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151667": {
"content": "<think>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151668": {
"content": "</think>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"errors": "replace",
"extra_special_tokens": {},
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"sep_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long