Initialize project; model provided by the ModelHub XC community
Model: vngrs-ai/Kumru-2B Source: Original Platform
50
.gitattributes
vendored
Normal file
@@ -0,0 +1,50 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*.tfevents* filter=lfs diff=lfs merge=lfs -text
*.db* filter=lfs diff=lfs merge=lfs -text
*.ark* filter=lfs diff=lfs merge=lfs -text
**/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
**/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
**/*ckpt*.index filter=lfs diff=lfs merge=lfs -text

*.ckpt filter=lfs diff=lfs merge=lfs -text
*.gguf* filter=lfs diff=lfs merge=lfs -text
*.ggml filter=lfs diff=lfs merge=lfs -text
*.llamafile* filter=lfs diff=lfs merge=lfs -text
*.pt2 filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

tokenizer.json filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text
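As a rough illustration of what these rules do, the sketch below checks a few file names from this commit against a handful of the patterns above. It relies on Python's `fnmatch`, which only approximates gitattributes matching (directory-aware rules such as `saved_model/**/*` are left out), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the patterns listed in the .gitattributes above.
LFS_PATTERNS = ["*.bin", "*.pt", "*.gguf*", "*.zip", "tokenizer.json", "model.safetensors"]

def is_lfs_tracked(filename: str) -> bool:
    """Return True if the base file name matches any LFS pattern (approximation)."""
    return any(fnmatchcase(filename, pattern) for pattern in LFS_PATTERNS)

for name in ["model.safetensors", "tokenizer.json", "config.json", "README.md"]:
    print(name, is_lfs_tracked(name))
```

The file has no general `*.safetensors` rule; `model.safetensors` and `tokenizer.json` are tracked by name, which is why both appear as LFS pointers later in this commit.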
73
README.md
Normal file
@@ -0,0 +1,73 @@
---
library_name: transformers
license: apache-2.0
language:
- tr
datasets:
- vngrs-ai/vngrs-web-corpus
---

## Model Information
<img src="https://cdn-uploads.huggingface.co/production/uploads/6147363543eb04c443cd4e39/1X8noMmS6Mlvj4BalQkuZ.png" alt="preview" width="600"/>

Kumru-2B is the lightweight, open-source version of Kumru LLM, developed for Turkish from scratch by VNGRS.

- It is pre-trained on a cleaned, deduplicated 500 GB corpus for 300B tokens, and supervised fine-tuned on 1M examples.
- It comes with a modern tokenizer developed for Turkish that supports code, math and a chat template.
- Kumru has a native context length of 8,192 tokens.
- This is the **instruct fine-tuned** version.
- The pre-trained base version is [here](https://huggingface.co/vngrs-ai/Kumru-2B-Base).

Try the demo of the 7B version [here](https://kumru.ai/).

## Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "vngrs-ai/Kumru-2B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def generate_response(query):
    # Build the chat: a fixed Turkish system prompt followed by the user's query.
    messages = [
        {'role': 'system', 'content': 'Adın Kumru. VNGRS tarafından Türkçe için sıfırdan eğitilmiş bir dil modelisin.'},
        {'role': 'user', 'content': query}
    ]
    # Render the chat template to token ids and move them to the model's device.
    model_inputs = tokenizer.apply_chat_template(messages, return_tensors='pt', add_generation_prompt=True).to(model.device)
    model_outputs = model.generate(model_inputs, max_new_tokens=512, do_sample=True, top_p=0.9, temperature=0.7, repetition_penalty=1.1)
    output_tokens = model_outputs[0].cpu().detach().numpy().tolist()
    # Drop the prompt tokens so that only the newly generated text is decoded.
    generated_tokens = output_tokens[model_inputs[0].shape[0]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response

query = "Efes antik kentinin önemi nedir?"
response = generate_response(query)
print(response)
```

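To carry a conversation over several turns, the same `messages` structure can be extended with earlier exchanges before it is passed to `apply_chat_template`. The following is a minimal sketch, assuming the chat template accepts `assistant` turns (the usual convention, but not confirmed here); `build_messages` and its `history` argument are hypothetical names, and the earlier exchange is made-up illustrative data.

```python
SYSTEM_PROMPT = 'Adın Kumru. VNGRS tarafından Türkçe için sıfırdan eğitilmiş bir dil modelisin.'

def build_messages(history, query):
    """Build a chat from earlier (user_text, assistant_text) pairs plus a new query."""
    messages = [{'role': 'system', 'content': SYSTEM_PROMPT}]
    for user_text, assistant_text in history:
        messages.append({'role': 'user', 'content': user_text})
        messages.append({'role': 'assistant', 'content': assistant_text})
    messages.append({'role': 'user', 'content': query})
    return messages

# Illustrative earlier turn (not real model output).
history = [("Efes antik kenti nerede?", "Efes, İzmir'in Selçuk ilçesindedir.")]
messages = build_messages(history, "Efes antik kentinin önemi nedir?")
print([m['role'] for m in messages])
```

The resulting list can replace the hard-coded `messages` in `generate_response`; the rest of the generation code stays the same.
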
## Evaluation Results
Both Kumru-7B and Kumru-2B are evaluated on the Cetvel benchmark.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6147363543eb04c443cd4e39/eu2TuwVpLwRWAh3MjWc1v.png" alt="preview" width="750"/>

We observe that Kumru overall surpasses significantly larger models such as LLaMA-3.3-70B, Gemma-3-27B, Qwen-2-72B and Aya-32B. It excels at tasks related to the nuances of the Turkish language, such as grammatical error correction and text summarization.

## Tokenizer Efficiency
The Kumru tokenizer is a modern BPE tokenizer with a vocabulary size of 50,176, a pre-tokenization regex and a chat template.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6147363543eb04c443cd4e39/zz6E1kba8UCq9N7oMDnKB.png" alt="preview" width="750"/>

Other open-source models spend between 38% and 98% more tokens than Kumru on the same text, despite having larger vocabularies.
This means Kumru can fit more text into its context window and process it faster and more cheaply. Although the native context length of Kumru is 8,192 tokens, its effective context length can be considered to be between roughly 11,300 and 16,200 tokens (8,192 × 1.38 and 8,192 × 1.98) when measured in the tokens other multilingual models would need for the same text.
This shows the efficiency of a native Turkish tokenizer in terms of representation power, speed and cost.

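The 38% to 98% token overhead can be turned into an equivalent context length with a line of arithmetic. The following is a small worked example, assuming the two percentages apply uniformly to a full context window; `effective_context` is a hypothetical helper name.

```python
NATIVE_CONTEXT = 8192  # Kumru's native context length in tokens

def effective_context(extra_token_fraction: float) -> int:
    """Tokens another tokenizer would need for text that fills Kumru's context."""
    return round(NATIVE_CONTEXT * (1 + extra_token_fraction))

low = effective_context(0.38)   # tokenizer that needs 38% more tokens
high = effective_context(0.98)  # tokenizer that needs 98% more tokens
print(low, high)
```
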
## Citation
```
@misc{turker2025kumru,
  title={Kumru},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  year={2025},
  url={https://huggingface.co/vngrs-ai/Kumru-2B}
}
```
29
config.json
Normal file
@@ -0,0 +1,29 @@
{
  "_name_or_path": "vngrs-ai/Kumru-2B-v0.2.1",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "dtype": "bfloat16",
  "eos_token_id": 3,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 10752,
  "max_position_embeddings": 8192,
  "model_type": "mistral",
  "num_attention_heads": 16,
  "num_hidden_layers": 18,
  "num_key_value_heads": 4,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-05,
  "rope_theta": 500000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "vocab_size": 50176
}
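The fields in this config are enough to estimate the model's parameter count. The sketch below assumes the standard Mistral layout in `transformers`: bias-free attention and MLP projections, two RMSNorm weight vectors per layer, one final norm, and separate input and output embeddings because `tie_word_embeddings` is false. The variable names are illustrative.

```python
# Values copied from config.json.
hidden_size = 3072
intermediate_size = 10752
num_hidden_layers = 18
num_attention_heads = 16
num_key_value_heads = 4
head_dim = 128
vocab_size = 50176

# Attention: q and o projections use all heads, k and v use the grouped KV heads.
q_and_o = 2 * hidden_size * num_attention_heads * head_dim
k_and_v = 2 * hidden_size * num_key_value_heads * head_dim
# Gated MLP: gate, up and down projections.
mlp = 3 * hidden_size * intermediate_size
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = q_and_o + k_and_v + mlp + norms
# Untied input embedding and output head, plus the final RMSNorm.
embeddings = 2 * vocab_size * hidden_size
total_params = num_hidden_layers * per_layer + embeddings + hidden_size

print(f"{total_params:,} parameters")
print(f"{total_params * 2:,} bytes in bfloat16")
```

Under these assumptions the count comes to about 2.38B parameters, or roughly 4.75 GB at two bytes each, which is in line with the 4,750,295,696-byte size recorded in the `model.safetensors` pointer in this commit.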
1
configuration.json
Normal file
@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-generation", "allow_remote": true}
7
generation_config.json
Normal file
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "pad_token_id": 0,
  "transformers_version": "4.49.0"
}
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6486c82f0a9dfe34d55cd042f892d490c457e7e72e33d119e0360efbdc41cecf
size 4750295696
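What the repository stores for `model.safetensors` is a Git LFS pointer rather than the weights themselves. The following is a minimal sketch of reading such a pointer, assuming the three-line `key value` layout of this one; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:6486c82f0a9dfe34d55cd042f892d490c457e7e72e33d119e0360efbdc41cecf
size 4750295696
"""

def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of an LFS pointer into a dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

After downloading the real file, its SHA-256 digest and byte size can be compared against these two fields to confirm the download is intact.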
30
special_tokens_map.json
Normal file
@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<BOS>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<EOS>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<PAD>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<UNK>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
3
tokenizer.json
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1f00b5bbea83c47d7d126990e5213ad049cb2942c04b91cdea95d032a251dee6
size 3943435
2109
tokenizer_config.json
Normal file
File diff suppressed because it is too large