Create README.md

2024-04-04 10:40:10 +00:00
parent 25d58f3a97
commit 97122e6e18
1 changed files with 62 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,62 @@
+---
+license: apache-2.0
+language:
+- tr
+---
+
+# Turkcell-LLM-7b-v1
+
+This model is an extended version of a Mistral-based Large Language Model (LLM) for Turkish. It was trained on a cleaned Turkish raw dataset containing 5 billion tokens. The training process involved using the DORA method followed by fine-tuning with the LORA method.
+
+## Model Details
+
+- **Base Model**: Mistral 7B based LLM
+- **Tokenizer Extension**: Specifically extended for Turkish
+- **Training Dataset**: Cleaned Turkish raw data with 5 billion tokens
+- **Training Method**: Initially with DORA, followed by fine-tuning with LORA
+
+### DORA Configuration
+
+- `lora_alpha`: 128
+- `lora_dropout`: 0.05
+- `r`: 64
+- `target_modules`: "all-linear"
+
+
+### LORA Fine-Tuning Configuration
+
+- `lora_alpha`: 128
+- `lora_dropout`: 0.05
+- `r`: 256
+- `target_modules`: "all-linear"
+
+## Usage Examples
+
+```python
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+device = "cuda" # the device to load the model onto
+
+model = AutoModelForCausalLM.from_pretrained("TURKCELL/Turkcell-LLM-7b-v1")
+tokenizer = AutoTokenizer.from_pretrained("TURKCELL/Turkcell-LLM-7b-v1")
+
+messages = [
+    {"role": "user", "content": "Türkiye'nin başkenti neresidir?"},
+]
+
+encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
+
+eos_token = tokenizer("<|im_end|>",add_special_tokens=False)["input_ids"][0]
+
+model_inputs = encodeds.to(device)
+model.to(device)
+
+generated_ids = model.generate(model_inputs, 
+                               max_new_tokens=1024, 
+                               do_sample=True, 
+                               eos_token_id=eos_token)
+                               
+decoded = tokenizer.batch_decode(generated_ids)
+print(decoded[0])
+