| library_name | tags | datasets | language | base_model |
|---|---|---|---|---|
| transformers | | | | |
Model Card: Gemma3-1B Turkish CPT (15K Subset, 3 Epoch)
Overview
This model is a Turkish Continued Pretraining (CPT) variant of google/gemma-3-1b-pt.
The base model was further trained for 3 epochs on the first 15,000 samples of a Turkish web corpus to improve Turkish language modeling capability and domain familiarity.
This release is intended for research and experimental use.
Base Model
google/gemma-3-1b-pt
Training Setup
- Dataset: canbingol/vngrs-web-corpus-200k
- Subset Used: First 15,000 samples
- Training Objective: Continued Pretraining (Causal LM / Next-Token Prediction)
- Epochs: 3
- Data Regime: Plain text (no instruction formatting)
- Token Count: approximately 19.5M tokens
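The setup above (a leading subset of plain-text samples trained with a next-token objective) can be illustrated with a minimal, dependency-free sketch. This is not the author's training pipeline: the helper names `take_first` and `pack_into_blocks`, the toy token IDs, the `eos_id` value, and the block size are all hypothetical and chosen only for illustration. A real run would tokenize with the Gemma tokenizer and use a training framework, neither of which is described in this card.

```python
from itertools import islice
from typing import Iterable, List, Tuple

Block = Tuple[List[int], List[int]]


def take_first(samples: Iterable[List[int]], n: int) -> List[List[int]]:
    """Keep only the first n samples, mirroring a 'first 15,000 samples' subset."""
    return list(islice(samples, n))


def pack_into_blocks(docs: Iterable[List[int]], block_size: int, eos_id: int) -> List[Block]:
    """Concatenate tokenized documents (separated by eos_id) and cut the stream
    into (inputs, targets) pairs where targets are inputs shifted by one token."""
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)  # document boundary marker (hypothetical ID)

    blocks: List[Block] = []
    # Each window needs block_size + 1 tokens so every input has a next-token target.
    for start in range(0, len(stream) - block_size, block_size):
        window = stream[start:start + block_size + 1]
        blocks.append((window[:-1], window[1:]))
    return blocks


# Toy "corpus" of already-tokenized documents; the IDs are made up.
corpus = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
subset = take_first(corpus, 2)
for inputs, targets in pack_into_blocks(subset, block_size=3, eos_id=0):
    print(inputs, "->", targets)
```

Each printed pair shows the plain-text regime in miniature: there is no prompt or response template, only a token stream in which every position is trained to predict the token that follows it.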
Citation
If you use this model, please cite the base model, google/gemma-3-1b-pt.
Usage Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint and its tokenizer, then move the model to the selected device.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = model.to(device)

# Plain-text prompt (Turkish for "My name is"); the model continues it rather than following instructions.
prompt = "Benim adım"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 50 new tokens using temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)