---
library_name: transformers
tags:
- trl
- cpt
datasets:
- canbingol/vngrs-web-corpus-200k
language:
- tr
- en
base_model:
- google/gemma-3-1b-pt
---

# Model Card: Gemma3-1B Turkish CPT (15K Subset, 3 Epochs)

## Overview

This model is a Turkish Continued Pretraining (CPT) variant of `google/gemma-3-1b-pt`. The base model was further trained for **3 epochs** on the first **15,000 samples** of a Turkish web corpus to improve its Turkish language modeling capability and domain familiarity.

This release is intended for **research and experimental use**.

---

## Base Model

- `google/gemma-3-1b-pt`

---

## Training Setup

- Dataset: `canbingol/vngrs-web-corpus-200k`
- Subset Used: First 15,000 samples
- Training Objective: Continued Pretraining (Causal LM / Next-Token Prediction)
- Epochs: 3
- Data Regime: Plain text (no instruction formatting)
- Token Count (approximate): ~19.5M tokens

---

## Citation

If you use this model, please cite the base model:

- Base: `google/gemma-3-1b-pt`

---

## Usage Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer, then move the model to the selected device.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = model.to(device)

# This is a base (non-instruction-tuned) model, so prompt it with plain text to continue.
prompt = "Benim adım"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation of up to 50 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
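The approximate token count listed under Training Setup can be sanity-checked with a short script. The following is a minimal sketch, not the procedure actually used to produce the ~19.5M figure. It makes several assumptions that this card does not confirm: that the dataset exposes a `train` split with a `text` column, that the released tokenizer is unchanged from the base model, and that the ~19.5M figure refers to a single pass over the 15,000-sample subset. The helper names (`take_first`, `count_tokens`, `training_budget`, `estimate_subset_tokens`) are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, List, Sequence, Tuple


def take_first(samples: Iterable[str], n: int) -> List[str]:
    """Return the first n items of an iterable (fewer if it is shorter)."""
    return list(islice(samples, n))


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> int:
    """Sum the number of tokens that `encode` produces for each text."""
    return sum(len(encode(text)) for text in texts)


def training_budget(tokens_per_epoch: int, epochs: int, n_samples: int) -> Tuple[int, float]:
    """Return (total tokens seen over all epochs, average tokens per sample)."""
    return tokens_per_epoch * epochs, tokens_per_epoch / n_samples


def estimate_subset_tokens(n_samples: int = 15_000) -> int:
    """Hypothetical helper: requires network access, `datasets`, and `transformers`.

    Assumes a "train" split and a "text" column, neither of which is
    confirmed by this model card.
    """
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data"
    )
    # Stream the dataset so only the first n_samples rows are fetched.
    stream = load_dataset(
        "canbingol/vngrs-web-corpus-200k", split="train", streaming=True
    )
    texts = take_first((row["text"] for row in stream), n_samples)
    return count_tokens(
        texts, lambda t: tokenizer(t, add_special_tokens=False)["input_ids"]
    )


# Arithmetic on the figures stated in this card (no download needed).
total_seen, avg_per_sample = training_budget(19_500_000, epochs=3, n_samples=15_000)
print(f"tokens seen over 3 epochs: {total_seen:,}")
print(f"average tokens per sample: {avg_per_sample:,.0f}")
```

Under the single-pass assumption, three epochs correspond to roughly 58.5M tokens seen during training, at about 1,300 tokens per sample on average. Calling `estimate_subset_tokens()` downloads the tokenizer and streams the dataset, so its result depends on the assumptions listed above and may differ from the reported figure if, for example, special tokens or truncation were handled differently during training.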