Model: canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3 Source: Original Platform
| library_name | tags | datasets | language | base_model | new_version |
|---|---|---|---|---|---|
| transformers | | | | | canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1 |
Model Card: Gemma3-1B Turkish CPT (100K–150K Subset, 1 Epoch – Stage 3)
Overview
This model is the Stage 3 Turkish Continued Pretraining (CPT) variant of Gemma-3-1B.
Unlike Stage 1, which was initialized from google/gemma-3-1b-pt,
this model was initialized from:
canbingol/gemma3_1B_base-tr-cpt-1epoch_stage2
Stage 3 continues domain adaptation by exposing the model to new data rather than repeating the same subset.
The model was trained for 1 epoch on samples 100,000 to 150,000 of the Turkish web corpus.
Because this model directly continues Stage 2, it has cumulatively been trained on samples 0–150,000 of the corpus (Stage 1: 0–50K, Stage 2: 50K–100K, Stage 3: 100K–150K).
Training Lineage
- Stage 0: google/gemma-3-1b-pt
- Stage 1: Samples 0–50,000 (1 epoch)
- Stage 2: Samples 50,000–100,000 (1 epoch)
- Stage 3 (this release): Samples 100,000–150,000 (1 epoch)
Cumulative data exposure: 0–150,000 samples
This represents sequential CPT across disjoint data shards.
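The sequential, disjoint sharding described above can be sketched as follows. This is a minimal illustration, not the author's training script. The helpers `stage_bounds`, `stage_split`, and `load_stage_shard` are hypothetical, and two things are assumed rather than stated in the card: that each stage covers a half-open range of 50,000 samples, and that the dataset exposes a `train` split.

```python
from typing import Tuple

SHARD_SIZE = 50_000  # samples per stage, per the lineage above


def stage_bounds(stage: int, shard_size: int = SHARD_SIZE) -> Tuple[int, int]:
    """Return the half-open [start, end) sample range for a 1-indexed CPT stage."""
    if stage < 1:
        raise ValueError("Stage 0 is the base checkpoint and has no data shard")
    start = (stage - 1) * shard_size
    return start, start + shard_size


def stage_split(stage: int, split: str = "train") -> str:
    """Build a slice expression such as 'train[100000:150000]'.

    The split name 'train' is an assumption; check the dataset card.
    """
    start, end = stage_bounds(stage)
    return f"{split}[{start}:{end}]"


def load_stage_shard(stage: int):
    """Load one shard with the `datasets` library (downloads data when called)."""
    from datasets import load_dataset  # lazy import; needs `pip install datasets`

    return load_dataset("canbingol/vngrs-web-corpus-200k", split=stage_split(stage))


# Print the slice expression for each stage; no data is downloaded here.
for stage in (1, 2, 3):
    print(stage, stage_split(stage))
```

Half-open ranges make each stage end where the next one starts, so no sample is counted twice across stages.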
Training Setup
- Dataset: canbingol/vngrs-web-corpus-200k
- Subset Used: Samples 100,000–150,000
- Initialization: Stage 2 checkpoint
- Training Objective: Continued Pretraining
- Epochs: 1
- Data Regime: Plain text
- Token Count: ~21.5M tokens
- Cumulative Token Exposure (Stage 1 + Stage 2 + Stage 3): ~64.5M tokens
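The token figures above imply some simple arithmetic, sketched below. Only the Stage 3 count (~21.5M) and the cumulative total (~64.5M) come from the card. The per-sample average and the combined Stage 1 + Stage 2 figure are derived here, and they are only as precise as the rounded inputs.

```python
STAGE3_TOKENS = 21.5e6      # from the card (approximate)
CUMULATIVE_TOKENS = 64.5e6  # from the card (approximate)
STAGE3_SAMPLES = 50_000     # samples 100,000-150,000

# Derived: average tokens per sample in the Stage 3 shard
avg_tokens_per_sample = STAGE3_TOKENS / STAGE3_SAMPLES

# Derived: tokens seen in Stage 1 and Stage 2 combined
earlier_tokens = CUMULATIVE_TOKENS - STAGE3_TOKENS

print(f"avg tokens/sample (Stage 3): {avg_tokens_per_sample:.0f}")
print(f"Stage 1 + Stage 2 tokens:    {earlier_tokens / 1e6:.1f}M")
```

The derived values suggest that the two earlier shards averaged about the same token count as Stage 3, though the card does not report per-stage counts for them.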
Usage Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint and its tokenizer, then move the model to the device
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = model.to(device)

# Turkish prompt for the base model to continue ("from now on")
prompt = "bundan böyle"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 50 new tokens with temperature and nucleus (top-p) sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Description