---
library_name: transformers
tags:
- trl
- cpt
datasets:
- canbingol/vngrs-web-corpus-200k
language:
- tr
- en
base_model:
- canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3
new_version: canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1
---

Model Card: Gemma-3-1B Turkish CPT (150K–200K Subset, 1 Epoch, Stage 4)

Overview

This model is the Stage 4 Turkish Continued Pretraining (CPT) variant of Gemma-3-1B.

Unlike Stage 1, which was initialized from google/gemma-3-1b-pt,
this model was initialized from:

  • canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3

Stage 4 continues domain adaptation by exposing the model to new data rather than repeating the same subset.

The model was trained for 1 epoch on samples 150,000 to 200,000 of the Turkish web corpus.

Importantly, this model is a direct continuation of Stage 3.
Cumulatively, it has therefore been trained on samples 0–200,000 of the corpus (Stage 1: 0–50K, Stage 2: 50K–100K, Stage 3: 100K–150K, Stage 4: 150K–200K).

This stage completes the first full epoch over the 200K-sample dataset, carried out as a sequence of four disjoint shards.
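For reference, the Stage 4 shard can be sliced out of the corpus with the datasets library. This is a minimal sketch; the split name "train" is an assumption, since the card does not specify it.

from datasets import load_dataset

# Load the full 200K-sample Turkish web corpus (the "train" split is assumed).
corpus = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")

# Stage 4's shard: samples 150,000-200,000, disjoint from Stages 1-3.
stage4_shard = corpus.select(range(150_000, 200_000))
print(len(stage4_shard))  # 50000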


Training Lineage

  • Stage 0: google/gemma-3-1b-pt
  • Stage 1: Samples 0–50,000 (1 epoch)
  • Stage 2: Samples 50,000–100,000 (1 epoch)
  • Stage 3: Samples 100,000–150,000 (1 epoch)
  • Stage 4 (this release): Samples 150,000–200,000 (1 epoch; completes the first full pass)

Cumulative data exposure: 0–200,000 samples

This represents sequential CPT across disjoint data shards.
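In code, the lineage maps each checkpoint to its shard. This is an illustrative sketch only: the Stage 1 and Stage 2 repository names are assumptions inferred from the naming pattern of the Stage 3 and Stage 4 releases.

# Illustrative lineage map. Stage 1 and Stage 2 repo ids are ASSUMED
# from the naming pattern of Stages 3 and 4.
LINEAGE = [
    ("google/gemma-3-1b-pt", None),                                        # Stage 0: base
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage1", (0, 50_000)),        # assumed id
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage2", (50_000, 100_000)),  # assumed id
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3", (100_000, 150_000)),
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4", (150_000, 200_000)),
]

for repo_id, shard in LINEAGE[1:]:
    start, end = shard
    print(f"{repo_id}: samples {start:,}-{end:,}")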


Training Setup

  • Dataset: canbingol/vngrs-web-corpus-200k
  • Subset Used: Samples 150,000–200,000
  • Initialization: Stage 3 checkpoint
  • Training Objective: Continued Pretraining
  • Epochs: 1
  • Data Regime: Plain text
  • Token Count: ~21.6M tokens
  • Cumulative Token Exposure (Stages 1–4): ~86.1M tokens (per-shard token counts vary slightly, so this is not exactly 4 × 21.6M)

Notes on cumulative exposure:

  • Although Stage 4 trains only on the 150K–200K shard, it inherits all adaptations learned from previous stages.
  • After this stage, the model has effectively completed exposure to the entire 0–200K dataset range through sequential continuation.
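The exact training configuration (optimizer, learning rate, batch size) was not published. As a rough sketch of what one CPT stage looks like with TRL, assuming the corpus exposes its raw text in a "text" column and using illustrative values for everything else:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Stage 4 shard: samples 150,000-200,000 ("train" split assumed).
dataset = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")
dataset = dataset.select(range(150_000, 200_000))

# One epoch of plain-text continued pretraining, initialized from Stage 3.
# dataset_text_field and output_dir are assumptions, not the released config.
config = SFTConfig(
    output_dir="gemma3_1B_base-tr-cpt-1epoch_stage4",
    num_train_epochs=1,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model="canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3",
    args=config,
    train_dataset=dataset,
)
trainer.train()

Passing the Stage 3 repo id as model makes SFTTrainer load that checkpoint internally, mirroring the "Initialization: Stage 3 checkpoint" line above.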

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Stage 4 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = model.to(device)

# Turkish prompt ("from now on"); this is a base model, so expect a free-form continuation.
prompt = "bundan böyle"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a short continuation.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)