---
library_name: transformers
tags:
- trl
- cpt
datasets:
- canbingol/vngrs-web-corpus-200k
language:
- tr
- en
base_model:
- canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3
new_version: canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1
---

Model Card: Gemma-3-1B Turkish CPT (150K–200K Subset, 1 Epoch, Stage 4)

Overview

This model is the Stage 4 Turkish Continued Pretraining (CPT) variant of Gemma-3-1B.

Unlike Stage 1, which was initialized from google/gemma-3-1b-pt,
this model was initialized from:

  • canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3

Stage 4 continues domain adaptation by exposing the model to new data rather than repeating the same subset.

The model was trained for 1 epoch on samples 150,000 to 200,000 of the Turkish web corpus.

Importantly, this model is a direct continuation of Stage 3.
Cumulatively, it has therefore been trained on samples 0–200,000 of the corpus (Stage 1: 0–50K, Stage 2: 50K–100K, Stage 3: 100K–150K, Stage 4: 150K–200K).

This stage completes the first full epoch over the 200K-sample dataset, carried out as a sequence of four disjoint shards.
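For reference, the Stage 4 shard can be sliced out of the corpus with the datasets library. This is a minimal sketch; the split name "train" is an assumption, since the card does not specify it.

from datasets import load_dataset

# Load the full 200K-sample Turkish web corpus (the "train" split is assumed).
corpus = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")

# Stage 4's shard: samples 150,000-200,000, disjoint from Stages 1-3.
stage4_shard = corpus.select(range(150_000, 200_000))
print(len(stage4_shard))  # 50000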


Training Lineage

  • Stage 0: google/gemma-3-1b-pt
  • Stage 1: Samples 0–50,000 (1 epoch)
  • Stage 2: Samples 50,000–100,000 (1 epoch)
  • Stage 3: Samples 100,000–150,000 (1 epoch)
  • Stage 4 (this release): Samples 150,000–200,000 (1 epoch; completes the first full pass)

Cumulative data exposure: 0–200,000 samples

This represents sequential CPT across disjoint data shards.
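In code, the lineage maps each checkpoint to its shard. This is an illustrative sketch only: the Stage 1 and Stage 2 repository names are assumptions inferred from the naming pattern of the Stage 3 and Stage 4 releases.

# Illustrative lineage map. Stage 1 and Stage 2 repo ids are ASSUMED
# from the naming pattern of Stages 3 and 4.
LINEAGE = [
    ("google/gemma-3-1b-pt", None),                                        # Stage 0: base
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage1", (0, 50_000)),        # assumed id
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage2", (50_000, 100_000)),  # assumed id
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3", (100_000, 150_000)),
    ("canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4", (150_000, 200_000)),
]

for repo_id, shard in LINEAGE[1:]:
    start, end = shard
    print(f"{repo_id}: samples {start:,}-{end:,}")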


Training Setup

  • Dataset: canbingol/vngrs-web-corpus-200k
  • Subset Used: Samples 150,000–200,000
  • Initialization: Stage 3 checkpoint
  • Training Objective: Continued Pretraining
  • Epochs: 1
  • Data Regime: Plain text
  • Token Count: ~21.6M tokens
  • Cumulative Token Exposure (Stages 1–4): ~86.1M tokens (per-shard token counts vary slightly, so this is not exactly 4 × 21.6M)

Notes on cumulative exposure:

  • Although Stage 4 trains only on the 150K–200K shard, it inherits all adaptations learned from previous stages.
  • After this stage, the model has effectively completed exposure to the entire 0–200K dataset range through sequential continuation.
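The exact training configuration (optimizer, learning rate, batch size) was not published. As a rough sketch of what one CPT stage looks like with TRL, assuming the corpus exposes its raw text in a "text" column and using illustrative values for everything else:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Stage 4 shard: samples 150,000-200,000 ("train" split assumed).
dataset = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")
dataset = dataset.select(range(150_000, 200_000))

# One epoch of plain-text continued pretraining, initialized from Stage 3.
# dataset_text_field and output_dir are assumptions, not the released config.
config = SFTConfig(
    output_dir="gemma3_1B_base-tr-cpt-1epoch_stage4",
    num_train_epochs=1,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model="canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3",
    args=config,
    train_dataset=dataset,
)
trainer.train()

Passing the Stage 3 repo id as model makes SFTTrainer load that checkpoint internally, mirroring the "Initialization: Stage 3 checkpoint" line above.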

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Stage 4 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = model.to(device)

# Turkish prompt ("from now on"); this is a base model, so expect a free-form continuation.
prompt = "bundan böyle"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a short continuation.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)