Model: canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data

  • library_name: transformers
  • tags: trl, cpt
  • datasets: canbingol/vngrs-web-corpus-200k
  • language: tr, en
  • base_model: google/gemma-3-1b-pt

Model Card: Gemma3-1B Turkish CPT (15K Subset, 3 Epochs)

Overview

This model is a Turkish Continued Pretraining (CPT) variant of google/gemma-3-1b-pt.

The base model was further trained for 3 epochs on the first 15,000 samples of the canbingol/vngrs-web-corpus-200k Turkish web corpus to improve its Turkish language modeling and its familiarity with Turkish web text.

This release is intended for research and experimental use.


Base Model

  • google/gemma-3-1b-pt

Training Setup

  • Dataset: canbingol/vngrs-web-corpus-200k
  • Subset Used: First 15,000 samples
  • Training Objective: Continued Pretraining (Causal LM / Next-Token Prediction)
  • Epochs: 3
  • Data Regime: Plain text (no instruction formatting)
  • Token Count (approximate): 19.5M tokens
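
The training objective listed above (causal LM / next-token prediction on plain text) can be illustrated with a minimal, dependency-free sketch. This is not the training code used for this model; the toy vocabulary, the token ids, and the helper names `next_token_pairs` and `causal_lm_loss` are assumptions made up for illustration.

```python
import math


def next_token_pairs(token_ids):
    """Pair each token with the token that follows it (shift by one)."""
    return list(zip(token_ids[:-1], token_ids[1:]))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t.

    The final position has no following token, so its logits are ignored.
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example (hypothetical): a vocabulary of 4 token ids and a 4-token sequence.
token_ids = [2, 0, 3, 1]

# A "model" that is maximally unsure assigns equal logits to every token.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]

loss = causal_lm_loss(uniform_logits, token_ids)
perplexity = math.exp(loss)

print(next_token_pairs(token_ids))
print(loss, perplexity)
```

With uniform predictions over 4 tokens the loss equals ln 4 and the perplexity equals the vocabulary size; continued pretraining aims to push this loss down on Turkish text. In a plain-text regime, every token of the document contributes to the loss, since there is no prompt/response split to mask.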

Citation

If you use this model, please cite the base model:

  • Base: google/gemma-3-1b-pt

Usage Example

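import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data"

# Use a GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model weights and the matching tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = model.to(device)

# The model was trained on plain text, so the prompt is text to be continued,
# not an instruction.
prompt = "Benim adım"  # Turkish for "My name is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 50 new tokens using temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9
)

# Decode the full sequence (prompt plus continuation) back to text.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)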
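
To give a rough idea of what the `temperature=0.8` and `top_p=0.9` settings in the usage example do, here is a simplified, dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration only, not the Transformers implementation; the function names and the toy logits are assumptions.

```python
import math
import random


def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token_id: probability} for the nucleus after temperature scaling."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Toy logits over a hypothetical 4-token vocabulary.
toy_logits = [4.0, 3.0, 1.0, -2.0]
nucleus = top_p_filter(toy_logits)
next_id = sample(nucleus, random.Random(0))

print(sorted(nucleus))
print(next_id)
```

For these toy logits only the two most likely tokens survive the filter, so the unlikely tail can never be sampled. Lower `top_p` or `temperature` values make generations more conservative; higher values make them more varied.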