---
library_name: transformers
tags:
- trl
- cpt
datasets:
- canbingol/vngrs-web-corpus-200k
language:
- tr
- en
base_model:
- google/gemma-3-1b-pt
---
# Model Card: Gemma3-1B Turkish CPT (15K Subset, 3 Epoch)
## Overview
This model is a Turkish Continued Pretraining (CPT) variant of `google/gemma-3-1b-pt`.
The base model was further trained for **3 epochs** on the first **15,000 samples** of the Turkish web corpus `canbingol/vngrs-web-corpus-200k`, with the aim of improving Turkish language modeling and familiarity with Turkish web text.
This release is intended for **research and experimental use**.
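
Continued pretraining keeps the base model's causal language modeling objective (next-token prediction) and applies it to new plain text. The sketch below illustrates that objective in pure Python. It is not the training code used for this model: `next_token_examples`, `mean_next_token_loss`, and the `prob_of_next` callback are hypothetical names, and the constant-probability "model" is a toy stand-in for a real network.

```python
import math


def next_token_examples(token_ids):
    """Build (prefix, target) pairs: each token is predicted from all tokens before it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def mean_next_token_loss(token_ids, prob_of_next):
    """Average negative log-likelihood of each target token given its prefix.

    prob_of_next(prefix, target) is a hypothetical stand-in for the probability
    a language model would assign to `target` after seeing `prefix`.
    """
    examples = next_token_examples(token_ids)
    total = sum(-math.log(prob_of_next(prefix, target)) for prefix, target in examples)
    return total / len(examples)


# Toy token IDs (arbitrary integers, not real Gemma vocabulary entries).
toy_ids = [10, 11, 12, 13]
examples = next_token_examples(toy_ids)

# Toy "model": assigns probability 0.5 to every next token.
loss = mean_next_token_loss(toy_ids, lambda prefix, target: 0.5)

print(examples)
print(round(loss, 4))
```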

---
## Base Model
- `google/gemma-3-1b-pt`
---
## Training Setup
- Dataset: `canbingol/vngrs-web-corpus-200k`
- Subset Used: First 15,000 samples
- Training Objective: Continued Pretraining (Causal LM / Next-Token Prediction)
- Epochs: 3
- Data Regime: Plain text (no instruction formatting)
- Token Count: ~19.5M tokens (approximate)
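
The figures above allow a rough back-of-the-envelope check of the training scale. The card does not say whether the ~19.5M tokens cover a single pass over the 15,000 samples or all 3 epochs combined. The sketch below assumes a single pass, so its results are estimates under that assumption and not reported numbers. The helper names are hypothetical.

```python
# Figures taken from the Training Setup list.
NUM_SAMPLES = 15_000        # first 15,000 samples of the corpus
SUBSET_TOKENS = 19_500_000  # ~19.5M tokens (approximate)
EPOCHS = 3

# ASSUMPTION: SUBSET_TOKENS is the token count of ONE pass over the subset.
# If it is instead the total across all 3 epochs, divide these results by 3.


def avg_tokens_per_sample(total_tokens, num_samples):
    """Mean number of tokens per training sample."""
    return total_tokens / num_samples


def tokens_processed(total_tokens, epochs):
    """Total tokens seen during training, counting repeated passes."""
    return total_tokens * epochs


per_sample = avg_tokens_per_sample(SUBSET_TOKENS, NUM_SAMPLES)
seen = tokens_processed(SUBSET_TOKENS, EPOCHS)

print(per_sample)  # 1300.0 tokens per sample on average
print(seen)        # 58500000 tokens processed over 3 epochs
```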
---
## Citation
If you use this model, please cite the base model:
- Base: `google/gemma-3-1b-pt`
---
## Usage Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer, then move the model to the selected device.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = model.to(device)

# This is a base model trained on plain text, so give it text to continue
# instead of an instruction.
prompt = "Benim adım"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 50 new tokens using temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```