Initialize project; model provided by the ModelHub XC community
Model: canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data Source: Original Platform
79
README.md
Normal file
@@ -0,0 +1,79 @@
---
library_name: transformers
tags:
- trl
- cpt
datasets:
- canbingol/vngrs-web-corpus-200k
language:
- tr
- en
base_model:
- google/gemma-3-1b-pt
---

# Model Card: Gemma3-1B Turkish CPT (15K Subset, 3 Epoch)

## Overview

This model is a Turkish continued-pretraining (CPT) variant of `google/gemma-3-1b-pt`.

The base model was further trained for **3 epochs** on the first **15,000 samples** of a Turkish web corpus to improve its Turkish language modeling and its familiarity with Turkish web text.
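
Continued pretraining keeps optimizing the same causal language modeling objective named under Training Setup: each position predicts the next token. The block below is a minimal, dependency-free sketch of that loss on toy numbers. The function name `next_token_loss` and the toy logits are illustrative assumptions, not code from the actual training run.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per input position.
    token_ids: the token ids of the same sequence.
    """
    total = 0.0
    # Position t predicts token t+1, so the last position has no target.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the true next token.
        total += log_norm - scores[target]
    return total / (len(token_ids) - 1)


# Toy example (hypothetical values): vocabulary of 3 tokens, sequence of 3 ids.
toy_logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
toy_ids = [1, 0, 2]
loss = next_token_loss(toy_logits, toy_ids)
print(f"loss = {loss:.4f}, perplexity = {math.exp(loss):.4f}")
```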

This release is intended for **research and experimental use**.

---

## Base Model

- `google/gemma-3-1b-pt`

---

## Training Setup

- Dataset: `canbingol/vngrs-web-corpus-200k`
- Subset Used: First 15,000 samples
- Training Objective: Continued Pretraining (Causal LM / Next-Token Prediction)
- Epochs: 3
- Data Regime: Plain text (no instruction formatting)
- Token Count (approximate): ~19.5M tokens
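
As a rough sanity check, the figures in the list above can be combined with simple arithmetic. The card does not say whether the ~19.5M token count covers a single pass over the subset or all three epochs, so the per-pass reading below is an assumption and is labeled as such in the code.

```python
# Figures quoted in this model card.
num_samples = 15_000
approx_tokens = 19_500_000  # "~19.5M tokens", approximate
num_epochs = 3

# Average sample length implied by the two figures above.
tokens_per_sample = approx_tokens / num_samples

# ASSUMPTION: the ~19.5M figure describes one pass over the subset.
# If it already covers all 3 epochs, divide by num_epochs instead of multiplying.
tokens_seen_if_per_pass = approx_tokens * num_epochs

print(f"average tokens per sample: {tokens_per_sample:,.0f}")
print(f"tokens seen over {num_epochs} epochs (if per-pass): {tokens_seen_if_per_pass:,}")
```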

---

## Citation

If you use this model, please cite the base model:

- Base: `google/gemma-3-1b-pt`

---

## Usage Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-3epoch_15k_data"

# Use a GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = model.to(device)

# Plain-text prompt ("My name is"); this is a base model, so no chat template is applied.
prompt = "Benim adım"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation with temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

# Decode the prompt together with the generated continuation.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```