Model: pkupie/gemma-3-4b-kk-cpt
| license | datasets | language | base_model | pipeline_tag |
|---|---|---|---|---|
| gemma | | kk | | text-generation |
# Gemma 3 PT 4B Continually Pretrained on Kazakh (Arabic Script)
This model is a continual pretraining (CPT) checkpoint obtained by further pretraining Gemma 3 PT 4B on the Kazakh (Arabic script) portion of the MC^2 Corpus. It is intended to improve Kazakh (Arabic script) language modeling and to support research on low-resource language adaptation.
Training details and methodology are described in: "Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion" (ACL 2026).
## Training Data
- Corpus: Kazakh (Arabic script) subset of the MC^2 Corpus
- Language: Kazakh (`kk`, Arabic script)
- Training paradigm: continual pretraining (CPT) starting from Gemma 3 PT 4B
## Intended Use
This checkpoint is released primarily for research purposes. Researchers are welcome to use this CPT checkpoint as a base model for future work, particularly in model merging and logit fusion.
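As a minimal sketch of using the checkpoint as a base model, the snippet below loads it with the Hugging Face `transformers` library and generates a short continuation. Assumptions not in the card: `transformers` version 4.50 or later (which adds Gemma 3 support), access to the gated Gemma weights on the Hub, and an illustrative Kazakh Arabic-script prompt.

```python
# Minimal text-generation sketch for this CPT checkpoint.
# Assumes transformers >= 4.50 and that you have accepted the Gemma license
# on Hugging Face; the prompt is an illustrative placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pkupie/gemma-3-4b-kk-cpt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Encode a Kazakh (Arabic script) prompt and sample a short continuation.
inputs = tokenizer("قازاق تىلى", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For downstream work such as model merging or logit fusion, the same `from_pretrained` call gives direct access to the model's weights and output logits.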
## Citation
If you use this model, please cite:
```bibtex
@article{zhang2026efficient,
  title={Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion},
  author={Zhang, Chen and Lin, Jiuheng and Liao, Zhiyuan and Feng, Yansong},
  journal={arXiv preprint arXiv:2604.18106},
  year={2026}
}
```