ModelHub XC 14b62b4733 Initialize project; model provided by the ModelHub XC community
Model: ai-forever/mGPT-13B
Source: Original Platform
2026-06-13 14:52:24 +08:00

license: mit
language: ar, he, vi, id, jv, ms, tl, lv, lt, eu, ml, ta, te, hy, bn, mr, hi, ur, af, da, en, de, sv, fr, it, pt, ro, es, el, os, tg, fa, ja, ka, ko, th, bxr, xal, mn, sw, yo, be, bg, ru, uk, pl, my, uz, ba, kk, ky, tt, az, cv, tr, tk, tyv, sax, et, fi, hu
tags: multilingual, PyTorch, Transformers, gpt3, gpt2, transformers

🌻 mGPT 13B

Multilingual language model. This model was trained on 61 languages from 25 language families (see the list below).

Dataset

The model was pretrained on 600 GB of text, mostly from mC4 and Wikipedia. The training data was deduplicated: each text in the corpus is hashed with a 64-bit hash, and only texts with a unique hash are kept. We also filter documents by their text compression rate, measured with zlib. The most strongly and the most weakly compressing deduplicated texts are discarded.
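The two cleaning steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the hash function (the first 8 bytes of SHA-256), the function names, and the `low`/`high` thresholds are all assumptions, since the model card does not specify them.

```python
import hashlib
import zlib


def hash64(text: str) -> int:
    """Return a 64-bit hash of the text (first 8 bytes of SHA-256; an assumed choice)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def deduplicate(texts):
    """Keep the first text seen for each unique 64-bit hash."""
    seen = set()
    unique = []
    for text in texts:
        h = hash64(text)
        if h not in seen:
            seen.add(h)
            unique.append(text)
    return unique


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size (higher means more compressible)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def filter_by_compression(texts, low=1.2, high=8.0):
    """Drop texts whose ratio falls outside [low, high]; both thresholds are hypothetical."""
    return [t for t in texts if low <= compression_ratio(t) <= high]
```

Highly repetitive texts compress very well and land above `high`, while very short or noisy texts barely compress and land below `low`, so both tails are dropped.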

Figure: number of tokens for each language in the pretraining corpus, on a logarithmic scale.

Languages

Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tagalog (tl), Tajik (tg), Tamil (ta), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Yakut (sax), Yoruba (yo)

By language family

| Language Family | Languages |
| --- | --- |
| Afro-Asiatic | Arabic (ar), Hebrew (he) |
| Austro-Asiatic | Vietnamese (vi) |
| Austronesian | Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl) |
| Baltic | Latvian (lv), Lithuanian (lt) |
| Basque | Basque (eu) |
| Dravidian | Malayalam (ml), Tamil (ta), Telugu (te) |
| Indo-European (Armenian) | Armenian (hy) |
| Indo-European (Indo-Aryan) | Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur) |
| Indo-European (Germanic) | Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv) |
| Indo-European (Romance) | French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es) |
| Indo-European (Greek) | Greek (el) |
| Indo-European (Iranian) | Ossetian (os), Tajik (tg), Persian (fa) |
| Japonic | Japanese (ja) |
| Kartvelian | Georgian (ka) |
| Koreanic | Korean (ko) |
| Kra-Dai | Thai (th) |
| Mongolic | Buryat (bxr), Kalmyk (xal), Mongolian (mn) |
| Niger-Congo | Swahili (sw), Yoruba (yo) |
| Slavic | Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl) |
| Sino-Tibetan | Burmese (my) |
| Turkic (Karluk) | Uzbek (uz) |
| Turkic (Kipchak) | Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt) |
| Turkic (Oghuz) | Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk) |
| Turkic (Siberian) | Tuvan (tyv), Yakut (sax) |
| Uralic | Estonian (et), Finnish (fi), Hungarian (hu) |

Technical details

The models are pretrained on 16 V100 GPUs for 600k training steps with a fixed set of hyperparameters: a vocabulary size of 100k, a context window of 2048 tokens, a learning rate of 2e-4, and a batch size of 4.

The mGPT architecture is based on GPT-3. We use the architecture description by Brown et al. (2020), the GPT-2 code base (Radford et al., 2019) in the HuggingFace library (Wolf et al., 2020), and Megatron-LM (Shoeybi et al., 2019).

Perplexity

The mGPT-13B model achieves its best perplexities, in the 2-to-10 range, for the majority of languages, including Dravidian (Malayalam, Tamil, Telugu), Indo-Aryan (Bengali, Hindi, Marathi), Slavic (Belarusian, Ukrainian, Russian, Bulgarian), Sino-Tibetan (Burmese), Kipchak (Bashkir, Kazakh) and others. Only seven languages, from different families, have higher perplexities, up to 20.
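For readers unfamiliar with the metric, perplexity is commonly defined as the exponential of the mean negative log-likelihood per token. The sketch below uses that standard definition; the function name `perplexity` and its input format (a list of natural-log token probabilities) are assumptions for illustration, and the model card does not describe the exact evaluation procedure.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Under this definition, a model that assigns every token a probability of 1/4 has a perplexity of 4, so lower values mean the model is less "surprised" by the text.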

Language-wise perplexity results

Family-wise perplexity results

The scores are averaged over the number of languages within each family.
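The family-wise averaging can be sketched as a plain arithmetic mean of per-language perplexities within each family. The function name `family_average` and the input dictionaries are hypothetical, and any numbers passed in the usage note are placeholders, not reported scores.

```python
from collections import defaultdict


def family_average(perplexity_by_language, family_by_language):
    """Average per-language perplexities over the languages of each family."""
    grouped = defaultdict(list)
    for lang, ppl in perplexity_by_language.items():
        grouped[family_by_language[lang]].append(ppl)
    # Divide each family's total by the number of its languages.
    return {family: sum(vals) / len(vals) for family, vals in grouped.items()}
```

For example, with placeholder values `{"lv": 4.0, "lt": 6.0}` and both codes mapped to `"Baltic"`, the function returns `{"Baltic": 5.0}`.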
