Initialize project; model provided by the ModelHub XC community

Model: chronopt-research/vietnamese-gpt2-medium
Source: Original Platform
ModelHub XC
2026-06-08 22:06:41 +08:00
commit 85b49f2cca
24 changed files with 150546 additions and 0 deletions

35
.gitattributes vendored Normal file

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

59
README.md Normal file

@@ -0,0 +1,59 @@
---
license: apache-2.0
datasets:
- duongttr/vi-dataset-for-pretrain
language:
- vi
metrics:
- perplexity
pipeline_tag: text-generation
widget:
- text: Việt Nam là quốc gia có
- text: Hoàng Sa, Trường Sa là của
model-index:
- name: chronopt-research/vietnamese-gpt2-medium
results:
- task:
type: text-generation
metrics:
- type: perplexity
value: 17.5948
verified: true
---
# Vietnamese `gpt2-medium`
This is a `gpt2-medium` model pretrained on Vietnamese text with a causal language modeling (CLM) objective. The GPT-2 architecture was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).
## Model Description
GPT-2 was originally a transformer model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labelling of any kind (which is why it can use lots of publicly available data); an automatic process generates the inputs and labels from the text. More precisely, it was trained to guess the next word in sentences.
This is the **medium version** of GPT-2, with about 355M parameters.
Other pretrained versions are available here: [gpt2-base](https://huggingface.co/chronopt-research/vietnamese-gpt2-base), [gpt2-large]()
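
The model size can be estimated from the architecture values in `config.json`. The sketch below assumes the standard GPT-2 layout: token and position embeddings, an output head tied to the token embeddings, and an MLP four times wider than the hidden size (which is what `"n_inner": null` defaults to). The function name `gpt2_param_count` is only illustrative.

```python
# Architecture values copied from config.json in this repository.
CONFIG = {"vocab_size": 50257, "n_positions": 1024, "n_embd": 1024, "n_layer": 24}


def gpt2_param_count(cfg: dict) -> int:
    """Count the weights of a standard GPT-2 LM with tied input/output embeddings."""
    d = cfg["n_embd"]
    token_emb = cfg["vocab_size"] * d
    pos_emb = cfg["n_positions"] * d
    # Per transformer block:
    #   two LayerNorms            -> 4d
    #   attention (QKV + output)  -> 4d^2 + 4d
    #   MLP with 4d hidden units  -> 8d^2 + 5d
    per_block = 12 * d * d + 13 * d
    final_layer_norm = 2 * d
    return token_emb + pos_emb + cfg["n_layer"] * per_block + final_layer_norm


total = gpt2_param_count(CONFIG)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Multiplying the result by 4 bytes per `float32` weight gives a figure close to the sizes of the weight files in this repository.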
## Dataset used for pretraining
The pretraining corpus is a combination of multiple Vietnamese datasets for pretraining CLMs such as GPT and GPT-2.
The dataset consists of:
- [`vietgpt/covid_19_news_vi`](https://huggingface.co/datasets/vietgpt/covid_19_news_vi)
- [`hieunguyen1053/binhvq-news-corpus`](https://huggingface.co/datasets/hieunguyen1053/binhvq-news-corpus)
- [`oscar (unshuffled_deduplicated_vi)`](https://huggingface.co/datasets/oscar)
- [`vietgpt/wikipedia_vi`](https://huggingface.co/datasets/vietgpt/wikipedia_vi)
The combined version is available here: [duongttr/vi-dataset-for-pretrain](https://huggingface.co/datasets/duongttr/vi-dataset-for-pretrain)
## Hyperparameters & Results
We trained the model for ~100k steps with `lr=1e-4`, `bs=1920`, and `optimizer=adamw` on a TPU-VM-3.8 from the [TRC Program](https://sites.research.google/trc/about/). Training took around **2.5 days**.
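
For a sense of scale, the figures above can be turned into rough throughput numbers. This is back-of-the-envelope arithmetic only: it treats the step count and duration as exact, and it assumes that `bs` counts sequences per optimizer step, which the text does not state.

```python
# Figures quoted in the paragraph above (both are approximate in the source).
STEPS = 100_000
BATCH_SIZE = 1920      # assumed to be sequences per optimizer step
TRAINING_DAYS = 2.5

seconds = TRAINING_DAYS * 24 * 60 * 60
seconds_per_step = seconds / STEPS
sequences_seen = STEPS * BATCH_SIZE
sequences_per_second = sequences_seen / seconds

print(f"{seconds_per_step:.2f} s per step")
print(f"{sequences_seen:,} sequences seen in total")
print(f"{sequences_per_second:.0f} sequences per second")
```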
|Model|Eval Loss|Eval Perplexity|
|---|---|---|
|gpt2-base|3.939|51.35|
|**gpt2-medium**|**2.8676**|**17.5948**|
|gpt2-large|-|-|
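
The two metric columns are related: perplexity is the exponential of the evaluation loss, assuming the loss is the mean per-token cross-entropy in nats (the usual convention for this kind of training, though not stated here). A minimal check of the reported values:

```python
import math

# (eval loss, reported perplexity) pairs taken from the table above.
REPORTED = {
    "gpt2-base": (3.939, 51.35),
    "gpt2-medium": (2.8676, 17.5948),
}


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


for name, (loss, reported_ppl) in REPORTED.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.4f} (reported {reported_ppl})")
```

Small differences from the reported perplexities come from rounding of the loss values in the table.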
## Contacts
Feel free to contact us via: [email]()

3
added_tokens.json Normal file

@@ -0,0 +1,3 @@
{
"<|endoftext|>": 50257
}

41
config.json Normal file

@@ -0,0 +1,41 @@
{
"_name_or_path": ".",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.0,
"bos_token_id": 50256,
"embd_pdrop": 0.0,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1024,
"n_head": 16,
"n_inner": null,
"n_layer": 24,
"n_positions": 1024,
"n_special": 0,
"predict_special_tokens": true,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.0,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"torch_dtype": "float32",
"transformers_version": "4.31.0.dev0",
"use_cache": true,
"vocab_size": 50257
}

4
eval_results.json Normal file

@@ -0,0 +1,4 @@
{
"eval_loss": 9.505475044250488,
"eval_perplexity": 13433.072527481147
}

3
flax_model.msgpack Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:957c303a293fda99db02890235c9285d91937422f472ea39908a07cbf4b1d843
size 1419302302

6
generation_config.json Normal file

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 50256,
"eos_token_id": 50256,
"transformers_version": "4.31.0.dev0"
}

49997
merges.txt Normal file

File diff suppressed because one or more lines are too long

3
pytorch_model.bin Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e12e83b22c991afd00b726f783495a4ddcaba900b46b96ae8d8597d463e03889
size 1419383773


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9c35b6b9d67e4018ee0b2343c173f0e9f49965985ea8fbff4a01bcc900ffb757
size 2692416


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ea07b99602cc3a5397a2bf68cf6eeaea6127e3418910376e3b0f4fea131fa29a
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1e471e8e4a488d1da8ab3183b122247c2dfd2018247ca8284bf7b0759cbfa85d
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5a18b28b521f971d03a129cc2fd5266c73fd3e88d53451fc1a2050c6a61a4e9
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:66f519181490a76a0a063ee234912a7ab5e1e7b9127af847ca32ba4557b3d8eb
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5a7f07470f93190e986d793960b4dcf9efa6e1d2f434045edc5b6e4df4a1487a
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fc83dd3183155cde6dbeb57d22afa5cb8dca01175ccf78f25f577e1f5e40534b
size 5422137


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e385fa35a3cf17e671b2f5514bfad2a2a49292cd0c9301352691abfed683e1ba
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2f1dada02b19864cab54178b5251c30e3f4976765a8f72560663a104cf25030a
size 88


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:927181aaf62a375ae0566a77f4d8df680cfdd260ef890d4281e254d14509272e
size 1492


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f20b948a7912a46a18cac0b3664a7ec5dd002fc350431496af74615cd8729e03
size 15452461

6
special_tokens_map.json Normal file

@@ -0,0 +1,6 @@
{
"bos_token": "<|endoftext|>",
"eos_token": "<|endoftext|>",
"pad_token": "<|endoftext|>",
"unk_token": "<|endoftext|>"
}

100346
tokenizer.json Normal file

File diff suppressed because one or more lines are too long

9
tokenizer_config.json Normal file

@@ -0,0 +1,9 @@
{
"add_prefix_space": false,
"bos_token": "<|endoftext|>",
"clean_up_tokenization_spaces": true,
"eos_token": "<|endoftext|>",
"model_max_length": 1000000000000000019884624838656,
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<|endoftext|>"
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long