Initialize project; model provided by the ModelHub XC community
Model: chronopt-research/vietnamese-gpt2-medium Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
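
The attribute rules above route every matching file through Git LFS. As a rough illustration only, the sketch below approximates that matching with Python's `fnmatch` applied to a file's base name. Real gitattributes matching follows gitignore-style rules, so this is a simplification; the helper `is_lfs_tracked` and the shortened pattern list are hypothetical and not part of the repository.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the slash-free patterns from the .gitattributes file above.
LFS_PATTERNS = ["*.bin", "*.msgpack", "*.safetensors", "*.h5", "*.onnx", "*tfevents*"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)


# Weight files match a pattern; small JSON configs do not.
for candidate in ["pytorch_model.bin", "flax_model.msgpack", "config.json"]:
    print(candidate, is_lfs_tracked(candidate))
```

Under this approximation the weight files are stored as LFS pointers, while plain-text files such as `config.json` are committed directly.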
59
README.md
Normal file
@@ -0,0 +1,59 @@
---
license: apache-2.0
datasets:
- duongttr/vi-dataset-for-pretrain
language:
- vi
metrics:
- perplexity
pipeline_tag: text-generation
widget:
- text: Việt Nam là quốc gia có
- text: Hoàng Sa, Trường Sa là của
model-index:
- name: chronopt-research/vietnamese-gpt2-medium
  results:
  - task:
      type: text-generation
    metrics:
    - type: perplexity
      value: 17.5948
      verified: true
---

# Vietnamese `gpt2-medium`

<!-- Provide a quick summary of what the model is/does. -->

This is a `gpt2-medium` model pretrained on Vietnamese text with a causal language modeling (CLM) objective. GPT-2 itself was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).

## Model Description
GPT-2 was originally a Transformer model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labelling (which is why it can use large amounts of publicly available data), using an automatic process that generates inputs and labels from that text. More precisely, it was trained to predict the next word in a sentence.

This is the **medium version** of GPT-2, with about 355M parameters.

Other pretrained versions are available here: [gpt2-base](https://huggingface.co/chronopt-research/vietnamese-gpt2-base), [gpt2-large]()

## Dataset used for pretraining
This is a combination of multiple Vietnamese datasets for pretraining CLMs such as GPT, GPT2, etc.

The dataset consists of:
- [`vietgpt/covid_19_news_vi`](https://huggingface.co/datasets/vietgpt/covid_19_news_vi)
- [`hieunguyen1053/binhvq-news-corpus`](https://huggingface.co/datasets/hieunguyen1053/binhvq-news-corpus)
- [`oscar (unshuffled_deduplicated_vi)`](https://huggingface.co/datasets/oscar)
- [`vietgpt/wikipedia_vi`](https://huggingface.co/datasets/vietgpt/wikipedia_vi)

The combined version is available here: [duongttr/vi-dataset-for-pretrain](https://huggingface.co/datasets/duongttr/vi-dataset-for-pretrain)

## Hyperparameters & Results
We trained the model for ~100k steps with `lr=1e-4`, `bs=1920`, and `optimizer=adamw` on TPU-VM-3.8 from the [TRC Program](https://sites.research.google/trc/about/). Training took around **2.5 days**.

|Model|Eval Loss|Eval Perplexity|
|---|---|---|
|gpt2-base|3.939|51.35|
|**gpt2-medium**|**2.8676**|**17.5948**|
|gpt2-large|-|-|
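
The two result columns above are related: perplexity is the exponential of the evaluation loss, assuming the loss is the mean per-token cross-entropy in nats (this appears consistent with the reported numbers, but the README does not state it). A minimal sketch, with `perplexity_from_loss` as a hypothetical helper name:

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity is exp(loss) when loss is the mean cross-entropy in nats."""
    return math.exp(eval_loss)


# Eval losses copied from the results table above.
for name, loss in [("gpt2-base", 3.939), ("gpt2-medium", 2.8676)]:
    print(f"{name}: {perplexity_from_loss(loss):.2f}")
```

The printed values should land close to the table's perplexity column, up to rounding of the reported losses.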

## Contacts
Feel free to contact us via: [email]()
3
added_tokens.json
Normal file
@@ -0,0 +1,3 @@
{
  "<|endoftext|>": 50257
}
41
config.json
Normal file
@@ -0,0 +1,41 @@
{
  "_name_or_path": ".",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.31.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}
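
The size fields in this config determine the number of trainable weights. The sketch below counts them for a standard GPT-2 stack, assuming tied input/output embeddings and an MLP width of `4 * n_embd` when `n_inner` is null (both are the usual GPT-2 defaults, not something this commit states). The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count weights of a standard GPT-2 stack with tied input/output embeddings."""
    n_inner = 4 * n_embd  # assumed default when "n_inner" is null
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Fused QKV projection plus output projection, each with a bias.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two-layer feed-forward network, each layer with a bias.
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * n_embd)
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_layer_norm


# Values copied from the config.json above.
total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=1024, n_layer=24)
print(f"{total:,} parameters, about {total * 4 / 1e9:.2f} GB in float32")
```

At four bytes per `float32` weight, the result can be compared against the checkpoint sizes recorded in the LFS pointer files of this commit.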
4
eval_results.json
Normal file
@@ -0,0 +1,4 @@
{
  "eval_loss": 9.505475044250488,
  "eval_perplexity": 13433.072527481147
}
3
flax_model.msgpack
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:957c303a293fda99db02890235c9285d91937422f472ea39908a07cbf4b1d843
size 1419302302
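
The three lines above are a Git LFS pointer: the repository stores this small text stub, while the actual weights live in LFS storage. A minimal parsing sketch follows, assuming the simple `key value` line layout shown here; `parse_lfs_pointer` and the returned field names are hypothetical.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from flax_model.msgpack above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:957c303a293fda99db02890235c9285d91937422f472ea39908a07cbf4b1d843
size 1419302302
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size_bytes"] / 1e9, "GB")
```

The `size` field is the byte length of the real file, so it can be used to sanity-check a download before verifying the digest.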
6
generation_config.json
Normal file
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.31.0.dev0"
}
49997
merges.txt
Normal file
File diff suppressed because one or more lines are too long
3
pytorch_model.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e12e83b22c991afd00b726f783495a4ddcaba900b46b96ae8d8597d463e03889
size 1419383773
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9c35b6b9d67e4018ee0b2343c173f0e9f49965985ea8fbff4a01bcc900ffb757
size 2692416
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ea07b99602cc3a5397a2bf68cf6eeaea6127e3418910376e3b0f4fea131fa29a
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1e471e8e4a488d1da8ab3183b122247c2dfd2018247ca8284bf7b0759cbfa85d
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5a18b28b521f971d03a129cc2fd5266c73fd3e88d53451fc1a2050c6a61a4e9
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:66f519181490a76a0a063ee234912a7ab5e1e7b9127af847ca32ba4557b3d8eb
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5a7f07470f93190e986d793960b4dcf9efa6e1d2f434045edc5b6e4df4a1487a
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fc83dd3183155cde6dbeb57d22afa5cb8dca01175ccf78f25f577e1f5e40534b
size 5422137
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e385fa35a3cf17e671b2f5514bfad2a2a49292cd0c9301352691abfed683e1ba
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2f1dada02b19864cab54178b5251c30e3f4976765a8f72560663a104cf25030a
size 88
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:927181aaf62a375ae0566a77f4d8df680cfdd260ef890d4281e254d14509272e
size 1492
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f20b948a7912a46a18cac0b3664a7ec5dd002fc350431496af74615cd8729e03
size 15452461
6
special_tokens_map.json
Normal file
@@ -0,0 +1,6 @@
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
100346
tokenizer.json
Normal file
File diff suppressed because one or more lines are too long
9
tokenizer_config.json
Normal file
@@ -0,0 +1,9 @@
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
1
vocab.json
Normal file
File diff suppressed because one or more lines are too long