Initialize project; model provided by the ModelHub XC community

Model: VLAI-AIVN/gpt2-aio-continued-pretraining-poem Source: Original Platform
2026-05-30 22:37:56 +08:00
commit ca711cd83c
11 changed files with 300637 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,178 @@
 ---
 language:
 - vi
 library_name: transformers
 pipeline_tag: text-generation
 tags:
 - vietnamese
 - gpt2
 - causal-lm
 - text-generation
 - poetry
 - continued-pretraining
 ---
 # GPT2 AIO Continued Pretraining Poem
 `VLAI-AIVN/gpt2-aio-continued-pretraining-poem` is a Vietnamese GPT-2 causal language model adapted for poem-style generation through continued pretraining on a Vietnamese poem stanza corpus.
 This checkpoint is a stage-2 domain adaptation model. It starts from a general Vietnamese GPT-2 checkpoint from the same project and is then continued-pretrained on poem data. It is not an instruction-tuned chat model.
 ## Model Summary
 - Architecture: `GPT2LMHeadModel`
 - Layers: 12
 - Hidden size: 768
 - Attention heads: 12
 - Context length: 1024 tokens
 - Vocabulary size: 50,257
 - Parameter count: 124,439,808
 - Saved weights format: `safetensors`
 - Framework: Hugging Face Transformers
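The parameter count above can be reproduced from the architecture numbers alone. The following is a minimal sketch, assuming the standard GPT-2 layout (fused QKV projection, a 4x MLP because `n_inner` is null, biases on every linear layer, and an LM head tied to the token embedding so it adds no parameters). The function name `gpt2_param_count` is hypothetical and not part of the repository.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count GPT-2 parameters, assuming the LM head is tied to the token embedding."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # up projection + down projection
    layer_norms = 2 * (2 * d)                       # two LayerNorms per block (weight + bias)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d                        # final LayerNorm (weight + bias)
    return embeddings + n_layer * per_layer + final_layer_norm


total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
bf16_bytes = total * 2  # bfloat16 stores each parameter in 2 bytes
print(total, bf16_bytes)
```

At 2 bytes per parameter this gives 248,879,616 bytes, slightly below the 248,894,656 bytes recorded for `model.safetensors` in this commit. The remainder is presumably the file header and metadata.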
 ## Training Data
 This model was continued-pretrained on the Vietnamese poem stanza corpus described above.
 Dataset size used by the training script:
 - Total samples: `2766`
 - Train split: `2489`
 - Eval split: `277`
 Training examples are prepared as follows:
 - Each stanza is prefixed with `thơ:\n`
 - Text is normalized
 - An end-of-text token is appended
 - Samples are tokenized with `max_length=64`
 - Samples are padded to a fixed length
 - Prefix tokens and padding tokens are masked out in the labels
 This means the model is best prompted with the same prefix used during training.
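The preparation steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's training script: `build_example` is a hypothetical helper, the token ids are toy values rather than real tokenizer output, and `-100` is used as the label value because it is the default index ignored by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy loss by default


def build_example(prefix_ids, stanza_ids, eos_id, pad_id, max_length=64):
    """Return (input_ids, attention_mask, labels) for one stanza (hypothetical helper)."""
    # Prefix + stanza + end-of-text token. Truncation is an assumption here and
    # would drop the end-of-text token for stanzas that are too long.
    ids = (list(prefix_ids) + list(stanza_ids) + [eos_id])[:max_length]
    n_real = len(ids)
    n_prefix = min(len(prefix_ids), n_real)
    n_pad = max_length - n_real

    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    # Mask by position, not by token id, so the real end-of-text token stays
    # a training target even when it shares an id with the padding token.
    labels = [IGNORE_INDEX] * n_prefix + ids[n_prefix:] + [IGNORE_INDEX] * n_pad
    return input_ids, attention_mask, labels


# Toy ids for illustration only.
input_ids, attention_mask, labels = build_example(
    prefix_ids=[11, 12], stanza_ids=[21, 22, 23], eos_id=0, pad_id=0, max_length=8
)
print(input_ids)
print(attention_mask)
print(labels)
```

Masking by position matters for this checkpoint because the tokenizer files map the padding token and the end-of-text token to the same `<|endoftext|>` string. Masking every occurrence of the pad id would also hide the end-of-text target.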
 ## Training Procedure
 Important detail: despite the local folder name `sft_poem`, this run is not supervised fine-tuning in the instruction-tuning sense. It is continued pretraining for poem-domain adaptation.
 The training script loads a previously pretrained Vietnamese GPT-2 checkpoint from the same project and continues training it on the poem corpus. In local project config, the base checkpoint is referenced by `MODEL_DIR`, set to `./artifacts/checkpoints/rand-init/checkpoint-8000`.
 Saved training arguments from this checkpoint:
 | Setting | Value |
 | --- | --- |
 | `num_train_epochs` | `30` |
 | `per_device_train_batch_size` | `32` |
 | `per_device_eval_batch_size` | `32` |
 | `learning_rate` | `5e-5` |
 | `weight_decay` | `0.1` |
 | `warmup_ratio` | `0.1` |
 | `lr_scheduler_type` | `cosine` |
 | `bf16` | `true` |
 | `fp16` | `false` |
 | `eval_strategy` | `epoch` |
 | `save_strategy` | `epoch` |
 | `logging_steps` | `10` |
 | `save_total_limit` | `2` |
 | `load_best_model_at_end` | `true` |
 | `seed` | `42` |
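To make the `learning_rate`, `warmup_ratio`, and `lr_scheduler_type` settings concrete, here is a small sketch of a linear-warmup, cosine-decay schedule. It assumes a half-cosine decay to zero and uses the total step count of 2340 reported for this run. The function `lr_at_step` is hypothetical, and the exact rounding of the warmup step count inside the Trainer may differ.

```python
import math


def lr_at_step(step, total_steps=2340, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by half-cosine decay (sketch)."""
    warmup_steps = round(total_steps * warmup_ratio)  # 234 steps under these assumptions
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 117, 234, 468, 2340):
    print(step, lr_at_step(step))
```

Under these assumptions the learning rate peaks at step 234 and is still close to its peak at step 468, where the best checkpoint was saved.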
 ## Checkpoint Selection
 This point matters for anyone using or comparing the uploaded model:
 - Training ran to `global_step=2340` over `30` epochs
 - The best validation checkpoint was `checkpoint-468`
 - Best validation metric: `eval_loss=4.6743`
 - Best checkpoint epoch: `6.0`
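The step and epoch numbers above are consistent with the dataset size and batch size reported earlier. A quick arithmetic check, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept:

```python
import math

train_samples = 2489   # train split size from the Training Data section
batch_size = 32        # per_device_train_batch_size
epochs = 30            # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
best_epoch = 468 / steps_per_epoch  # epoch at which checkpoint-468 was saved

print(steps_per_epoch, total_steps, best_epoch)
```

With 78 optimizer steps per epoch, 30 epochs give 2340 steps, and `checkpoint-468` falls exactly at the end of epoch 6.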
 The project uses `load_best_model_at_end=True`, and a comparison of the saved file hashes confirms the outcome:
 - `final/model.safetensors` is identical to `checkpoint-468/model.safetensors`
 - `final/model.safetensors` is different from `checkpoint-2340/model.safetensors`
 So the `final/` folder being uploaded contains the best checkpoint weights, not the last checkpoint weights.
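A hash comparison of this kind can be reproduced with the standard library. The sketch below is one possible way to do it, not necessarily the method used by the project. The helper names are hypothetical, and the paths assume the `final/` and `checkpoint-468/` folders sit in the current working directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_weights(path_a, path_b):
    """True when both files have identical contents according to SHA-256."""
    return sha256_of(path_a) == sha256_of(path_b)


if __name__ == "__main__":
    # Assumed local paths; adjust them to the actual output directory.
    final = Path("final/model.safetensors")
    best = Path("checkpoint-468/model.safetensors")
    if final.exists() and best.exists():
        print(same_weights(final, best))
```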
 ## Training Metrics
 Recovered from `trainer_state.json`:
 - Best eval loss: `4.6743`
 - Approximate best perplexity: `107.16`
 - Best checkpoint step: `468`
 - Best checkpoint epoch: `6.0`
 For reference, the last logged training state before training ended was:
 - Final training step reached: `2340`
 - Final eval loss logged during training: `4.9351`
 - Last logged training loss: `2.6105`
 The uploaded `final/` model corresponds to the best checkpoint (step `468`), not the last training state.
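The perplexity figure follows from the eval loss, assuming the loss is the mean per-token cross-entropy in nats, so that perplexity is its exponential:

```python
import math

best_eval_loss = 4.6743   # eval loss at checkpoint-468
final_eval_loss = 4.9351  # eval loss logged at the end of training

best_ppl = math.exp(best_eval_loss)
final_ppl = math.exp(final_eval_loss)

print(f"best perplexity:  {best_ppl:.2f}")
print(f"final perplexity: {final_ppl:.2f}")
```

The gap between the two values, together with the much lower final training loss, is consistent with overfitting on a small corpus after epoch 6.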
 ## Usage
 Use the training prefix `thơ:\n` in prompts for the most consistent behavior.
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
 model_id = "VLAI-AIVN/gpt2-aio-continued-pretraining-poem"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 # device_map="auto" requires the `accelerate` package to be installed.
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     torch_dtype="auto",
     device_map="auto",
 )
 # Start the prompt with the training prefix, followed by an opening line.
 prompt = "thơ:\nTrăng lên đầu núi"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 # Sample a continuation; gradients are not needed for inference.
 with torch.no_grad():
     outputs = model.generate(
         **inputs,
         max_new_tokens=64,
         do_sample=True,
         temperature=0.8,
         top_p=0.95,
         repetition_penalty=1.1,
         pad_token_id=tokenizer.eos_token_id,
     )
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
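When generating many samples, it can help to keep the prefix handling in one place. The helpers below are a hypothetical convenience, not part of the repository. They build a prompt with the training prefix and remove that prefix from decoded text before display.

```python
PREFIX = "thơ:\n"  # training prefix described in the Training Data section


def build_prompt(opening: str = "") -> str:
    """Prepend the training prefix to an optional opening line."""
    return PREFIX + opening


def strip_prefix(generated: str) -> str:
    """Remove the training prefix from decoded text, if it is present."""
    if generated.startswith(PREFIX):
        return generated[len(PREFIX):]
    return generated


prompt = build_prompt("Trăng lên đầu núi")
print(repr(prompt))
print(strip_prefix(prompt))
```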
 ## Intended Uses
 - Vietnamese poem-style text generation experiments
 - Domain adaptation studies for Vietnamese language models
 - Further fine-tuning for poetry or literary generation tasks
 - Baseline experiments on small-domain continued pretraining
 ## Out-of-Scope Uses
 - Safety-critical decision making
 - Factual question answering without external verification
 - Use as a chat assistant without additional instruction tuning
 - Production deployment without evaluation, filtering, and prompt controls
 ## Limitations
 - The poem corpus is relatively small, so outputs may overfit stylistically or repeat phrasing patterns.
 - The model is optimized toward poem-like continuations, not broad conversational usefulness.
 - This is a domain-adapted generator, not an aligned assistant model.
 - The repository snapshot used here does not include an explicit license file. Confirm licensing before broad redistribution or commercial use.
 ## Repository Context
 This checkpoint comes from the Vietnamese GPT-2 pretraining project in this repository, which includes:
 - Base pretraining on general Vietnamese corpora
 - Continued pretraining on poem data
 - Mixed one-step pretraining experiments
 - Tokenizer training and data preparation scripts
 ## Citation
 If you use this model, cite the repository or link back to the Hugging Face model page.
--- a/config.json
+++ b/config.json
@@ -0,0 +1,39 @@
 {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 0,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.4",
  "use_cache": true,
  "vocab_size": 50257
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,6 @@
 {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.55.4"
 }
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:82a7f5c932a1f09c6223c39b60e1597df6f2fe19ec6798ddf79fdd2391038ee9
 size 248894656
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,30 @@
 {
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
 }
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,21 @@
 {
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": {},
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
 }
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:5ce422633490dfb702d0ae86d6191810b08e16d9ac7381ba7451210d909d04f1
 size 5841
--- a/vocab.json
+++ b/vocab.json