---
language:
- vi
library_name: transformers
pipeline_tag: text-generation
tags:
- vietnamese
- gpt2
- causal-lm
- text-generation
- poetry
- continued-pretraining
---

# GPT2 AIO Continued Pretraining Poem

`VLAI-AIVN/gpt2-aio-continued-pretraining-poem` is a Vietnamese GPT-2 causal language model adapted for poem-style generation through continued pretraining on a Vietnamese poem stanza corpus.

This checkpoint is a stage-2 domain adaptation model. It starts from a general Vietnamese GPT-2 checkpoint from the same project and is then continued-pretrained on poem data. It is not an instruction-tuned chat model.

## Model Summary

- Architecture: `GPT2LMHeadModel`
- Layers: 12
- Hidden size: 768
- Attention heads: 12
- Context length: 1024 tokens
- Vocabulary size: 50,257
- Parameter count: 124,439,808
- Saved weights format: `safetensors`
- Framework: Hugging Face Transformers

## Training Data

This model was continued-pretrained on the poem corpus:

Dataset size used by the training script:

- Total samples: `2766`
- Train split: `2489`
- Eval split: `277`

Training examples are prepared as follows:

- Each stanza is prefixed with `thơ:\n`
- Text is normalized
- An end-of-text token is appended
- Samples are tokenized with `max_length=64`
- Padding is applied to fixed length
- Prefix tokens and padding tokens are masked out in the labels

This means the model is best prompted with the same prefix used during training.

## Training Procedure

Important detail: despite the local folder name `sft_poem`, this run is not supervised fine-tuning in the instruction-tuning sense. It is continued pretraining for poem-domain adaptation.

The training script loads a previously pretrained Vietnamese GPT-2 checkpoint from the same project and continues training it on the poem corpus. In local project config, the base checkpoint is referenced by `MODEL_DIR`, set to `./artifacts/checkpoints/rand-init/checkpoint-8000`.

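
The preparation steps listed under Training Data (prefix, end-of-text token, fixed length of 64, masked labels) can be illustrated with a minimal sketch. This is not the project's training script: `build_example` and `IGNORE_INDEX` are hypothetical names, the token ids are toy values standing in for tokenizer output, and it assumes that masking uses the label value `-100` (the value PyTorch's cross-entropy loss ignores by default), that over-long samples are truncated from the right, and that the end-of-text token stays in the labels.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss (assumed masking value)


def build_example(prefix_ids, stanza_ids, eos_id, pad_id, max_length=64):
    """Return (input_ids, attention_mask, labels) for one stanza.

    Hypothetical helper: ids are assumed to come from the model's tokenizer.
    Sequences longer than `max_length` are cut from the right (assumption).
    """
    tokens = (list(prefix_ids) + list(stanza_ids) + [eos_id])[:max_length]
    n_real = len(tokens)
    n_pad = max_length - n_real
    n_prefix = min(len(prefix_ids), n_real)

    input_ids = tokens + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    # Prefix and padding positions are masked by position, not by token id,
    # so the end-of-text token still counts toward the loss even when
    # pad_id == eos_id (common for GPT-2 tokenizers).
    labels = [IGNORE_INDEX] * n_prefix + tokens[n_prefix:] + [IGNORE_INDEX] * n_pad
    return input_ids, attention_mask, labels


# Toy ids for illustration only; real ids come from the tokenizer.
input_ids, attention_mask, labels = build_example(
    prefix_ids=[11, 12], stanza_ids=[21, 22, 23], eos_id=0, pad_id=0, max_length=8
)
print(input_ids)       # [11, 12, 21, 22, 23, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
print(labels)          # [-100, -100, 21, 22, 23, 0, -100, -100]
```

Because the prefix positions carry no loss, the model learns to continue text after `thơ:\n` without being trained to produce the prefix itself, which is consistent with the advice to include the prefix in prompts.
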
Saved training arguments from this checkpoint:

| Setting | Value |
| --- | --- |
| `num_train_epochs` | `30` |
| `per_device_train_batch_size` | `32` |
| `per_device_eval_batch_size` | `32` |
| `learning_rate` | `5e-5` |
| `weight_decay` | `0.1` |
| `warmup_ratio` | `0.1` |
| `lr_scheduler_type` | `cosine` |
| `bf16` | `true` |
| `fp16` | `false` |
| `eval_strategy` | `epoch` |
| `save_strategy` | `epoch` |
| `logging_steps` | `10` |
| `save_total_limit` | `2` |
| `load_best_model_at_end` | `true` |
| `seed` | `42` |

## Checkpoint Selection

This point matters for anyone using or comparing the uploaded model:

- Training ran to `global_step=2340` over `30` epochs
- The best validation checkpoint was `checkpoint-468`
- Best validation metric: `eval_loss=4.6743`
- Best checkpoint epoch: `6.0`

The project uses `load_best_model_at_end=True`, and the saved file hash confirms that:

- `final/model.safetensors` is identical to `checkpoint-468/model.safetensors`
- `final/model.safetensors` is different from `checkpoint-2340/model.safetensors`

So the `final/` folder being uploaded contains the best checkpoint weights, not the last checkpoint weights.

## Training Metrics

Recovered from `trainer_state.json`:

- Best eval loss: `4.6743`
- Approximate best perplexity: `107.16`
- Best checkpoint step: `468`
- Best checkpoint epoch: `6.0`

For reference, the last logged training state before training ended was:

- Final training step reached: `2340`
- Final eval loss logged during training: `4.9351`
- Last logged training loss: `2.6105`

The uploaded `final/` model corresponds to the best checkpoint section above.

## Usage

Use the training prefix `thơ:\n` in prompts for the most consistent behavior.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "VLAI-AIVN/gpt2-aio-continued-pretraining-poem"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Start the prompt with the same prefix that was used during training.
prompt = "thơ:\nTrăng lên đầu núi"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; gradients are not needed for inference.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Intended Uses

- Vietnamese poem-style text generation experiments
- Domain adaptation studies for Vietnamese language models
- Further fine-tuning for poetry or literary generation tasks
- Baseline experiments on small-domain continued pretraining

## Out-of-Scope Uses

- Safety-critical decision making
- Factual question answering without external verification
- Use as a chat assistant without additional instruction tuning
- Production deployment without evaluation, filtering, and prompt controls

## Limitations

- The poem corpus is relatively small, so outputs may overfit stylistically or repeat phrasing patterns.
- The model is optimized toward poem-like continuations, not broad conversational usefulness.
- This is a domain-adapted generator, not an aligned assistant model.
- The repository snapshot used here does not declare an explicit license file locally. Confirm licensing before broad redistribution or commercial use.

## Repository Context

This checkpoint comes from the Vietnamese GPT-2 pretraining project in this repository, which includes:

- Base pretraining on general Vietnamese corpora
- Continued pretraining on poem data
- Mixed one-step pretraining experiments
- Tokenizer training and data preparation scripts

## Citation

If you use this model, cite the repository or link back to the Hugging Face model page.