| language | library_name | pipeline_tag | tags |
|---|---|---|---|
| | transformers | text-generation | |
GPT2 AIO Continued Pretraining Poem
VLAI-AIVN/gpt2-aio-continued-pretraining-poem is a Vietnamese GPT-2 causal language model adapted for poem-style generation through continued pretraining on a Vietnamese poem stanza corpus.
This checkpoint is a stage-2 domain adaptation model. It starts from a general Vietnamese GPT-2 checkpoint from the same project and is then continued-pretrained on poem data. It is not an instruction-tuned chat model.
Model Summary
- Architecture: GPT2LMHeadModel
- Layers: 12
- Hidden size: 768
- Attention heads: 12
- Context length: 1024 tokens
- Vocabulary size: 50,257
- Parameter count: 124,439,808
- Saved weights format: safetensors
- Framework: Hugging Face Transformers
Training Data
This model was continued-pretrained on the poem corpus:
Dataset size used by the training script:
- Total samples: 2766
- Train split: 2489
- Eval split: 277
Training examples are prepared as follows:
- Each stanza is prefixed with thơ:\n
- Text is normalized
- An end-of-text token is appended
- Samples are tokenized with max_length=64
- Padding is applied to fixed length
- Prefix tokens and padding tokens are masked out in the labels
This means the model is best prompted with the same prefix used during training.
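The preparation steps above can be sketched on token ID lists as follows. This is a minimal illustration, not the project's training script: the function name build_example, the truncation behavior, and the use of -100 as the ignored label value (the default ignore index of PyTorch's cross-entropy loss) are assumptions. Text normalization and tokenization are left out, so the inputs are assumed to be already-tokenized IDs.

```python
# Hypothetical sketch of the example layout described above.
# Assumption: masked label positions use -100, which PyTorch's
# cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def build_example(prefix_ids, stanza_ids, eos_id, pad_id, max_length=64):
    """Build fixed-length input_ids, attention_mask, and labels."""
    # Prefix, then stanza, then the end-of-text token; truncate to max_length.
    # (Assumption: simple right truncation, which may cut off the EOS token.)
    ids = (list(prefix_ids) + list(stanza_ids) + [eos_id])[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real

    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad

    # Labels copy the inputs, except that prefix and padding positions
    # are masked so they do not contribute to the loss.
    labels = list(input_ids)
    for i in range(min(len(prefix_ids), max_length)):
        labels[i] = IGNORE_INDEX
    for i in range(n_real, max_length):
        labels[i] = IGNORE_INDEX

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Toy IDs for illustration only; real IDs come from the tokenizer.
example = build_example([1, 2], [10, 11, 12], eos_id=0, pad_id=0, max_length=8)
print(example["labels"])
```

Padding is masked by position rather than by token ID, so the appended end-of-text token still counts toward the loss even when the padding token and the end-of-text token share an ID.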
Training Procedure
Important detail: despite the local folder name sft_poem, this run is not supervised fine-tuning in the instruction-tuning sense. It is continued pretraining for poem-domain adaptation.
The training script loads a previously pretrained Vietnamese GPT-2 checkpoint from the same project and continues training it on the poem corpus. In the local project config, the base checkpoint is referenced by MODEL_DIR, which is set to ./artifacts/checkpoints/rand-init/checkpoint-8000.
Saved training arguments from this checkpoint:
| Setting | Value |
|---|---|
| num_train_epochs | 30 |
| per_device_train_batch_size | 32 |
| per_device_eval_batch_size | 32 |
| learning_rate | 5e-5 |
| weight_decay | 0.1 |
| warmup_ratio | 0.1 |
| lr_scheduler_type | cosine |
| bf16 | true |
| fp16 | false |
| eval_strategy | epoch |
| save_strategy | epoch |
| logging_steps | 10 |
| save_total_limit | 2 |
| load_best_model_at_end | true |
| seed | 42 |
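To make the warmup_ratio and lr_scheduler_type settings concrete, the sketch below computes the learning rate under linear warmup followed by cosine decay to zero, which is the shape the cosine scheduler in Transformers produces by default. The total of 2340 steps is taken from the checkpoint notes in this card. Rounding the warmup to 10% of that total and the helper name lr_at are assumptions made for illustration.

```python
import math

PEAK_LR = 5e-5          # learning_rate from the table above
TOTAL_STEPS = 2340      # global_step reported for this run
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)  # assumption: warmup_ratio applied to total steps


def lr_at(step):
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Cosine decay from the peak learning rate down to 0.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```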
Checkpoint Selection
This point matters for anyone using or comparing the uploaded model:
- Training ran to global_step=2340 over 30 epochs
- The best validation checkpoint was checkpoint-468
- Best validation metric: eval_loss=4.6743
- Best checkpoint epoch: 6.0
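The step counts above are consistent with the dataset and batch sizes reported in this card. The short check below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. None of these is stated explicitly in the card, but the numbers match under those assumptions.

```python
import math

TRAIN_SAMPLES = 2489   # train split size
BATCH_SIZE = 32        # per_device_train_batch_size (assuming one device)
EPOCHS = 30            # num_train_epochs
BEST_STEP = 468        # step of the best validation checkpoint

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
best_epoch = BEST_STEP / steps_per_epoch

print(steps_per_epoch, total_steps, best_epoch)  # 78 2340 6.0
```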
The project uses load_best_model_at_end=True, and the saved file hash confirms that:
- final/model.safetensors is identical to checkpoint-468/model.safetensors
- final/model.safetensors is different from checkpoint-2340/model.safetensors
So the final/ folder being uploaded contains the best checkpoint weights, not the last checkpoint weights.
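A comparison of this kind can be reproduced with a file hash. The sketch below is one way to do it, not necessarily how the original check was run: the choice of SHA-256, the helper names, and the assumption that the three folders sit in the current working directory are all illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_weights(path_a, path_b):
    """True if both weight files have byte-identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Assumed local layout; the comparison only runs if the files are present.
final = Path("final/model.safetensors")
best = Path("checkpoint-468/model.safetensors")
last = Path("checkpoint-2340/model.safetensors")

if all(p.exists() for p in (final, best, last)):
    print("final == checkpoint-468: ", same_weights(final, best))
    print("final == checkpoint-2340:", same_weights(final, last))
```

Note that save_total_limit=2 means intermediate checkpoints are pruned during training, so only the folders that survived the run are available for comparison.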
Training Metrics
Recovered from trainer_state.json:
- Best eval loss: 4.6743
- Approximate best perplexity: 107.16
- Best checkpoint step: 468
- Best checkpoint epoch: 6.0
For reference, the last logged training state before training ended was:
- Final training step reached: 2340
- Final eval loss logged during training: 4.9351
- Last logged training loss: 2.6105
The uploaded final/ model corresponds to the best checkpoint (step 468, eval loss 4.6743), not to this last logged state.
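The perplexity figure follows from the eval loss by exponentiation. The snippet below assumes that eval_loss is the mean per-token cross-entropy in nats, which is what the Transformers Trainer reports for causal language modeling. The helper name perplexity is illustrative.

```python
import math


def perplexity(eval_loss):
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(eval_loss)


best_ppl = perplexity(4.6743)   # best checkpoint (step 468)
last_ppl = perplexity(4.9351)   # last evaluation logged during training

print(f"best: {best_ppl:.2f}  last: {last_ppl:.2f}")
```

The first value reproduces the approximate best perplexity of 107.16 listed above. The higher perplexity at the end of training, alongside a much lower training loss, is the usual sign of overfitting on a small corpus.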
Usage
Use the training prefix thơ:\n in prompts for the most consistent behavior.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "VLAI-AIVN/gpt2-aio-continued-pretraining-poem"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # requires the accelerate package
)

# Start with the training prefix, followed by an optional opening line.
prompt = "thơ:\nTrăng lên đầu núi"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; gradients are not needed for inference.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
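Because the decoded output includes the prompt, it can be convenient to wrap the prefix handling in two small helpers. These are hypothetical conveniences, not part of the repository: make_prompt and strip_prefix are illustrative names.

```python
PREFIX = "thơ:\n"  # training prefix described in this card


def make_prompt(opening_line=""):
    """Prepend the training prefix to an optional opening line."""
    return PREFIX + opening_line


def strip_prefix(generated_text):
    """Remove the training prefix from decoded output, if present."""
    if generated_text.startswith(PREFIX):
        return generated_text[len(PREFIX):]
    return generated_text


prompt = make_prompt("Trăng lên đầu núi")
print(repr(prompt))
print(strip_prefix(prompt))
```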
Intended Uses
- Vietnamese poem-style text generation experiments
- Domain adaptation studies for Vietnamese language models
- Further fine-tuning for poetry or literary generation tasks
- Baseline experiments on small-domain continued pretraining
Out-of-Scope Uses
- Safety-critical decision making
- Factual question answering without external verification
- Use as a chat assistant without additional instruction tuning
- Production deployment without evaluation, filtering, and prompt controls
Limitations
- The poem corpus is relatively small, so outputs may overfit stylistically or repeat phrasing patterns.
- The model is optimized toward poem-like continuations, not broad conversational usefulness.
- This is a domain-adapted generator, not an aligned assistant model.
- The repository snapshot used here does not declare an explicit license file locally. Confirm licensing before broad redistribution or commercial use.
Repository Context
This checkpoint comes from the Vietnamese GPT-2 pretraining project in this repository, which includes:
- Base pretraining on general Vietnamese corpora
- Continued pretraining on poem data
- Mixed one-step pretraining experiments
- Tokenizer training and data preparation scripts
Citation
If you use this model, cite the repository or link back to the Hugging Face model page.