VLAI-AIVN/gpt2-aio-continued-pretraining-poem

ModelHub XC ca711cd83c Initialize project; model provided by the ModelHub XC community

Model: VLAI-AIVN/gpt2-aio-continued-pretraining-poem
Source: Original Platform

2026-05-30 22:37:56 +08:00

GPT2 AIO Continued Pretraining Poem

VLAI-AIVN/gpt2-aio-continued-pretraining-poem is a Vietnamese GPT-2 causal language model adapted for poem-style generation through continued pretraining on a Vietnamese poem stanza corpus.

This checkpoint is a stage-2 domain adaptation model. It starts from a general Vietnamese GPT-2 checkpoint from the same project and is then continued-pretrained on poem data. It is not an instruction-tuned chat model.

Model Summary

Architecture: GPT2LMHeadModel
Layers: 12
Hidden size: 768
Attention heads: 12
Context length: 1024 tokens
Vocabulary size: 50,257
Parameter count: 124,439,808
Saved weights format: safetensors
Framework: Hugging Face Transformers
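
The parameter count above can be reproduced from the other figures in the summary. The following is a minimal sketch, assuming the standard GPT-2 layout (tied input and output embeddings, learned position embeddings, biases on every linear layer, a 4x MLP expansion); the function name `gpt2_param_count` is illustrative and not part of the project.

```python
# Sketch: derive the parameter count of a GPT-2 style decoder from its sizes.
# Assumption: standard GPT-2 layout with tied embeddings, so the LM head adds
# no extra parameters.

def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    token_emb = vocab_size * n_embd          # token embedding matrix
    pos_emb = n_positions * n_embd           # learned position embeddings
    layer_norm = 2 * n_embd                  # weight + bias of one LayerNorm
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, both with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp      # two LayerNorms per block
    return token_emb + pos_emb + n_layer * block + layer_norm  # plus final LayerNorm

total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total:,}")  # 124,439,808
```

The result matches the reported count, which is consistent with the weights using tied embeddings.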

Training Data

This model was continued-pretrained on a Vietnamese poem stanza corpus.

Dataset size used by the training script:

Total samples: 2766
Train split: 2489
Eval split: 277

Training examples are prepared as follows:

Each stanza is prefixed with thơ:\n
Text is normalized
An end-of-text token is appended
Samples are tokenized with max_length=64
Padding is applied to fixed length
Prefix tokens and padding tokens are masked out in the labels

This means the model is best prompted with the same prefix used during training.
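
The preparation steps listed above can be sketched on plain token-id lists. This is an illustration rather than the project's training script: `build_example` is a hypothetical helper, the token ids in the usage lines are made up, and `-100` is assumed as the ignore value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
# Sketch: build one fixed-length training example with prefix and padding
# positions masked out of the labels. Works on plain lists of token ids.

IGNORE_INDEX = -100  # assumption: default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prefix_ids, stanza_ids, eos_id, pad_id, max_length=64):
    # Prefix + stanza + end-of-text token, truncated to max_length.
    # Note: truncation can drop the end-of-text token on long stanzas.
    input_ids = (list(prefix_ids) + list(stanza_ids) + [eos_id])[:max_length]
    n_real = len(input_ids)
    n_pad = max_length - n_real

    # Labels start as a copy of the inputs; prefix positions are then masked.
    labels = list(input_ids)
    n_prefix = min(len(prefix_ids), n_real)
    labels[:n_prefix] = [IGNORE_INDEX] * n_prefix

    # Pad to fixed length; padded positions are masked by position, not by id,
    # so a real end-of-text token stays in the labels even if pad_id == eos_id.
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }


# Toy usage with made-up ids and a short max_length for readability.
example = build_example(prefix_ids=[10, 11], stanza_ids=[20, 21, 22],
                        eos_id=0, pad_id=0, max_length=8)
print(example["labels"])  # [-100, -100, 20, 21, 22, 0, -100, -100]
```

Masking by position matters for GPT-2 style tokenizers, where the padding token is often set to the end-of-text token.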

Training Procedure

Important detail: despite the local folder name sft_poem, this run is not supervised fine-tuning in the instruction-tuning sense. It is continued pretraining for poem-domain adaptation.

The training script loads a previously pretrained Vietnamese GPT-2 checkpoint from the same project and continues training it on the poem corpus. In the local project config, the base checkpoint is referenced by MODEL_DIR, which is set to ./artifacts/checkpoints/rand-init/checkpoint-8000.

Saved training arguments from this checkpoint:

| Setting | Value |
| --- | --- |
| `num_train_epochs` | `30` |
| `per_device_train_batch_size` | `32` |
| `per_device_eval_batch_size` | `32` |
| `learning_rate` | `5e-5` |
| `weight_decay` | `0.1` |
| `warmup_ratio` | `0.1` |
| `lr_scheduler_type` | `cosine` |
| `bf16` | `true` |
| `fp16` | `false` |
| `eval_strategy` | `epoch` |
| `save_strategy` | `epoch` |
| `logging_steps` | `10` |
| `save_total_limit` | `2` |
| `load_best_model_at_end` | `true` |
| `seed` | `42` |
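
To make the `learning_rate`, `warmup_ratio`, and `lr_scheduler_type` settings concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule in plain Python. It assumes a half-cosine decay to zero and 2340 total optimizer steps (the `global_step` reported for this run); the function `lr_at_step` and the rounding of the warmup length are illustrative choices, not the Trainer's exact implementation.

```python
import math

# Sketch: learning rate as a function of the optimizer step for a
# linear-warmup + cosine-decay schedule.
# Assumptions: decay follows half a cosine period down to zero, and the
# warmup length is total_steps * warmup_ratio rounded to the nearest step.


def lr_at_step(step, total_steps=2340, peak_lr=5e-5, warmup_ratio=0.1):
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 117, 234, 468, 2340):
    print(step, f"{lr_at_step(step):.3e}")
```

Under these assumptions the warmup covers the first 234 steps, so the learning rate is still close to its peak at step 468.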

Checkpoint Selection

This point matters for anyone using or comparing the uploaded model:

Training ran to global_step=2340 over 30 epochs
The best validation checkpoint was checkpoint-468
Best validation metric: eval_loss=4.6743
Best checkpoint epoch: 6.0
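
The step and epoch figures above are consistent with the split sizes and batch size reported earlier. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and a kept final partial batch; none of these is stated in the card, but together they reproduce the reported numbers.

```python
import math

# Sketch: relate dataset size, batch size, and epochs to optimizer steps.
# Assumptions: one device, gradient_accumulation_steps = 1, last partial
# batch of each epoch is kept.

train_samples = 2489
batch_size = 32
num_epochs = 30
best_step = 468

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
best_epoch = best_step / steps_per_epoch

print(steps_per_epoch)  # 78
print(total_steps)      # 2340
print(best_epoch)       # 6.0
```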

The project uses load_best_model_at_end=True, and a comparison of the saved file hashes confirms its effect:

final/model.safetensors is identical to checkpoint-468/model.safetensors
final/model.safetensors is different from checkpoint-2340/model.safetensors

So the final/ folder being uploaded contains the best checkpoint weights, not the last checkpoint weights.
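
A comparison like the one described above can be reproduced with a few lines of standard-library Python. This is a sketch: the card does not say which hash was used, so SHA-256 is an assumption, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# Sketch: compare two weight files by content hash.
# Assumption: SHA-256; the original check may have used a different digest.


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files have identical bytes (up to hash collisions)."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Example usage with the folder names mentioned above (not run here):
# same_file_contents("final/model.safetensors", "checkpoint-468/model.safetensors")
# same_file_contents("final/model.safetensors", "checkpoint-2340/model.safetensors")
```

Reading in chunks keeps memory use flat, which helps for weight files of several hundred megabytes.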

Training Metrics

Recovered from trainer_state.json:

Best eval loss: 4.6743
Approximate best perplexity: 107.16
Best checkpoint step: 468
Best checkpoint epoch: 6.0
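
The perplexity figure follows from the eval loss, assuming the loss is a mean per-token cross-entropy in nats (the usual convention for causal language model training in Transformers). A minimal sketch, with `perplexity` as an illustrative helper name:

```python
import math

# Sketch: convert a mean per-token cross-entropy loss (in nats) to perplexity.


def perplexity(mean_loss: float) -> float:
    return math.exp(mean_loss)


best_ppl = perplexity(4.6743)   # best eval loss, step 468
last_ppl = perplexity(4.9351)   # final eval loss, step 2340

print(f"best: {best_ppl:.2f}")  # best: 107.16
print(f"last: {last_ppl:.2f}")
```

Because perplexity is exponential in the loss, the gap of about 0.26 nats between the best and final eval losses corresponds to a noticeably higher perplexity for the last checkpoint.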

For reference, the last logged training state before training ended was:

Final training step reached: 2340
Final eval loss logged during training: 4.9351
Last logged training loss: 2.6105

The uploaded final/ model corresponds to the best checkpoint reported above (step 468), not to this last training state.

Usage

Use the training prefix thơ:\n in prompts for the most consistent behavior.

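from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "VLAI-AIVN/gpt2-aio-continued-pretraining-poem"

# Load the tokenizer and model. device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Start the prompt with the training prefix, followed by an opening line.
prompt = "thơ:\nTrăng lên đầu núi"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; gradients are not needed for generation.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )

# The decoded text contains the prompt followed by the generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))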
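
Since the decoded output includes the prompt, a small pair of helpers can keep the prefix handling in one place. This is an optional sketch; `build_prompt` and `strip_prefix` are hypothetical names, not part of the model repository.

```python
# Sketch: build prompts with the training prefix and remove it from outputs.

PREFIX = "thơ:\n"  # prefix used during training, per the model card


def build_prompt(opening_line: str = "") -> str:
    """Prepend the training prefix to an optional opening line."""
    return PREFIX + opening_line


def strip_prefix(generated: str) -> str:
    """Drop the training prefix from decoded text if it is present."""
    if generated.startswith(PREFIX):
        return generated[len(PREFIX):]
    return generated  # leave the text untouched if the prefix is missing


prompt = build_prompt("Trăng lên đầu núi")
print(repr(prompt))
print(strip_prefix(prompt))
```

Passing an empty opening line to `build_prompt` lets the model write a stanza from the prefix alone.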

Intended Uses

Vietnamese poem-style text generation experiments
Domain adaptation studies for Vietnamese language models
Further fine-tuning for poetry or literary generation tasks
Baseline experiments on small-domain continued pretraining

Out-of-Scope Uses

Safety-critical decision making
Factual question answering without external verification
Use as a chat assistant without additional instruction tuning
Production deployment without evaluation, filtering, and prompt controls

Limitations

The poem corpus is relatively small, so outputs may overfit stylistically or repeat phrasing patterns.
The model is optimized toward poem-like continuations, not broad conversational usefulness.
This is a domain-adapted generator, not an aligned assistant model.
The repository snapshot used here does not include an explicit license file. Confirm licensing before broad redistribution or commercial use.

Repository Context

This checkpoint comes from the Vietnamese GPT-2 pretraining project in this repository, which includes:

Base pretraining on general Vietnamese corpora
Continued pretraining on poem data
Mixed one-step pretraining experiments
Tokenizer training and data preparation scripts

Citation

If you use this model, cite the repository or link back to the Hugging Face model page.