---
language:
- vi
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- gpt2
- vietnamese
- text-generation
- causal-lm
model_type: gpt2
datasets:
- bkai-foundation-models/BKAINewsCorpus
---

# ViGPT2 AIO

ViGPT2 AIO is a Vietnamese GPT-2 style causal language model pretrained for open-ended text generation and general Vietnamese language modeling.

This model was developed to support teaching and hands-on practice in the [AIO](https://aivietnam.edu.vn/) course, while also serving as a general Vietnamese pretrained language model for experimentation and downstream adaptation.

## Model Details

- **Model name:** ViGPT2 AIO
- **Architecture:** GPT-2
- **Task:** Causal language modeling / text generation
- **Language:** Vietnamese
- **Library:** Transformers
- **Weights format:** Safetensors

## Training Data

The model was pretrained on a mixture of Vietnamese news and Vietnamese Wikipedia text.

### Data Sources

- **BKAINewsCorpus** from `bkai-foundation-models/BKAINewsCorpus`
- **Vietnamese Wikipedia** collected through a custom crawling pipeline, then cleaned from raw wikitext into plain text

### Data Processing

Before pretraining, the corpora were cleaned and deduplicated.

- The tokenizer was trained on the raw corpora:
  - `bkai_train.parquet`
  - `vi_wiki_articles_clean.parquet`
- The language model was pretrained on deduplicated versions of these corpora.
- In the final training mixture, the Vietnamese Wikipedia corpus was upweighted relative to the news corpus.

### Training Mixture

The pretraining mixture used:

- `bkai_train.parquet` with weight **1**
- `vi_wiki_articles_clean.parquet` with weight **3**

## Training Objective

The model was pretrained with a standard causal language modeling objective, where the model learns to predict the next token in a sequence.

## Limitations

- The model may generate incorrect, nonsensical, or fabricated content.
- Outputs can reflect biases or artifacts present in the pretraining data.
- The model is not a verified factual source and should not be used without human validation in high-stakes settings.

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "VLAI-AIVN/vigpt2-aio"

# Fall back to the EOS token if the tokenizer defines no pad token.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id)
model.config.pad_token_id = tokenizer.pad_token_id

# Use the GPU when available and switch the model to inference mode.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "Việt Nam là một đất nước"  # "Vietnam is a country"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation with temperature and nucleus (top-p) sampling.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
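
The Training Objective section describes next-token prediction. As a minimal sketch of that loss, the snippet below averages the negative log-likelihood of each next token over a toy sequence. It is illustrative only and is not the training code used for this model: the `causal_lm_loss` helper, the four-token vocabulary, and the uniform probabilities are assumptions made up for the example.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Average next-token negative log-likelihood for one sequence.

    token_ids: list of token ids [x_0, ..., x_{T-1}].
    next_token_probs: list of T-1 dicts; entry t maps a candidate token id to
        the (hypothetical) model probability that it follows x_0..x_t.
    """
    if len(next_token_probs) != len(token_ids) - 1:
        raise ValueError("need exactly one distribution per predicted position")
    # Targets are the inputs shifted left by one: position t predicts x_{t+1}.
    targets = token_ids[1:]
    nll = [-math.log(probs[target]) for probs, target in zip(next_token_probs, targets)]
    return sum(nll) / len(nll)


# Toy example (assumed data): vocabulary of 4 ids, uniform predictions everywhere.
uniform = {i: 0.25 for i in range(4)}
tokens = [0, 2, 1, 3]
loss = causal_lm_loss(tokens, [uniform] * 3)
print(f"loss = {loss:.4f}, perplexity = {math.exp(loss):.2f}")
# loss = 1.3863, perplexity = 4.00
```

With uniform predictions over four tokens the loss equals ln 4, which corresponds to a perplexity of 4. In practice the same quantity is computed over the model's logits across many sequences rather than from hand-written probability tables.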