---
base_model: Qwen/Qwen3-4B
library_name: transformers
model_name: qwen3-4b-legal-pretrain
tags:
- generated_from_trainer
- sft
- trl
licence: license
extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Email (use the email registered for the VLSP competition): text
  Phone number (use the phone number registered for the VLSP competition): text
  Team Name: text
  Country: country
---

# 🧠 Vietnamese Legal Base Model - Qwen3-4B (Pretrained)

This model is a Vietnamese legal-domain base model continually pretrained from **Qwen3-4B** and adapted for legal text understanding and legal question answering.

---

## 📌 Overview

- **Base model**: Qwen3-4B
- **Domain**: Vietnamese legal language
- **Training objective**: Continual pretraining on legal-domain texts

---

## 📚 Training Data

The model was continually pretrained on a curated corpus of Vietnamese legal texts, including:

- Official legal documents (laws, codes, decrees, etc.)
- Legal news articles and commentary

All texts were collected from publicly available and legally permitted sources, then preprocessed to ensure quality and consistency for domain adaptation.

## 📊 Dataset Statistics

The training corpus contains approximately **144,000 Vietnamese texts** in two categories:

- **~96,000 legal documents**: Official sources such as laws, decrees, circulars, etc.
- **~48,000 legal news articles**: Collected from online legal news portals, featuring case studies, legal interpretations, etc.

## Training Configuration

The model was trained with full-parameter updates (no quantization or LoRA). The setup used for continual pretraining is listed below.

### 🔧 Model & Tokenization

- **Base model**: `Qwen/Qwen3-4B`
- **Maximum sequence length**: `4096`
- **Block size**: `4096`

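
The card lists a block size of `4096` but does not say how the corpus was cut into blocks. A common approach in continual pretraining is to concatenate the tokenized documents and slice the stream into fixed-size blocks; the sketch below illustrates that idea only and is not the organizers' actual preprocessing. The helper `pack_into_blocks` and the separator id `eos_id=0` are hypothetical names and values chosen for illustration.

```python
from itertools import chain
from typing import Iterable, List

BLOCK_SIZE = 4096  # "Block size" from the training configuration


def pack_into_blocks(
    tokenized_docs: Iterable[List[int]],
    block_size: int = BLOCK_SIZE,
    eos_id: int = 0,  # hypothetical separator id; use the tokenizer's real EOS id
) -> List[List[int]]:
    """Concatenate tokenized documents and cut them into equal-sized blocks."""
    # Append the separator to every document, then flatten into one token stream.
    stream = list(chain.from_iterable(doc + [eos_id] for doc in tokenized_docs))
    # Drop the tail that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i : i + block_size] for i in range(0, usable, block_size)]


# Toy run with a tiny block size so the result is easy to inspect.
blocks = pack_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4, eos_id=0)
print(blocks)
```

With real data, the token id lists would come from the model's tokenizer and `block_size` would stay at `4096`. Dropping the incomplete tail keeps every training example the same length, at the cost of discarding fewer than one block of tokens.
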
---

## 🚀 Example Usage

```python
# Load the tokenizer and model directly from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VLSP2025-LegalSML/qwen3-4b-legal-pretrain")
model = AutoModelForCausalLM.from_pretrained("VLSP2025-LegalSML/qwen3-4b-legal-pretrain")
```

## 🧑‍💼 Maintainers

This model is developed and maintained by the VLSP 2025 LegalSLM Task Organizers.

For inquiries, please contact: **leanhcuong@tdtu.edu.vn**

## ⚠️ License & Usage

This model is released **for research purposes only** under the scope of the VLSP 2025 Evaluation Campaign. Any use outside the competition must comply with relevant data and model licensing agreements.