---
base_model: Qwen/Qwen3-4B
library_name: transformers
model_name: qwen3-4b-legal-pretrain
tags:
- generated_from_trainer
- sft
- trl
licence: license
extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Email (use the email registered for the VLSP competition): text
  Phone number (use the phone number registered for the VLSP competition): text
  Team Name: text
  Country: country
---
# 🧠 Vietnamese Legal Base Model - Qwen3-4B (Pretrained)

This model is a Vietnamese legal-domain base model continually pretrained from **Qwen3-4B** and adapted for legal text understanding and legal question answering.

---
## 📌 Overview
- **Base model**: Qwen3-4B
- **Domain**: Vietnamese legal language
- **Training objective**: Continual pretraining on legal-domain texts
---
## 📚 Training Data
The model was continually pretrained on a curated corpus of Vietnamese legal texts, including:

- Official legal documents (laws, codes, decrees, etc.)
- Legal news articles and commentary

## 📊 Dataset Statistics
The training corpus contains approximately **144,000 Vietnamese texts** in two categories:

- **~96,000 legal documents**: Official sources such as laws, decrees, and circulars.
- **~48,000 legal news articles**: Collected from online legal news portals, covering case studies, legal interpretations, and similar commentary.

## Training Configuration
The model was trained using full-parameter fine-tuning (no quantization or LoRA). Below is the training setup used for continual pretraining:

### 🔧 Model & Tokenization
- **Base model**: `Qwen/Qwen3-4B`
- **Maximum sequence length**: `4096`
- **Block size**: `4096`
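
The card does not say how the 4096-token blocks were built. A common approach in causal language model pretraining is to concatenate the tokenized documents and cut the stream into fixed-size blocks, dropping the short remainder. The sketch below illustrates that idea only; `group_texts` and the toy token ids are hypothetical and are not taken from the authors' training code.

```python
from itertools import chain

BLOCK_SIZE = 4096  # block size listed in this model card


def group_texts(tokenized_docs, block_size=BLOCK_SIZE):
    """Concatenate token-id lists and split them into fixed-size blocks.

    Assumption: the trailing remainder shorter than block_size is dropped,
    as is common in pretraining pipelines.
    """
    concatenated = list(chain.from_iterable(tokenized_docs))
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, usable, block_size)]


# Toy example with a tiny block size so the result is easy to inspect.
docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
blocks = group_texts(docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With packing of this kind, every training example has exactly `block_size` tokens, so no compute is spent on padding.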


All texts were collected from publicly available and legally permitted sources, then preprocessed to ensure quality and consistency for domain adaptation.

---
## 🚀 Example Usage
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VLSP2025-LegalSML/qwen3-4b-legal-pretrain")
model = AutoModelForCausalLM.from_pretrained("VLSP2025-LegalSML/qwen3-4b-legal-pretrain")
```
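
Since this is a base model rather than a chat-tuned one, plain text continuation is probably the most natural way to try it. The following is a minimal sketch under that assumption: the helper `complete` and the decoding settings in `GENERATION_KWARGS` are illustrative choices, not recommendations from the maintainers. The repository is gated, so you will likely need to be logged in with an approved Hugging Face account first (for example via `huggingface-cli login`).

```python
MODEL_ID = "VLSP2025-LegalSML/qwen3-4b-legal-pretrain"

# Illustrative greedy-decoding settings (an assumption, not from the model card).
GENERATION_KWARGS = {"max_new_tokens": 128, "do_sample": False}


def complete(prompt):
    """Return the model's continuation of `prompt` as plain text."""
    # Heavy imports are kept inside the function so that defining it
    # does not require loading torch or downloading the model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, **GENERATION_KWARGS)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `print(complete("Theo quy định của Bộ luật Dân sự năm 2015,"))` asks the model to continue a sentence about the 2015 Civil Code; the prompt is only an illustration.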

## 🧑‍💼 Maintainers
This model is developed and maintained by the VLSP 2025 LegalSLM Task Organizers.

For inquiries, please contact: **leanhcuong@tdtu.edu.vn**

## ⚠️ License & Usage
This model is released **for research purposes only** under the scope of the VLSP 2025 Evaluation Campaign. Any use outside the competition must comply with relevant data and model licensing agreements.