---
base_model: Qwen/Qwen3-4B
library_name: transformers
model_name: qwen3-4b-legal-pretrain
tags:
- generated_from_trainer
- sft
- trl
licence: license
extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Email (use the email registered for the VLSP competition): text
  Phone number (use the phone number registered for the VLSP competition): text
  Team Name: text
  Country: country
---
# 🧠 Vietnamese Legal Base Model - Qwen3-4B (Pretrained)

This model is a Vietnamese legal-domain base model continually pretrained from **Qwen3-4B** and adapted for legal text understanding and legal question answering.

---

## 📌 Overview

- **Base model**: `Qwen/Qwen3-4B`
- **Domain**: Vietnamese legal language
- **Training objective**: Continual pretraining on legal-domain texts

---

## 📚 Training Data

The model was continually pretrained on a curated corpus of Vietnamese legal texts, including:

- Official legal documents (laws, codes, decrees, etc.)
- Legal news articles and commentary

## 📊 Dataset Statistics

The training corpus contains approximately **144,000 Vietnamese texts** in two categories:

- **~96,000 legal documents**: Official sources such as laws, decrees, and circulars.
- **~48,000 legal news articles**: Collected from online legal news portals, featuring case studies and legal interpretations, among other content.

## Training Configuration

The model was trained using full-parameter fine-tuning (no quantization or LoRA). Below is the training setup used for continual pretraining:

### 🔧 Model & Tokenization

- **Base model**: `Qwen/Qwen3-4B`
- **Maximum sequence length**: `4096`
- **Block size**: `4096`

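
To illustrate what a block size of `4096` usually means in continual pretraining, here is a minimal sketch of packing tokenized documents into fixed-length blocks. This card does not describe how the blocks were actually built, so the concatenate-and-split strategy and the helper name `pack_into_blocks` are assumptions for illustration, not the organizers' pipeline.

```python
from itertools import chain


def pack_into_blocks(tokenized_docs, block_size=4096):
    """Concatenate token-id lists and split them into equal-sized blocks.

    Hypothetical helper: it mirrors a common packing strategy for causal
    language modeling, not necessarily the one used to train this model.
    """
    all_ids = list(chain.from_iterable(tokenized_docs))
    # Drop the trailing remainder that does not fill a whole block.
    usable = (len(all_ids) // block_size) * block_size
    return [all_ids[i:i + block_size] for i in range(0, usable, block_size)]


# Toy example with a tiny block size so the result is easy to inspect.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_into_blocks(docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With this strategy every training example has exactly `block_size` tokens, and a single block may span several short documents.
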

All texts were collected from publicly available and legally permitted sources, then preprocessed to ensure quality and consistency for domain adaptation.

---

## 🚀 Example Usage

```python
# Load model directly.
# The repository is gated: accept the access terms on the Hub and
# authenticate (for example with `huggingface-cli login`) before loading.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VLSP2025-LegalSML/qwen3-4b-legal-pretrain")
model = AutoModelForCausalLM.from_pretrained("VLSP2025-LegalSML/qwen3-4b-legal-pretrain")
```

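
Because this is a base model rather than a chat model, it is normally prompted with plain text to continue. The following is a minimal sketch of greedy text completion with it. The function name `generate_continuation`, the default of 64 new tokens, and the choice of greedy decoding are illustrative assumptions, not settings recommended by the maintainers.

```python
MODEL_ID = "VLSP2025-LegalSML/qwen3-4b-legal-pretrain"


def generate_continuation(prompt, model_id=MODEL_ID, max_new_tokens=64):
    """Greedy-decode a plain-text continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so that defining the function does not load torch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # torch_dtype="auto" keeps the dtype stored in the checkpoint.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_continuation("Theo quy định của Bộ luật Dân sự năm 2015,")` (an example prompt chosen for illustration) downloads the weights on first use and returns the model's continuation of the sentence.
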
## 🧑‍💼 Maintainers

This model is developed and maintained by the VLSP 2025 LegalSLM Task Organizers.

For inquiries, please contact: **leanhcuong@tdtu.edu.vn**

## ⚠️ License & Usage

This model is released **for research purposes only** under the scope of the VLSP 2025 Evaluation Campaign. Any use outside the competition must comply with relevant data and model licensing agreements.