VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain

ModelHub XC f5f7f6c22f Initialize project; model provided by the ModelHub XC community

Model: VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain
Source: Original Platform

2026-06-02 13:43:52 +08:00

.gitattributes
added_tokens.json
chat_template.jinja
config.json
generation_config.json
merges.txt
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
vocab.json

README.md

🧠 Vietnamese Legal Base Model - Qwen3-1.7B (Pretrained)

This is a Vietnamese legal-domain base model, continually pretrained from Qwen3-1.7B and adapted for legal text understanding and legal question answering.

📌 Overview

Base model: Qwen3-1.7B
Domain: Vietnamese legal language
Training objective: Continual pretraining on legal-domain texts

📚 Training Data

The model was continually pretrained on a curated corpus of Vietnamese legal texts, including:

Official legal documents (laws, codes, decrees, etc.)
Legal news articles and commentary

📊 Dataset Statistics

The training corpus contains approximately 144,000 Vietnamese texts in two categories:

~96,000 legal documents: Official sources such as laws, decrees, circulars, etc.
~48,000 legal news articles: Collected from online legal news portals, featuring case studies, legal interpretations, etc.
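
As a quick check on the figures above, the two categories can be summed and expressed as shares of the corpus. This is only arithmetic on the approximate counts quoted in this card; the dictionary keys are illustrative names, not identifiers from the actual dataset.

```python
# Approximate document counts quoted in the dataset statistics above.
corpus_counts = {
    "legal_documents": 96_000,
    "legal_news_articles": 48_000,
}

total = sum(corpus_counts.values())

# Share of each category in the corpus, by number of texts (not by tokens).
shares = {name: count / total for name, count in corpus_counts.items()}

for name, share in shares.items():
    print(f"{name}: {corpus_counts[name]:,} texts ({share:.1%})")
print(f"total: {total:,} texts")
```

By text count, official legal documents make up about two thirds of the corpus and legal news about one third. The split by tokens may differ, since the card does not report document lengths.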

Training Configuration

All model parameters were updated during training (no quantization or LoRA). The setup used for continual pretraining is listed below:

🔧 Model & Tokenization

Base model: Qwen/Qwen3-1.7B
Maximum sequence length: 4096
Block size: 4096
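
A block size of 4096 usually means that tokenized documents are concatenated and cut into fixed-length chunks before training. The card does not describe the actual preprocessing, so the following is a minimal sketch of that common packing step under that assumption; the function name `group_into_blocks` is hypothetical.

```python
from itertools import chain
from typing import Dict, List

BLOCK_SIZE = 4096  # block size stated in this card


def group_into_blocks(
    token_id_lists: List[List[int]], block_size: int = BLOCK_SIZE
) -> Dict[str, List[List[int]]]:
    """Concatenate tokenized documents and cut them into fixed-size blocks."""
    concatenated = list(chain.from_iterable(token_id_lists))
    # Drop the trailing remainder so every block has exactly block_size tokens.
    usable_length = (len(concatenated) // block_size) * block_size
    blocks = [
        concatenated[start:start + block_size]
        for start in range(0, usable_length, block_size)
    ]
    # Labels are a copy of the inputs; Hugging Face causal LM models
    # shift them by one position internally when computing the loss.
    return {"input_ids": blocks, "labels": [list(block) for block in blocks]}


# Tiny demonstration with a block size of 4 instead of 4096.
demo = group_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(demo["input_ids"])
```

In the demonstration, nine tokens yield two full blocks and the ninth token is discarded. Whether the original pipeline dropped or padded the remainder is not stated in the card.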

All texts were collected from publicly available and legally permitted sources, then preprocessed to ensure quality and consistency for domain adaptation.

🚀 Example Usage

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain")
model = AutoModelForCausalLM.from_pretrained("VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain")
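
Because this is a base model rather than a chat model, one plausible way to use it is plain text continuation. The sketch below assumes `transformers` and PyTorch are installed and that you have access to the repository; the helper names, the decoding settings, and the sample prompt are illustrative choices, not values taken from this card.

```python
MODEL_ID = "VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain"


def generation_kwargs(max_new_tokens: int = 64) -> dict:
    """Greedy decoding settings (illustrative, not from the model card)."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Return the model's continuation of `prompt`, without the prompt itself."""
    # Imported here so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **generation_kwargs(max_new_tokens))

    # Slice off the prompt tokens and decode only the newly generated ones.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the model weights on first use):
# print(complete("Điều 1. Phạm vi điều chỉnh\n"))
```

The sample prompt is a hypothetical opening line of a Vietnamese legal article ("Article 1. Scope of regulation"). Greedy decoding is used only to make the output deterministic; sampling settings may suit other uses better.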

🧑‍💼 Maintainers

This model is developed and maintained by the VLSP 2025 LegalSLM Task Organizers.

For inquiries, please contact: leanhcuong@tdtu.edu.vn

⚠️ License & Usage

This model is released for research purposes only under the scope of the VLSP 2025 Evaluation Campaign. Any use outside the competition must comply with relevant data and model licensing agreements.