Initialize project; model provided by the ModelHub XC community

Model: VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain Source: Original Platform
2026-06-02 13:43:52 +08:00
commit f5f7f6c22f
14 changed files with 152265 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,64 @@
+---
+base_model: Qwen/Qwen3-1.7B-Base
+library_name: transformers
+model_name: qwen3-1.7b-legal-pretrain
+tags:
+- generated_from_trainer
+- sft
+- trl
+licence: license
+extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects."
+extra_gated_fields:
+  Email (use the email registered for the VLSP competition): text
+  Phone number (use the phone number registered for the VLSP competition): text
+  Team Name: text
+  Country: country
+---
+# 🧠 Vietnamese Legal Base Model - Qwen3-1.7B (Pretrained)
+
+This model is a Vietnamese legal-domain base model, continually pretrained from **Qwen3-1.7B** and adapted for legal text understanding and legal question answering.
+
+---
+## 📌 Overview
+- **Base model**: Qwen3-1.7B
+- **Domain**: Vietnamese legal language
+- **Training objective**: Continual pretraining on legal-domain texts
+---
+## 📚 Training Data
+The model was continually pretrained on a curated corpus of Vietnamese legal texts, including:
+
+- Official legal documents (laws, codes, decrees, etc.)
+- Legal news articles and commentary
+
+## 📊 Dataset Statistics
+The training corpus contains approximately **144,000 Vietnamese texts**, split into two categories:
+
+- **~96,000 legal documents**: Official sources such as laws, decrees, circulars, etc.
+- **~48,000 legal news articles**: Collected from online legal news portals, featuring case studies, legal interpretations, etc.
+
+## Training Configuration
+The model was trained using full-parameter fine-tuning (no quantization or LoRA). Below is the training setup used for continual pretraining:
+
+### 🔧 Model & Tokenization
+- **Base model**: `Qwen/Qwen3-1.7B`
+- **Maximum sequence length**: `4096`
+- **Block size**: `4096`
+
+All training texts were collected from publicly available and legally permitted sources, then preprocessed to ensure quality and consistency for domain adaptation.
+
+---
+## 🚀 Example Usage
+```python
+# Load model directly
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+tokenizer = AutoTokenizer.from_pretrained("VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain")
+model = AutoModelForCausalLM.from_pretrained("VLSP2025-LegalSML/qwen3-1.7b-legal-pretrain")
+```
+
+## 🧑‍💼 Maintainers
+This model is developed and maintained by the VLSP 2025 LegalSLM Task Organizers.
+
+For inquiries, please contact: **leanhcuong@tdtu.edu.vn**
+
+## ⚠️ License & Usage
+This model is released **for research purposes only** within the scope of the VLSP 2025 Evaluation Campaign. Any use outside the competition must comply with the relevant data and model licensing agreements.