Model: bqbbao6/vietnamese-legal-embedding
Source: Original Platform
2026-05-28 02:20:16 +08:00

language: vi
base_model: intfloat/multilingual-e5-base
pipeline_tag: sentence-similarity

Vietnamese Legal Embedding

Model: bqbbao6/vietnamese-legal-embedding
Base model: intfloat/multilingual-e5-base


Model Description

vietnamese-legal-embedding is a text embedding model fine-tuned for Vietnamese legal document retrieval. Built on top of multilingual-e5-base, this model is optimized for semantic search and Retrieval-Augmented Generation (RAG) systems in the Vietnamese legal domain.

The model learns to map legal queries to their relevant legal passages, making it suitable for retrieving precise legal articles and regulations in response to user questions.


Model Details

| Property | Value |
| --- | --- |
| Base Model | intfloat/multilingual-e5-base |
| Language | Vietnamese |
| Max Sequence Length | 512 tokens |
| Embedding Dimension | 768 |
| Similarity Function | Cosine Similarity |
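
As a rough illustration of the similarity function listed above, the following sketch computes cosine similarity in plain Python and shows that, once vectors are L2-normalized, it reduces to a dot product. The helper names `l2_normalize` and `cosine_similarity` and the 3-dimensional toy vectors are assumptions made for this example; real embeddings from this model have 768 dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for illustration)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for 768-dimensional embeddings.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

cos = cosine_similarity(a, b)
# After normalization, the plain dot product gives the same value.
dot_of_normalized = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

print(round(cos, 6), round(dot_of_normalized, 6))
```

This is why encoding with `normalize_embeddings=True` lets a vector store use inner-product search and still rank by cosine similarity.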

Usage

First install the Sentence Transformers library:

pip install -U sentence-transformers

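Then load the model and run inference. The base model follows the E5 prefix convention, so prepend "query: " to queries and "passage: " to passages.

Example 1: encode a single query

from sentence_transformers import SentenceTransformer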

model = SentenceTransformer(
    "bqbbao6/VN_legal_embedding_512",
    trust_remote_code=True
)
query = "query: " + "Người lao động được nghỉ bao nhiêu ngày phép mỗi năm?"
embedding = model.encode(query)

print(f"Embedding shape: {embedding.shape}")  # (768,)

Example 2: score a query against a small corpus

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "bqbbao6/VN_legal_embedding_512",
    trust_remote_code=True
)

query = "query: " + "Người lao động được nghỉ bao nhiêu ngày phép năm?"

corpus = [
    # Correct answer
    "Người lao động làm việc đủ 12 tháng được nghỉ 12 ngày phép năm có hưởng lương.",

    # Same subject, different content
    "Người lao động được hưởng chế độ bảo hiểm xã hội theo quy định của pháp luật.",

    # Same predicate, different subject
    "Cán bộ công chức được nghỉ 12 ngày phép năm theo quy định.",

    # Partially similar, but about a different group of people
    "Người lao động chưa đủ 12 tháng được nghỉ phép theo tỷ lệ tương ứng.",

    # Completely unrelated
    "Doanh nghiệp phải đóng thuế thu nhập doanh nghiệp hàng năm.",
]

corpus_prefixed = ["passage: " + p for p in corpus]

q_emb = model.encode(query, normalize_embeddings=True)
c_emb = model.encode(corpus_prefixed, normalize_embeddings=True)

scores = cos_sim(q_emb, c_emb)[0]
for i, (score, passage) in enumerate(zip(scores, corpus)):
    print(f"[{i+1}] Score: {score:.4f} | {passage}")
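
In a retrieval or RAG pipeline, the scores from Example 2 would typically be sorted so that only the best few passages are passed on. The sketch below shows one way to do that in plain Python. The helpers `with_prefix` and `top_k` are hypothetical names introduced for this illustration, and the scores are placeholder numbers, not output from the model.

```python
def with_prefix(texts, kind):
    """Prepend the E5-style prefix ('query' or 'passage') to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


def top_k(scores, passages, k=3):
    """Return the k (score, passage) pairs with the highest scores."""
    if len(scores) != len(passages):
        raise ValueError("scores and passages must have the same length")
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


# Placeholder values for illustration only; real scores come from cos_sim.
demo_passages = ["passage A", "passage B", "passage C", "passage D", "passage E"]
demo_scores = [0.91, 0.40, 0.75, 0.62, 0.10]

for score, passage in top_k(demo_scores, demo_passages, k=3):
    print(f"{score:.2f} | {passage}")

print(with_prefix(["example text"], "passage"))
```

With tensor scores from `cos_sim`, calling `.tolist()` on the score row first gives the plain floats this helper expects.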

Training Data

The model was fine-tuned on a dataset of 250,000 triplets (query, positive passage, hard negative) in the Vietnamese legal domain, covering various legal fields including civil law, criminal law, labor law, and administrative law.

  • All texts were tokenized using pyvi for Vietnamese word segmentation.
  • Hard negatives were mined using BM25 to ensure challenging training examples.
  • Loss function: CachedMultipleNegativesRankingLoss
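
The idea behind BM25 hard-negative mining can be sketched as follows: score every passage against the query with BM25, then keep the highest-scoring passages that are not the labeled positive. This is a minimal illustration, not the authors' pipeline. The function names, the parameters `k1=1.5` and `b=0.75`, the whitespace tokenization (standing in for pyvi word segmentation), and the English toy corpus are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def mine_hard_negatives(query, positive_idx, corpus, n_neg=1):
    """Indices of the top BM25 matches that are not the labeled positive."""
    tokenized = [doc.lower().split() for doc in corpus]  # stand-in for pyvi
    scores = bm25_scores(query.lower().split(), tokenized)
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    candidates = [i for i in order if i != positive_idx and scores[i] > 0]
    return candidates[:n_neg]


# Toy English corpus for illustration; index 0 is the labeled positive.
toy_corpus = [
    "employees with twelve months of service get twelve days of annual leave",
    "employees are entitled to social insurance benefits",
    "civil servants get twelve days of annual leave",
    "companies must pay corporate income tax every year",
]
toy_query = "how many days of annual leave do employees get"

hard_negatives = mine_hard_negatives(toy_query, positive_idx=0, corpus=toy_corpus)
print(hard_negatives)
```

The passage about civil servants shares the most query terms without being the correct answer, which is the kind of lexically similar distractor that makes a triplet challenging.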

Evaluation

The model was evaluated on GreenNode/zalo-ai-legal-text-retrieval-vn and compared against the base model and a Vietnamese-specific embedding model.

| Metric | bqbbao6/vietnamese-legal-embedding | intfloat/multilingual-e5-base | bkai-foundation-models/vietnamese-bi-encoder |
| --- | --- | --- | --- |
| NDCG@10 | 0.8059 | 0.6030 | 0.6160 |
| MRR@10 | 0.7543 | 0.5482 | 0.5579 |
| MAP@10 | 0.7546 | 0.5491 | 0.5588 |
| Recall@1 | 0.6269 | 0.4467 | 0.4442 |
| Recall@5 | 0.9124 | 0.6916 | 0.7170 |
| Recall@10 | 0.9613 | 0.7722 | 0.7951 |
| Precision@1 | 0.6282 | 0.4480 | 0.4454 |
| Hit Rate@10 | 0.9632 | 0.7728 | 0.7970 |

The fine-tuned model outperforms both the base model and the Vietnamese-specific bi-encoder on every reported metric. NDCG@10 rises from 0.6030 for the base model to 0.8059, a gain of about 20 points, which indicates that domain-specific fine-tuning is effective for Vietnamese legal retrieval.
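
For readers unfamiliar with the reported metrics, the sketch below computes per-query reciprocal rank, Recall@k, and NDCG@k under binary relevance; averaging these values over all queries gives MRR@k, mean Recall@k, and mean NDCG@k. The evaluation tooling behind the numbers above is not stated here, so these are the standard textbook definitions rather than the exact script used, and the function names and document IDs are hypothetical.

```python
import math


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking for one query: the only relevant document sits at rank 2.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1"}

rr = reciprocal_rank_at_k(ranked, relevant)
rec1 = recall_at_k(ranked, relevant, k=1)
rec10 = recall_at_k(ranked, relevant, k=10)
ndcg = ndcg_at_k(ranked, relevant)

print(rr, rec1, rec10, round(ndcg, 4))
```

Because NDCG discounts hits by rank, moving the relevant passage from rank 2 to rank 1 in this toy case would raise NDCG@10 from about 0.63 to 1.0, while Recall@10 would stay unchanged.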


Citation


@inproceedings{10.1007/978-981-95-1746-6_17,
  address = {Singapore},
  author = {Pham, Bao Loc and Hoang, Quoc Viet and Luu, Quy Tung and Vo, Trong Thu},
  booktitle = {Proceedings of the Fifth International Conference on Intelligent Systems and Networks},
  isbn = {978-981-95-1746-6},
  pages = {153--163},
  publisher = {Springer Nature Singapore},
  title = {GN-TRVN: A Benchmark for Vietnamese Table Markdown Retrieval Task},
  year = {2026},
}


@article{enevoldsen2025mmtebmassivemultilingualtext,
  title={MMTEB: Massive Multilingual Text Embedding Benchmark},
  author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2502.13595},
  year={2025},
  url={https://arxiv.org/abs/2502.13595},
  doi = {10.48550/arXiv.2502.13595},
}

@article{muennighoff2022mteb,
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},
  year = {2022},
  url = {https://arxiv.org/abs/2210.07316},
  doi = {10.48550/ARXIV.2210.07316},
}