language: multilingual
license: mit
library_name: transformers
pipeline_tag: sentence-similarity
tags: embeddings, text-embeddings-inference, bge, bge-m3, multilingual, retrieval, semantic-search, xlm-roberta, custom-tokenizer, long-context
base_model: BAAI/bge-m3
model_type: xlm-roberta

BGE-M3 Custom Tokenizer (8.5K Vocab)

A customized version of BAAI/bge-m3 with a newly trained tokenizer optimized for domain-specific multilingual retrieval workloads.

This model replaces the original XLM-R tokenizer with a compact tokenizer (8,500-token vocabulary) trained on a custom corpus.

Highlights

  • Based on BAAI/bge-m3
  • Custom tokenizer trained from scratch
  • Reduced vocabulary size: 8500
  • Long-context support: 8192 tokens
  • Multilingual retrieval and embedding model
  • Optimized for:
    • semantic search
    • RAG pipelines
    • dense retrieval
    • domain-specific embeddings
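
As a rough illustration of the dense-retrieval use case listed above, the sketch below ranks documents by cosine similarity to a query vector. The toy 3-dimensional vectors are placeholders for real model embeddings, and the helper names `cosine_similarity` and `rank_documents` are hypothetical, not part of this model or of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings produced by the model.
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially aligned
]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.4f}")
```

In a real pipeline the vectors would come from encoding the query and documents with the model, and a vector index would typically replace the linear scan.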

Model Details

Base Model

  • Architecture: XLM-RoBERTa
  • Original model: BAAI/bge-m3
  • Embedding dimension: 1024
  • Transformer encoder model
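
To put the reduced vocabulary in perspective, the arithmetic below estimates the size of the token-embedding matrix before and after the tokenizer swap. It is a back-of-the-envelope sketch that assumes one 1024-dimensional row per vocabulary entry and assumes the embedding matrix was resized to match the new tokenizer, which this README does not state. The figure 250,002 is the vocabulary size of the original XLM-RoBERTa tokenizer used by BAAI/bge-m3.

```python
HIDDEN_SIZE = 1024        # embedding dimension listed above
ORIGINAL_VOCAB = 250_002  # XLM-RoBERTa / BAAI/bge-m3 tokenizer vocabulary
CUSTOM_VOCAB = 8_500      # vocabulary size of the custom tokenizer


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


original_params = embedding_params(ORIGINAL_VOCAB)
custom_params = embedding_params(CUSTOM_VOCAB)

print(f"original: {original_params:,} parameters")
print(f"custom:   {custom_params:,} parameters")
print(f"ratio:    {original_params / custom_params:.1f}x")
```

Under these assumptions the token-embedding matrix shrinks from about 256M to about 8.7M weights; the transformer layers themselves are unaffected.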

Tokenizer

The original tokenizer was replaced with a newly trained tokenizer using:

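from transformers import AutoTokenizer

# Assumption: base_tokenizer is the original BAAI/bge-m3 tokenizer.
base_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
# batch_iterator() is expected to yield batches (lists of strings) from the training corpus.
tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=8500,
    min_frequency=2,
)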
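
The snippet above relies on a `batch_iterator()` helper that this README does not show. A minimal sketch of what such a generator might look like follows; the in-memory `corpus` list and the `batch_size` default are placeholders, since the actual training corpus and batching are not described here.

```python
from typing import Iterator, List

# Hypothetical stand-in corpus; the real training data is not described in this README.
corpus: List[str] = [
    "semantic search over product manuals",
    "dense retrieval for RAG pipelines",
    "multilingual sentence embeddings",
    "custom tokenizer with a small vocabulary",
    "long-context document encoding",
]


def batch_iterator(batch_size: int = 2) -> Iterator[List[str]]:
    """Yield consecutive slices of `corpus`, each holding at most `batch_size` texts."""
    for start in range(0, len(corpus), batch_size):
        yield corpus[start:start + batch_size]


batches = list(batch_iterator())
print(f"{len(batches)} batches, sizes {[len(b) for b in batches]}")
```

Yielding batches rather than single strings matches what `train_new_from_iterator` expects as its first argument. For a corpus too large for memory, the same pattern could read slices from disk or from a streaming dataset instead of a list.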

Model synced from source: Adc05102002/bge-m3-vi-base