Model: Adc05102002/bge-m3-vi-base Source: Original Platform
| language | license | library_name | pipeline_tag | tags | base_model | model_type |
|---|---|---|---|---|---|---|
| | mit | transformers | sentence-similarity | | BAAI/bge-m3 | xlm-roberta |
BGE-M3 Custom Tokenizer (8.5K Vocab)
A customized version of BAAI/bge-m3 with a newly trained tokenizer optimized for domain-specific multilingual retrieval workloads.
This model replaces the original XLM-R tokenizer vocabulary with a compact 8.5K-token tokenizer trained on a custom corpus.
Highlights
- Based on BAAI/bge-m3
- Custom tokenizer trained from scratch
- Reduced vocabulary size: 8500
- Long-context support: 8192 tokens
- Multilingual retrieval and embedding model
- Optimized for:
- semantic search
- RAG pipelines
- dense retrieval
- domain-specific embeddings
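For the semantic search and dense retrieval use cases listed above, a minimal sketch could look like the following. The `embed`, `cosine_similarity`, and `rank_documents` helpers are hypothetical names introduced for illustration, and the use of CLS pooling is an assumption carried over from how BAAI/bge-m3 produces dense embeddings; this card does not confirm the pooling strategy for this checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def embed(texts, model_id="Adc05102002/bge-m3-vi-base", max_length=8192):
    """Encode texts into dense vectors (one list of floats per text).

    Assumption: the first-token (CLS) hidden state is used as the embedding.
    Requires `torch` and `transformers`, and downloads the model on first use.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0].tolist()  # first token of each sequence
```

In a retrieval setting, `embed` would be called once on the query and once on the candidate documents, and `rank_documents` would order the candidates by similarity.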
Model Details
Base Model
- Architecture: XLM-RoBERTa
- Original model: BAAI/bge-m3
- Embedding dimension: 1024
- Transformer encoder model
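To give a rough sense of what the smaller vocabulary means for model size, the sketch below compares the number of token-embedding parameters at the two vocabulary sizes. It assumes the token-embedding matrix is resized to match the new tokenizer, which this card does not state, and it assumes an original vocabulary size of 250,002 for BAAI/bge-m3; both should be checked against the actual model configs.

```python
HIDDEN_SIZE = 1024             # embedding dimension stated in this card
NEW_VOCAB_SIZE = 8500          # vocabulary size of the custom tokenizer
ORIGINAL_VOCAB_SIZE = 250_002  # assumption: vocab_size of BAAI/bge-m3


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a vocab_size x hidden_size token-embedding matrix."""
    return vocab_size * hidden_size


new_params = embedding_params(NEW_VOCAB_SIZE)
original_params = embedding_params(ORIGINAL_VOCAB_SIZE)

print(f"custom tokenizer:   {new_params:,} embedding parameters")
print(f"original tokenizer: {original_params:,} embedding parameters")
```

Only the token-embedding matrix is counted here; the transformer encoder layers are unaffected by the vocabulary size.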
Tokenizer
The original tokenizer was replaced with a newly trained tokenizer using:
tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=8500,
    min_frequency=2,
)
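The snippet above does not show where `base_tokenizer` and `batch_iterator` come from. One possible way to define them is sketched below; the `texts` argument, the batch size of 1000, and the `train_tokenizer` wrapper are hypothetical choices for illustration, not the author's actual setup, and this version of `batch_iterator` takes the corpus as an argument rather than being called with none.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive lists of at most `batch_size` texts from the corpus."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def train_tokenizer(texts, vocab_size=8500, base_model="BAAI/bge-m3"):
    """Train a new tokenizer with the same pipeline as the base tokenizer.

    Requires `transformers` and downloads the base tokenizer on first use.
    """
    from transformers import AutoTokenizer

    base_tokenizer = AutoTokenizer.from_pretrained(base_model)
    return base_tokenizer.train_new_from_iterator(
        batch_iterator(texts),
        vocab_size=vocab_size,
    )
```

`train_new_from_iterator` expects an iterator over batches of strings, which is what `batch_iterator` yields. Extra keyword arguments are forwarded to the underlying trainer from the Tokenizers library, so which ones are accepted depends on the tokenizer type.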
Description