---
language:
- en
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- transformers
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
---
<br><br>
<p align="center">
<b>The crispy, lightweight ColBERT family from <a href="https://mixedbread.com"><b>Mixedbread</b></a>.</b>
</p>
<p align="center">
<sup> 🍞 Looking for a simple end-to-end retrieval solution? Meet <a href="https://mixedbread.com">Mixedbread Search</a>, our multi-modal and multi-lingual search solution.</sup>
</p>
# mxbai-edge-colbert-v0-32m
This model is a lightweight, 32-million-parameter ColBERT with a projection dimension of 64. It is built on top of [Ettin-32M](https://huggingface.co/jhu-clsp/ettin-encoder-32m), meaning it benefits from all of ModernBERT's architectural efficiencies. Despite this extreme efficiency, it is the best-performing "edge-sized" retriever, outperforming ColBERTv2 and many models with over 10 times more parameters. It can create multi-vector representations for documents of up to 32,000 tokens and is fully compatible with the [PyLate](https://github.com/lightonai/pylate) library.
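For readers new to multi-vector retrieval, the sketch below illustrates the standard ColBERT late-interaction score (MaxSim): each query token vector is matched against its most similar document token vector, and those maxima are summed. It is a toy illustration in plain Python with made-up 2-dimensional vectors (this model projects to 64 dimensions). The helper names `normalize` and `maxsim_score` are hypothetical and not part of PyLate.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def maxsim_score(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""
    score = 0.0
    for q in query_vectors:
        score += max(sum(a * b for a, b in zip(q, d)) for d in document_vectors)
    return score


# Toy token embeddings (illustrative values only, not real model output).
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
doc_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
doc_b = [normalize(v) for v in [[1.0, 1.0]]]

score_a = maxsim_score(query, doc_a)  # every query token finds an exact match
score_b = maxsim_score(query, doc_b)  # every query token only finds a partial match
```

Here `doc_a` scores higher than `doc_b` because each query token has an exact counterpart among its token vectors, which is the behaviour that lets multi-vector models reward fine-grained term matches.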
## Usage
To use this model, you first need to install PyLate, either via uv:
```bash
# uv
uv add pylate
# uv + pip
uv pip install pylate
```
or via pip:
```bash
# pip
pip install -U pylate
```
Once installed, the model is immediately ready to use to generate representations and index documents:
```python
from pylate import indexes, models, retrieve

# Step 1: Load the model
model = models.ColBERT(
    model_name_or_path="mixedbread-ai/mxbai-edge-colbert-v0-32m",
)

# Step 2: Initialize an index (here, PLAID, for larger document collections)
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode your documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]
documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
That's all you need to do to encode a full collection! Your documents are indexed and ready to be queried:
```python
# Step 5: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 6: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 7: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
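The retrieval call returns one ranked list per query. The sketch below shows one way to post-process such results, assuming each match is a dict with `id` and `score` keys, as in PyLate's documentation examples. The sample values and the helper name `top_ids` are made up for illustration.

```python
# Stand-in for the value returned by `retriever.retrieve(...)`.
# Assumption: one list per query, each match a dict with "id" and "score" keys.
sample_scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 7.5}],
    [{"id": "1", "score": 10.8}, {"id": "2", "score": 6.1}],
]


def top_ids(results, k=1):
    """Return the ids of the k highest-scoring matches for each query."""
    ranked = []
    for matches in results:
        ordered = sorted(matches, key=lambda m: m["score"], reverse=True)
        ranked.append([m["id"] for m in ordered[:k]])
    return ranked


best = top_ids(sample_scores, k=1)
```

With real output, `sample_scores` would simply be replaced by the `scores` variable from the retrieval step.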
### Reranking
Thanks to its extreme parameter efficiency, this model is particularly well-suited to being used as a re-ranker following an even more lightweight first-stage retrieval, such as static embedding models. Re-ranking is just as straightforward:
```python
from pylate import rank, models

# Load the model
model = models.ColBERT(
    model_name_or_path="mixedbread-ai/mxbai-edge-colbert-v0-32m",
)

# Define queries and documents
queries = [
    "query A",
    "query B",
]
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Embed them
queries_embeddings = model.encode(
    queries,
    is_query=True,
)
documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Perform reranking
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
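In a two-stage setup, the first-stage retriever supplies a candidate list per query, and those candidates have to be arranged into the nested `documents` and `documents_ids` lists used above. A minimal sketch in plain Python, where the corpus, the first-stage output and the helper name `build_rerank_inputs` are all hypothetical:

```python
# Hypothetical corpus and first-stage output (e.g. from a static embedding model).
corpus = {
    1: "document A",
    2: "document B",
    3: "document C",
}
first_stage_candidates = [
    [1, 2],     # candidate ids for "query A"
    [1, 3, 2],  # candidate ids for "query B"
]


def build_rerank_inputs(candidates, corpus):
    """Turn per-query candidate ids into parallel nested lists of ids and texts."""
    documents_ids = [list(ids) for ids in candidates]
    documents = [[corpus[doc_id] for doc_id in ids] for ids in candidates]
    return documents_ids, documents


documents_ids, documents = build_rerank_inputs(first_stage_candidates, corpus)
```

The resulting `documents` can then be encoded with `is_query=False` and passed to `rank.rerank` together with `documents_ids`, as in the example above.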
## Evaluation
### **Results on BEIR**
| Model | AVG | MS MARCO | SciFact | Touche | FiQA | TREC-COVID | NQ | DBPedia |
| :---------------------------- | :-------: | :-------: | :-------: | :-------: | :-------: | :--------: | :-------: | :-------: |
| **Large Models (>100M)** | | | | | | | | |
| GTE-ModernColBERT-v1 | **0.547** | 0.453 | **0.763** | **0.312** | **0.453** | **0.836** | **0.618** | **0.480** |
| ColBERTv2 | 0.488 | **0.456** | 0.693 | 0.263 | 0.356 | 0.733 | 0.562 | 0.446 |
| **Medium Models (<35M)** | | | | | | | | |
| **mxbai-edge-colbert-v0-32m** | 0.521 | **0.450** | **0.740** | **0.313** | 0.390 | 0.775 | **0.600** | 0.455 |
| answerai-colbert-small-v1 | **0.534** | 0.434 | **0.740** | 0.250 | **0.410** | **0.831** | 0.594 | **0.464** |
| bge-small-en-v1.5 | 0.517 | 0.408 | 0.713 | 0.260 | 0.403 | 0.759 | 0.502 | 0.400 |
| snowflake-s | 0.519 | 0.402 | 0.722 | 0.235 | 0.407 | 0.801 | 0.509 | 0.410 |
| **Small Models (<25M)** | | | | | | | | |
| mxbai-edge-colbert-v0-17m | **0.490** | **0.416** | **0.719** | **0.316** | 0.326 | **0.713** | **0.551** | **0.410** |
| colbert-muvera-micro | 0.394 | 0.364 | 0.662 | 0.251 | 0.254 | 0.561 | 0.386 | 0.332 |
| all-MiniLM-L6-v2 | 0.419 | 0.365 | 0.645 | 0.169 | **0.369** | 0.472 | 0.439 | 0.323 |
### **Results on LongEmbed**
| Model | AVG |
| :-------------------------------------------- | :-------: |
| **Large Models (&gt;100M)** | |
| GTE-ModernColBERT-v1 (32k) | **0.898** |
| GTE-ModernColBERT-v1 (4k) | 0.809 |
| granite-embedding-english-r2 | 0.656 |
| ColBERTv2 | 0.428 |
| **Medium Models (&lt;50M)** | |
| **mxbai-edge-colbert-v0-32m (32k)** | **0.849** |
| **mxbai-edge-colbert-v0-32m (4k)** | 0.783 |
| granite-embedding-small-english-r2 | 0.637 |
| answerai-colbert-small-v1 | 0.441 |
| bge-small-en-v1.5 | 0.312 |
| snowflake-arctic-embed-s | 0.356 |
| **Small Models (&lt;25M)** | |
| mxbai-edge-colbert-v0-17m (32k) | **0.847** |
| mxbai-edge-colbert-v0-17m (4k) | 0.776 |
| all-MiniLM-L6-v2 | 0.298 |
| colbert-muvera-micro | 0.405 |
For more details on evaluations, please read our [Tech Report](https://mixedbread.com/papers/small_colbert_report.pdf).
## Community
Please join our [Discord Community](https://discord.gg/j5dWb3Qkm9) and share your feedback and thoughts! We are here to help and are always happy to chat.
## License
Apache 2.0
## Citation
If you use our model, please cite the associated tech report:
```bibtex
@misc{takehi2025fantasticsmallretrieverstrain,
  title={Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report},
  author={Rikiya Takehi and Benjamin Clavié and Sean Lee and Aamir Shakir},
  year={2025},
  eprint={2510.14880},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2510.14880},
}
```
If you specifically use its projection heads, or discuss their effect, please cite our report on using different projections for ColBERT models:
```bibtex
@misc{clavie2025simpleprojectionvariantsimprove,
  title={Simple Projection Variants Improve ColBERT Performance},
  author={Benjamin Clavié and Sean Lee and Rikiya Takehi and Aamir Shakir and Makoto P. Kato},
  year={2025},
  eprint={2510.12327},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2510.12327},
}
```
Finally, if you use PyLate in your work, please cite PyLate itself:
```bibtex
@misc{PyLate,
  title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
  author={Chaffin, Antoine and Sourty, Raphaël},
  url={https://github.com/lightonai/pylate},
  year={2024}
}
```