---
library_name: transformers
tags:
- reranker
- transformers.js
- sentence-transformers
license: apache-2.0
language:
- en
pipeline_tag: text-ranking
---
<br><br>
<p align="center">
<b>The crispy rerank family from <a href="https://mixedbread.ai"><b>Mixedbread</b></a>.</b>
</p>
<p align="center">
<sup> 🍞 Looking for a simple end-to-end retrieval solution? Meet Omni, our multimodal and multilingual model. <a href="https://mixedbread.com"><b>Get in touch for access.</b></a> </sup>
</p>
# mxbai-rerank-large-v1
This is the largest model in our family of powerful reranker models. You can learn more about the models in our [blog post](https://www.mixedbread.ai/blog/mxbai-rerank-v1).
We have three models:
- [mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1)
- [mxbai-rerank-base-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1)
- [mxbai-rerank-large-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1) (🍞)
## Quickstart
Currently, the best way to use our models is with the most recent version of sentence-transformers.
`pip install -U sentence-transformers`
Let's say you have a query, and you want to rerank a set of documents. You can do that with only one line of code:
```python
from sentence_transformers import CrossEncoder
# Load the model; here we use our large model
model = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")
# Example query and documents
query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
"'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
"The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
"Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
"Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
"The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
"'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]
# Let's get the scores
results = model.rank(query, documents, return_documents=True, top_k=3)
```
<details>
<summary>JavaScript Example</summary>
Install [transformers.js](https://github.com/xenova/transformers.js)
`npm i @xenova/transformers`
Let's say you have a query, and you want to rerank a set of documents. In JavaScript, you need to add a function:
```javascript
import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';
const model_id = 'mixedbread-ai/mxbai-rerank-large-v1';
const model = await AutoModelForSequenceClassification.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
/**
* Performs ranking with the CrossEncoder on the given query and documents. Returns a sorted list with the document indices and scores.
* @param {string} query A single query
* @param {string[]} documents A list of documents
* @param {Object} options Options for ranking
* @param {number} [options.top_k=undefined] Return the top-k documents. If undefined, all documents are returned.
 * @param {boolean} [options.return_documents=false] If true, also returns the documents. If false, only returns the indices and scores.
*/
async function rank(query, documents, {
    top_k = undefined,
    return_documents = false,
} = {}) {
    const inputs = tokenizer(
        new Array(documents.length).fill(query),
        {
            text_pair: documents,
            padding: true,
            truncation: true,
        }
    );
    const { logits } = await model(inputs);
    return logits
        .sigmoid()
        .tolist()
        .map(([score], i) => ({
            corpus_id: i,
            score,
            ...(return_documents ? { text: documents[i] } : {})
        }))
        .sort((a, b) => b.score - a.score)
        .slice(0, top_k);
}
// Example usage:
const query = "Who wrote 'To Kill a Mockingbird'?"
const documents = [
"'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
"The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
"Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
"Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
"The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
"'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]
const results = await rank(query, documents, { return_documents: true, top_k: 3 });
console.log(results);
```
</details>
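
The post-processing in the JavaScript `rank` function (sigmoid over the logits, sort by score, keep the top k) can be sketched in plain Python. This is only an illustration of that step, not the sentence-transformers implementation: `rank_from_logits` is a hypothetical helper, and the logits below are made up rather than real model output.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a score in the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_from_logits(logits, documents, top_k=None, return_documents=False):
    """Sort documents by sigmoid(logit), highest first, and keep the top_k."""
    results = []
    for i, logit in enumerate(logits):
        entry = {"corpus_id": i, "score": sigmoid(logit)}
        if return_documents:
            entry["text"] = documents[i]
        results.append(entry)
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_k]  # top_k=None keeps every document


# Hypothetical logits, invented for illustration (not real model output).
example_logits = [2.0, -3.0, 0.5]
example_docs = ["doc a", "doc b", "doc c"]

ranked = rank_from_logits(example_logits, example_docs, top_k=2, return_documents=True)
print([r["corpus_id"] for r in ranked])  # -> [0, 2]
```

The returned dictionaries use the same `corpus_id`, `score`, and `text` keys as the JavaScript example above.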
## Using API
You can use the model via our API as follows:
```python
from mixedbread_ai.client import MixedbreadAI
mxbai = MixedbreadAI(api_key="{MIXEDBREAD_API_KEY}")
res = mxbai.reranking(
model="mixedbread-ai/mxbai-rerank-large-v1",
query="Who is the author of To Kill a Mockingbird?",
input=[
"To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
"The novel Moby-Dick was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
"Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
"Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
"The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
"The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
],
top_k=3,
return_input=False
)
print(res.data)
```
The API comes with additional features, such as a continuously trained reranker! Check out the [docs](https://www.mixedbread.ai/docs) for more information.
## Evaluation
Our reranker models are designed to elevate your search. They work extremely well in combination with keyword search and can even outperform semantic search systems in many cases.
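
A minimal sketch of how a reranker can sit behind keyword search: a cheap first stage narrows the corpus, then the reranker rescores the surviving candidates. Everything here is an assumption made for illustration: `keyword_score` is a toy term-overlap scorer (not Lucene or BM25), `retrieve_then_rerank` is a hypothetical helper, and `placeholder_scorer` stands in for the model so the snippet runs without downloading weights.

```python
from typing import Callable, List, Sequence, Tuple


def keyword_score(query: str, document: str) -> int:
    """Toy first-stage score: number of distinct query terms found in the document."""
    doc_terms = set(document.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in doc_terms)


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    n_candidates: int = 100,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    # Stage 1: keyword retrieval keeps only the n_candidates best matches.
    candidates = sorted(
        corpus, key=lambda doc: keyword_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: the reranker scores every (query, candidate) pair.
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def placeholder_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in for a real reranker; with sentence-transformers you could pass
    CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1").predict instead."""
    return [float(keyword_score(q, d)) for q, d in pairs]


corpus = [
    "harper lee wrote the novel",
    "melville wrote moby dick",
    "a cooking recipe for bread",
]
top = retrieve_then_rerank(
    "who wrote the novel", corpus, placeholder_scorer, n_candidates=2, top_k=2
)
print(top)
```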
| Model | NDCG@10 | Accuracy@3 |
| ------------------------------------------------------------------------------------- | -------- | ---------- |
| Lexical Search (Lucene) | 38.0 | 66.4 |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 41.6 | 66.9 |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 45.2 | 70.6 |
| cohere-embed-v3 (semantic search) | 47.5 | 70.9 |
| [mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1) | **43.9** | **70.0** |
| [mxbai-rerank-base-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1) | **46.9** | **72.3** |
| [mxbai-rerank-large-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1) | **48.8** | **74.9** |
The reported results are aggregated from 11 BEIR datasets. We used [Pyserini](https://github.com/castorini/pyserini/) to evaluate the models. Find more in our [blog post](https://www.mixedbread.ai/blog/mxbai-rerank-v1) and in this [spreadsheet](https://docs.google.com/spreadsheets/d/15ELkSMFv-oHa5TRiIjDvhIstH9dlc3pnZeO-iGz4Ld4/edit?usp=sharing).
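
As a rough illustration of the two metrics in the table, the sketch below computes NDCG@k with linear gain and a simple hit-based Accuracy@k for a single query. The function names are hypothetical, the Accuracy@k definition (at least one relevant document in the top k) is an assumption, and the ideal ranking is built only from the list passed in, so this is not a reproduction of the Pyserini evaluation.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query, given relevance labels in ranked order.

    Assumes ranked_relevances contains every judged document for the query,
    so the ideal ordering can be derived by sorting the same list.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


def accuracy_at_k(ranked_relevances, k=3):
    """1.0 if any of the top-k results is relevant, else 0.0 (assumed definition)."""
    return 1.0 if any(rel > 0 for rel in ranked_relevances[:k]) else 0.0


# Hypothetical relevance labels for one query, in the order the system ranked them.
labels = [0, 1, 0, 1]
print(round(ndcg_at_k(labels, k=10), 4))
print(accuracy_at_k(labels, k=3))
```

Per-query values like these would then be averaged over all queries of a dataset.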
## Community
Please join our [Discord Community](https://discord.gg/jDfMHzAVfU) and share your feedback and thoughts! We are here to help and also always happy to chat.
## Citation
```bibtex
@online{rerank2024mxbai,
title={Boost Your Search With The Crispy Mixedbread Rerank Models},
author={Aamir Shakir and Darius Koenig and Julius Lipp and Sean Lee},
year={2024},
url={https://www.mixedbread.ai/blog/mxbai-rerank-v1},
}
```
## License
Apache 2.0