---
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- multilingual
- matryoshka
- 2d-matryoshka
- long-context
- modernbert
- retrieval
- rag
- agents
- routing
- memory
base_model: llm-semantic-router/mmbert-32k-yarn
datasets:
- BAAI/bge-m3-data
language:
- multilingual
license: apache-2.0
pipeline_tag: sentence-similarity
model-index:
- name: aegis-embed
  results:
  - task:
      type: STS
    dataset:
      name: STS Benchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman
      value: 80.5
---

# aegis-embed

`aegis-embed` is a **multilingual long-context embedding model purpose-built for agent-native retrieval, memory, and decision workflows**.

It is designed for systems where embeddings sit on the semantic hot path rather than at the edge of the stack: **memory lookup, knowledge retrieval, tool matching, task routing, long-horizon recall, clustering, and multilingual indexing**. Its value is not just a benchmark score, but a practical operating profile that fits real agent runtimes: **32K context**, **2D Matryoshka adaptability across dimensions and layers**, **307M-class deployability**, and **strong latency-quality efficiency under repeated inference**.

In short, `aegis-embed` is built for teams that want one embedding space to support **fast routing, scalable retrieval, and high-confidence semantic matching** without paying the operational cost of a much larger model.

## Why it fits agentic workloads

Agentic systems do not call embeddings once. They call them **everywhere**: before retrieval, during routing, when matching tools, when searching memory, and while compressing or re-ranking state. A useful agent embedding model must therefore be more than accurate; it must also stay flexible under tight runtime budgets.

`aegis-embed` is designed around that reality.

### 1. One model, many budget tiers

This model supports **Matryoshka embeddings**, which means you can encode once at full size and truncate to smaller dimensions with limited quality loss: on STS, about 99% of the full-dimension score is retained at 256d and about 98% at 64d.

That is especially useful for agent systems because different stages of the stack often need different budgets:

- **64d** for very cheap candidate generation, broad routing, or huge memory banks
- **256d** for balanced retrieval over large corpora
- **768d** for highest-quality retrieval, offline indexing, or final-stage matching

Instead of managing separate embedding models for each tier, you can keep **one semantic space** and choose the dimensional budget that matches the task.

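To make the budget tiers concrete, the sketch below estimates the raw index size for each tier. It assumes uncompressed float32 vectors (4 bytes per dimension) and a hypothetical memory bank of 10 million entries; neither figure comes from this model card, and real vector stores add their own index overhead.

```python
# Rough storage estimate for each Matryoshka tier.
# Assumptions (not from the model card): float32 storage, 10M vectors,
# no index overhead and no quantization.
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw size of a dense float32 matrix with num_vectors rows and dim columns."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def tier_report(num_vectors: int, dims=(768, 256, 64)) -> dict:
    """Map each dimension tier to its raw size in GiB."""
    return {d: index_size_bytes(num_vectors, d) / 2**30 for d in dims}


if __name__ == "__main__":
    for dim, gib in tier_report(10_000_000).items():
        print(f"{dim:>4}d: {gib:6.2f} GiB")
```

Under these assumptions the 64d tier needs one twelfth of the storage of the 768d tier, since raw size scales linearly with dimension.
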
### 2. 2D Matryoshka gives runtime flexibility, not just storage savings

The model is trained with **2D Matryoshka** behavior:

- **dimension reduction** for smaller vectors and lower storage / bandwidth cost
- **layer reduction** for lower-latency inference paths in custom runtimes

This matters for agents because the same system often mixes:

- latency-sensitive routing decisions
- high-volume memory scans
- higher-quality retrieval for final evidence gathering

A single model that can serve multiple latency / quality profiles is much easier to operate than a stack of unrelated specialized encoders.

### 3. Long context helps when agent state is not naturally short

Many agent workloads are not short isolated queries. They involve:

- tool descriptions
- execution traces
- long notes
- merged memory summaries
- multi-hop research snippets
- large document chunks

With **32,768 tokens** of context length, `aegis-embed` can represent larger semantic units before you are forced into aggressive chunking. That helps preserve cross-section meaning in long documents and richer memory entries.

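One way to act on this is to chunk only when an input actually exceeds the configured window. The sketch below works on a sequence that has already been tokenized; `split_into_windows`, the default window size, and the overlap parameter are illustrative choices, not part of the model's API.

```python
from typing import List, Sequence


def split_into_windows(tokens: Sequence[str], max_len: int = 32768,
                       overlap: int = 0) -> List[List[str]]:
    """Return the input as one window if it fits, else overlapping windows."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if len(tokens) <= max_len:
        # Fits in the context window: keep it as a single semantic unit.
        return [list(tokens)]
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows
```

With a 32K window most notes and traces come back as a single window, so chunk boundaries only appear for inputs that genuinely cannot fit.
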
### 4. Small enough to be operationally practical

At roughly **307M parameters**, this model sits in a useful middle ground:

- substantially lighter than large embedding models in the 600M+ or multi-billion range
- still expressive enough for multilingual retrieval and similarity work
- easier to host in systems where embedding is part of a hot path rather than an occasional offline batch

For agentic platforms, that usually means better economics and simpler scaling.

### 5. One embedding space across the stack

Agent systems are easier to operate when **routing, retrieval, memory search, and semantic matching** all live in the same vector space.

`aegis-embed` is well suited to that pattern:

- **64d** can serve broad routing and large-memory scanning
- **256d** can cover the main retrieval tier
- **768d** can stay reserved for the highest-fidelity matching paths

That means one model can cover multiple semantic stages without forcing the system to juggle incompatible encoders, duplicated indexes, or divergent retrieval behavior.

## Model at a glance

| Feature | Value |
|---------|-------|
| **Parameters** | 307M |
| **Architecture** | ModernBERT encoder with YaRN scaling |
| **Hidden Size** | 768 |
| **Layers** | 22 |
| **Context Length** | 32,768 tokens |
| **Pooling** | Mean pooling |
| **Similarity** | Cosine |
| **Languages** | Multilingual |
| **Matryoshka Dimensions** | 768, 512, 256, 128, 64 |

## Headline results

| Metric | Score |
|--------|-------|
| **MTEB Mean (24 tasks)** | **61.4** |
| **STS Benchmark** | **80.5** |
| **Dimension Retention** | **99% @ 256d**, **98% @ 64d** |
| **Layer Speedup** | **3.3× @ 6L**, **5.8× @ 3L** |
| **Latency vs BGE-M3** | **1.6-3.1× faster** on longer sequences / larger batches |

These numbers make the model particularly attractive for systems that must balance **quality, latency, vector size, and deployment simplicity** instead of optimizing only for leaderboard peak score.

## Usage

### Basic usage with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    # Chinese: "Retrieve memories related to the user's historical preferences."
    "检索和用户历史偏好相关的记忆。",
    "Retrieve notes about deployment failures in staging.",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

### Matryoshka truncation for smaller vectors

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

# Example inputs; replace with your own texts
texts = [
    "Find tool descriptions related to browser automation.",
    "Retrieve notes about deployment failures in staging.",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Balanced retrieval tier: keep the first 256 dimensions, then re-normalize
embeddings_256d = F.normalize(embeddings[:, :256], p=2, dim=1)

# Ultra-cheap routing / large memory-bank tier
embeddings_64d = F.normalize(embeddings[:, :64], p=2, dim=1)
```

### Long-context encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")
model.max_seq_length = 8192  # can be increased up to 32768

long_note = "..."
embedding = model.encode(long_note)
```

## Why Matryoshka matters for agents

A common agent stack has several retrieval-like stages:

1. **broad candidate fetch** over a very large store
2. **narrower semantic lookup** over a smaller candidate set
3. **high-confidence final matching** before action or answer synthesis

Matryoshka lets one model support all three stages:

| Stage | Suggested Dim | Why |
|-------|---------------|-----|
| Broad routing / candidate generation | 64d | Maximize speed and minimize storage |
| Main retrieval | 256d | Strong balance of quality and cost |
| Final matching / offline indexing | 768d | Best semantic fidelity |

That is often a better operational story than mixing several incompatible embedding models across the same pipeline.

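The staged pattern can be sketched as a coarse-to-fine search over vectors that were encoded once at full size. The example below is a minimal, standard-library-only illustration: `truncate_and_normalize`, `cosine`, and `two_stage_search` are hypothetical helpers, and a real deployment would more likely use NumPy or a vector database for the same logic.

```python
import math
from typing import List, Sequence, Tuple


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a: Sequence[float], b: Sequence[float], dim: int) -> float:
    """Cosine similarity computed on the first `dim` components only."""
    a_n, b_n = truncate_and_normalize(a, dim), truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(a_n, b_n))


def two_stage_search(query: Sequence[float], corpus: Sequence[Sequence[float]],
                     coarse_dim: int, fine_dim: int,
                     shortlist: int, top_k: int) -> List[Tuple[int, float]]:
    """Shortlist by coarse_dim similarity, then re-score the shortlist at fine_dim."""
    coarse = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i], coarse_dim),
                    reverse=True)[:shortlist]
    fine = sorted(coarse,
                  key=lambda i: cosine(query, corpus[i], fine_dim),
                  reverse=True)[:top_k]
    return [(i, cosine(query, corpus[i], fine_dim)) for i in fine]
```

With this model, `coarse_dim=64` and `fine_dim=768` would correspond to the first and last rows of the table; because both stages read prefixes of the same stored vectors, no second encoder or index format is needed.
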
## Evaluation details

### MTEB benchmark (24 tasks)

| Category | Score |
|----------|-------|
| STS (7 tasks) | **79.3** |
| Classification (6) | 62.4 |
| Pair Classification (2) | 76.2 |
| Reranking (2) | 64.4 |
| Clustering (4) | 36.9 |
| Retrieval (3) | 38.2 |
| **Overall Mean** | **61.4** |

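The overall mean appears to be weighted by the number of tasks in each category rather than averaged over the six categories. That is an inference from the published figures, since per-task scores are not listed here, and because the category scores are themselves rounded the check below is only approximate.

```python
# (category, number of tasks, category score) copied from the MTEB table.
CATEGORIES = [
    ("STS", 7, 79.3),
    ("Classification", 6, 62.4),
    ("Pair Classification", 2, 76.2),
    ("Reranking", 2, 64.4),
    ("Clustering", 4, 36.9),
    ("Retrieval", 3, 38.2),
]


def task_weighted_mean(rows) -> float:
    """Mean over tasks: each category score counts once per task."""
    total_tasks = sum(n for _, n, _ in rows)
    return sum(n * score for _, n, score in rows) / total_tasks


def category_mean(rows) -> float:
    """Unweighted mean over categories, shown for comparison."""
    return sum(score for _, _, score in rows) / len(rows)


if __name__ == "__main__":
    print(round(task_weighted_mean(CATEGORIES), 1))
    print(round(category_mean(CATEGORIES), 1))
```

The task-weighted figure rounds to the reported 61.4, while the unweighted category mean comes out lower, so the seven STS tasks carry noticeable weight in the headline number.
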
### STS benchmark comparison

| Model | Parameters | STS Score |
|-------|------------|-----------|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| **aegis-embed** | **307M** | **80.5** |
| Qwen3-Embed-8B | 8B | 81.08 |

### 2D Matryoshka quality matrix (STS)

| Layers | 768d | 256d | 64d |
|--------|------|------|-----|
| 22L | **80.5** | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |

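The retention percentages quoted in the headline results can be recomputed from this matrix. The sketch below expresses each cell as a share of the full 22-layer, 768d score; the `retention` helper is illustrative and only uses numbers from the table.

```python
# STS scores copied from the quality matrix: layers -> {dimension: score}.
STS = {
    22: {768: 80.5, 256: 79.9, 64: 78.5},
    11: {768: 53.7, 256: 48.0, 64: 44.4},
    6: {768: 45.2, 256: 45.2, 64: 43.5},
    3: {768: 44.0, 256: 44.1, 64: 41.8},
}


def retention(layers: int, dim: int, matrix=STS) -> float:
    """Score at (layers, dim) as a percentage of the full 22L / 768d score."""
    return 100.0 * matrix[layers][dim] / matrix[22][768]


if __name__ == "__main__":
    for layers, row in STS.items():
        cells = ", ".join(f"{d}d={retention(layers, d):.1f}%" for d in row)
        print(f"{layers:>2}L: {cells}")
```

At full depth, truncating dimensions keeps about 99% (256d) and 98% (64d) of the STS score, whereas dropping to 11 layers keeps roughly 67% even at 768d. Dimension reduction is therefore the much cheaper of the two axes in quality terms.
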
### Long-context retrieval (4K tokens)

| Metric | Score |
|--------|-------|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |

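For readers unfamiliar with these metrics, the sketch below shows the standard definitions of Recall@k and mean reciprocal rank, assuming one relevant document per query. The evaluation set behind the table is not described here, so the ranks in the usage example are toy values, not the model's results.

```python
from typing import List, Optional


def recall_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of queries whose relevant item is ranked within the top k.

    Each entry is the 1-based rank of the relevant item, or None if it was
    not retrieved at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank over all queries, counting unretrieved items as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


if __name__ == "__main__":
    toy_ranks = [1, 1, 3, None]  # toy data: four queries
    print(recall_at_k(toy_ranks, 1), recall_at_k(toy_ranks, 10))
    print(mean_reciprocal_rank(toy_ranks))
```

MRR always falls between R@1 and the recall at the deepest cutoff considered, which is consistent with the ordering of the three figures in the table.
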
### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|--------|------------|---------|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |

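Combining the throughput table with the 768d column of the 2D Matryoshka quality matrix gives a simple way to pick a layer profile: take the fastest configuration that still meets a quality floor. The sketch below is illustrative only; `fastest_profile` and the example thresholds are assumptions, and the throughput values are per-second figures as reported, with the unit left unspecified in the table.

```python
# Throughput figures copied from the table (per second, on AMD MI300X).
THROUGHPUT = {22: 477, 11: 916, 6: 1573, 3: 2761}
# 768d STS scores copied from the 2D Matryoshka quality matrix.
STS_768D = {22: 80.5, 11: 53.7, 6: 45.2, 3: 44.0}


def speedup(layers: int) -> float:
    """Throughput relative to the full 22-layer model."""
    return THROUGHPUT[layers] / THROUGHPUT[22]


def fastest_profile(min_sts: float):
    """Layer count with the highest throughput whose STS score is >= min_sts.

    Returns None when no configuration meets the floor.
    """
    eligible = [layers for layers, score in STS_768D.items() if score >= min_sts]
    return max(eligible, key=THROUGHPUT.get) if eligible else None


if __name__ == "__main__":
    for layers in THROUGHPUT:
        print(f"{layers:>2}L: {speedup(layers):.1f}x")
    print(fastest_profile(75.0), fastest_profile(45.0))
```

Any floor above the 11-layer score selects the full 22-layer model, which matches the guidance in the limitations that full depth is the safer default for important retrieval decisions.
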
## Training

### Data

Trained on [BAAI/bge-m3-data](https://huggingface.co/datasets/BAAI/bge-m3-data) with multilingual triplets across diverse domains.

### Configuration

- **Base model**: [llm-semantic-router/mmbert-32k-yarn](https://huggingface.co/llm-semantic-router/mmbert-32k-yarn)
- **Loss**: `Matryoshka2dLoss` (combines adaptive layer loss and Matryoshka loss)
- **Matryoshka dimensions**: `[768, 512, 256, 128, 64]`
- **Max sequence length**: `32768`
- **Batch size**: `16` (effective `32` with gradient accumulation)
- **Learning rate**: `2e-5`
- **Hardware**: AMD Instinct MI300X

## Recommended use cases

`aegis-embed` is especially well suited for:

- **Agent memory retrieval** across long, mixed-format notes or histories
- **Tool and skill selection** where descriptions need semantic matching
- **Knowledge-base retrieval** for assistants and RAG systems
- **Multilingual search** across mixed-language corpora
- **Large memory banks** that benefit from 64d / 256d vector tiers
- **Long-document semantic indexing** where short-context encoders lose structure

## Model lineage and packaging

`aegis-embed` is derived from `llm-semantic-router/mmbert-embed-32k-2d-matryoshka` and distributed here as a lean Sentence Transformers / PyTorch package.

This build intentionally omits bundled ONNX artifacts so the model remains smaller and easier to move, mirror, cache, and deploy in environments that primarily rely on native Transformers runtimes.

## Limitations

- Full-quality mode is still the best default for important retrieval decisions; aggressive layer reduction trades away quality (STS at 768d falls from 80.5 with 22 layers to 53.7 with 11 layers).
- Although the model supports up to 32K tokens, very long inputs still increase compute and memory cost.
- The model is optimized for retrieval and semantic similarity; some downstream tasks may benefit from task-specific fine-tuning.
- If your deployment stack requires ONNX out of the box, you will need to export that separately.

## Citation

If you use this model, please cite the upstream work it is derived from:

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```

## License

Apache 2.0