---
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embeddings
- multilingual
- matryoshka
- 2d-matryoshka
- long-context
- modernbert
- retrieval
- rag
- agents
- routing
- memory
base_model: llm-semantic-router/mmbert-32k-yarn
datasets:
- BAAI/bge-m3-data
language:
- multilingual
license: apache-2.0
pipeline_tag: sentence-similarity
model-index:
- name: aegis-embed
  results:
  - task:
      type: STS
    dataset:
      name: STS Benchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman
      value: 80.5
---

# aegis-embed

`aegis-embed` is a **multilingual long-context embedding model purpose-built for agent-native retrieval, memory, and decision workflows**.

It is designed for systems where embeddings sit on the semantic hot path rather than at the edge of the stack: **memory lookup, knowledge retrieval, tool matching, task routing, long-horizon recall, clustering, and multilingual indexing**. Its value lies not only in benchmark scores but in an operating profile that fits real agent runtimes: **32K context**, **2D Matryoshka adaptability across dimensions and layers**, **a deployable ~307M-parameter footprint**, and **a favorable latency/quality trade-off under repeated inference**.

In short, `aegis-embed` is built for teams that want one embedding space to support **fast routing, scalable retrieval, and high-confidence semantic matching** without paying the operational cost of a much larger model.

## Why it fits agentic workloads

Agentic systems do not call embeddings once. They call them **everywhere**: before retrieval, during routing, when matching tools, when searching memory, and while compressing or re-ranking state. A useful agent embedding model must therefore be more than accurate; it must also stay flexible under tight runtime budgets.

`aegis-embed` is designed around that reality.

### 1. One model, many budget tiers

This model supports **Matryoshka embeddings**, which means you can encode once at full size and truncate to smaller dimensions with limited quality loss (roughly 99% of full-size STS quality at 256d and 98% at 64d).

That is especially useful for agent systems because different stages of the stack often need different budgets:

- **64d** for very cheap candidate generation, broad routing, or huge memory banks
- **256d** for balanced retrieval over large corpora
- **768d** for highest-quality retrieval, offline indexing, or final-stage matching

Instead of managing separate embedding models for each tier, you can keep **one semantic space** and choose the dimensional budget that matches the task.

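
To make the budget tiers concrete, here is a minimal back-of-the-envelope sketch of raw vector storage at each dimension. It assumes uncompressed float32 vectors and a hypothetical memory bank of 10 million entries; neither figure comes from the model's evaluation, and real indexes add their own overhead.

```python
# Raw storage for float32 embeddings at each Matryoshka tier.
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Bytes needed to store `num_vectors` float32 embeddings of width `dim`."""
    return num_vectors * dim * BYTES_PER_FLOAT32


NUM_VECTORS = 10_000_000  # hypothetical memory-bank size (assumption)

sizes_gb = {dim: index_size_bytes(NUM_VECTORS, dim) / 1e9 for dim in (768, 256, 64)}
for dim, size_gb in sizes_gb.items():
    print(f"{dim}d: {size_gb:.2f} GB")
```

Under these assumptions the 64d tier needs one twelfth of the storage of the 768d tier, which is why it suits very large stores.
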
### 2. 2D Matryoshka gives runtime flexibility, not just storage savings

The model is trained with **2D Matryoshka** behavior:

- **dimension reduction** for smaller vectors and lower storage / bandwidth cost
- **layer reduction** for lower-latency inference paths in custom runtimes

This matters for agents because the same system often mixes:

- latency-sensitive routing decisions
- high-volume memory scans
- higher-quality retrieval for final evidence gathering

A single model that can serve multiple latency / quality profiles is much easier to operate than a stack of unrelated specialized encoders.

### 3. Long context helps when agent state is not naturally short

Many agent workloads are not short isolated queries. They involve:

- tool descriptions
- execution traces
- long notes
- merged memory summaries
- multi-hop research snippets
- large document chunks

With **32,768 tokens** of context length, `aegis-embed` can represent larger semantic units before you are forced into aggressive chunking. That helps preserve cross-section meaning in long documents and richer memory entries.

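
As a rough illustration of how window size affects chunking, the sketch below counts how many sliding windows are needed to cover a document. The `chunks_needed` helper, the 20,000-token document, and the 512-token comparison window are all hypothetical and chosen only for illustration.

```python
import math


def chunks_needed(doc_tokens: int, window: int, overlap: int = 0) -> int:
    """Number of sliding windows required to cover `doc_tokens` tokens."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window")
    if doc_tokens <= window:
        return 1
    stride = window - overlap
    # One full window, plus enough strides to cover the remainder.
    return 1 + math.ceil((doc_tokens - window) / stride)


DOC_TOKENS = 20_000  # hypothetical long memory entry (assumption)

for window in (512, 8192, 32768):
    print(f"window={window}: {chunks_needed(DOC_TOKENS, window)} chunk(s)")
```

With a 32,768-token window the whole entry fits in a single embedding, so no cross-chunk context is lost.
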
### 4. Small enough to be operationally practical

At roughly **307M parameters**, this model sits in a useful middle ground:

- substantially lighter than large embedding models in the 600M+ or multi-billion range
- still expressive enough for multilingual retrieval and similarity work
- easier to host in systems where embedding is part of a hot path rather than an occasional offline batch

For agentic platforms, that usually means better economics and simpler scaling.

### 5. One embedding space across the stack

Agent systems are easier to operate when **routing, retrieval, memory search, and semantic matching** all live in the same vector space.

`aegis-embed` is well suited to that pattern:

- **64d** can serve broad routing and large-memory scanning
- **256d** can cover the main retrieval tier
- **768d** can stay reserved for the highest-fidelity matching paths

That means one model can cover multiple semantic stages without forcing the system to juggle incompatible encoders, duplicated indexes, or divergent retrieval behavior.

## Model at a glance

| Feature | Value |
|---------|-------|
| **Parameters** | 307M |
| **Architecture** | ModernBERT encoder with YaRN scaling |
| **Hidden Size** | 768 |
| **Layers** | 22 |
| **Context Length** | 32,768 tokens |
| **Pooling** | Mean pooling |
| **Similarity** | Cosine |
| **Languages** | Multilingual |
| **Matryoshka Dimensions** | 768, 512, 256, 128, 64 |

## Headline results

| Metric | Score |
|--------|-------|
| **MTEB Mean (24 tasks)** | **61.4** |
| **STS Benchmark** | **80.5** |
| **Dimension Retention** | **99% @ 256d**, **98% @ 64d** |
| **Layer Speedup** | **3.3× @ 6L**, **5.8× @ 3L** |
| **Latency vs BGE-M3** | **1.6× to 3.1× faster** on longer sequences / larger batches |

These numbers make the model particularly attractive for systems that must balance **quality, latency, vector size, and deployment simplicity** instead of optimizing only for leaderboard peak score.

## Usage

### Basic usage with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    # Chinese: "Retrieve memories related to the user's past preferences."
    "检索和用户历史偏好相关的记忆。",
    "Retrieve notes about deployment failures in staging.",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

### Matryoshka truncation for smaller vectors

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "Retrieve notes about deployment failures in staging.",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Balanced retrieval tier: keep the first 256 dims, then re-normalize
embeddings_256d = F.normalize(embeddings[:, :256], p=2, dim=1)

# Ultra-cheap routing / large memory-bank tier
embeddings_64d = F.normalize(embeddings[:, :64], p=2, dim=1)
```

### Long-context encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")
model.max_seq_length = 8192  # can be increased up to 32768

long_note = "..."
embedding = model.encode(long_note)
```

## Why Matryoshka matters for agents

A common agent stack has several retrieval-like stages:

1. **broad candidate fetch** over a very large store
2. **narrower semantic lookup** over a smaller candidate set
3. **high-confidence final matching** before action or answer synthesis

Matryoshka lets one model support all three stages:

| Stage | Suggested Dim | Why |
|-------|---------------|-----|
| Broad routing / candidate generation | 64d | Maximize speed and minimize storage |
| Main retrieval | 256d | Strong balance of quality and cost |
| Final matching / offline indexing | 768d | Best semantic fidelity |

That is often a better operational story than mixing several incompatible embedding models across the same pipeline.

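
The three stages can be sketched as a coarse-to-fine funnel over a single set of full-size vectors. This is an illustrative, dependency-free sketch rather than a production design: the `funnel_search` helper and its per-stage candidate counts are hypothetical, random vectors stand in for real embeddings, and a real system would typically use an ANN index for the first stage.

```python
import math
import random


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def top_k(query, candidates, dim, k):
    """Return the k (doc_id, vector) pairs most similar to `query` at `dim` dims."""
    q = truncate_and_normalize(query, dim)

    def score(item):
        d = truncate_and_normalize(item[1], dim)
        return sum(a * b for a, b in zip(q, d))  # cosine similarity

    return sorted(candidates, key=score, reverse=True)[:k]


def funnel_search(query, store, stages=((64, 100), (256, 20), (768, 5))):
    """Narrow the candidate set stage by stage, widening the vectors each time."""
    candidates = list(store.items())
    for dim, k in stages:
        candidates = top_k(query, candidates, dim, k)
    return [doc_id for doc_id, _ in candidates]


# Toy data: random vectors stand in for real 768d embeddings (assumption).
rng = random.Random(0)
store = {i: [rng.gauss(0.0, 1.0) for _ in range(768)] for i in range(500)}
query = list(store[42])  # a query identical to document 42

result = funnel_search(query, store)
print(result)
```

Because every stage reads a prefix of the same stored vector, only one encoding pass and one semantic space are needed; the stages differ only in how many dimensions they compare.
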
## Evaluation details

### MTEB benchmark (24 tasks)

| Category | Score |
|----------|-------|
| STS (7 tasks) | **79.3** |
| Classification (6) | 62.4 |
| Pair Classification (2) | 76.2 |
| Reranking (2) | 64.4 |
| Clustering (4) | 36.9 |
| Retrieval (3) | 38.2 |
| **Overall Mean** | **61.4** |

### STS benchmark comparison

| Model | Parameters | STS Score |
|-------|------------|-----------|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| **aegis-embed** | **307M** | **80.5** |
| Qwen3-Embed-8B | 8B | 81.08 |

### 2D Matryoshka quality matrix (STS)

| Layers | 768d | 256d | 64d |
|--------|------|------|-----|
| 22L | **80.5** | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |

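
The retention figures quoted in the headline results can be re-derived from this matrix. The short sketch below divides each cell by the full-size score (22 layers, 768d); the `retention` helper is only illustrative.

```python
# STS scores copied from the 2D Matryoshka quality matrix: layers -> dim -> score.
STS = {
    22: {768: 80.5, 256: 79.9, 64: 78.5},
    11: {768: 53.7, 256: 48.0, 64: 44.4},
    6: {768: 45.2, 256: 45.2, 64: 43.5},
    3: {768: 44.0, 256: 44.1, 64: 41.8},
}
FULL = STS[22][768]


def retention(layers: int, dim: int) -> float:
    """Fraction of the full-size STS score kept at a given (layers, dim) setting."""
    return STS[layers][dim] / FULL


for layers, row in STS.items():
    cells = ", ".join(f"{dim}d={retention(layers, dim):.1%}" for dim in row)
    print(f"{layers}L: {cells}")
```

Dimension truncation at full depth keeps nearly all of the quality, while dropping layers costs far more, so layer reduction is best reserved for paths where latency clearly outweighs accuracy.
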
### Long-context retrieval (4K tokens)

| Metric | Score |
|--------|-------|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |

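
For readers unfamiliar with these metrics, here is a minimal sketch of how Recall@k and MRR are conventionally computed from the rank of the first relevant document per query. The ranks below are made-up toy values, not the evaluation data behind the table, and the exact evaluation protocol used for this model is not described here.

```python
def recall_at_k(ranks, k):
    """Share of queries whose first relevant document appears in the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; queries with no relevant hit contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# 1-based rank of the first relevant document per query; None means not retrieved.
ranks = [1, 1, 3, None, 2, 1, 12, 1]  # toy values (assumption)

print(f"R@1  = {recall_at_k(ranks, 1):.1%}")
print(f"R@10 = {recall_at_k(ranks, 10):.1%}")
print(f"MRR  = {mean_reciprocal_rank(ranks):.1%}")
```
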
### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|--------|------------|---------|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |

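
The speedup column follows directly from the throughput column: each value is the throughput at that depth divided by the 22-layer baseline. A minimal check, using only the figures from the table:

```python
# Throughput per second from the table above, keyed by number of layers.
THROUGHPUT = {22: 477, 11: 916, 6: 1573, 3: 2761}
BASELINE = THROUGHPUT[22]

speedup = {layers: rate / BASELINE for layers, rate in THROUGHPUT.items()}

for layers, factor in speedup.items():
    print(f"{layers}L: {factor:.1f}x")
```
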
## Training

### Data

Trained on [BAAI/bge-m3-data](https://huggingface.co/datasets/BAAI/bge-m3-data) with multilingual triplets across diverse domains.

### Configuration

- **Base model**: [llm-semantic-router/mmbert-32k-yarn](https://huggingface.co/llm-semantic-router/mmbert-32k-yarn)
- **Loss**: `Matryoshka2dLoss` (combines adaptive layer loss and Matryoshka loss)
- **Matryoshka dimensions**: `[768, 512, 256, 128, 64]`
- **Max sequence length**: `32768`
- **Batch size**: `16` (effective `32` with gradient accumulation)
- **Learning rate**: `2e-5`
- **Hardware**: AMD Instinct MI300X

## Recommended use cases

`aegis-embed` is especially well suited for:

- **Agent memory retrieval** across long, mixed-format notes or histories
- **Tool and skill selection** where descriptions need semantic matching
- **Knowledge-base retrieval** for assistants and RAG systems
- **Multilingual search** across mixed-language corpora
- **Large memory banks** that benefit from 64d / 256d vector tiers
- **Long-document semantic indexing** where short-context encoders lose structure

## Model lineage and packaging

`aegis-embed` is derived from `llm-semantic-router/mmbert-embed-32k-2d-matryoshka` and distributed here as a lean Sentence Transformers / PyTorch package.

This build intentionally omits bundled ONNX artifacts so the model remains smaller and easier to move, mirror, cache, and deploy in environments that primarily rely on native Transformers runtimes.

## Limitations

- Full-quality mode (all 22 layers) is still the best default for important retrieval decisions; layer reduction trades away substantial quality (STS falls from 80.5 at 22 layers to 53.7 at 11 layers).
- Although the model supports up to 32K tokens, very long inputs still increase compute and memory cost.
- The model is optimized for retrieval and semantic similarity; some downstream tasks may benefit from task-specific fine-tuning.
- If your deployment stack requires ONNX out of the box, you will need to export that separately.

## Citation

If you use this model, please cite the upstream work it is derived from:

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```

## License

Apache 2.0