---
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - embeddings
  - multilingual
  - matryoshka
  - 2d-matryoshka
  - long-context
  - modernbert
  - retrieval
  - rag
  - agents
  - routing
  - memory
base_model: llm-semantic-router/mmbert-32k-yarn
datasets:
  - BAAI/bge-m3-data
language:
  - multilingual
license: apache-2.0
pipeline_tag: sentence-similarity
model-index:
  - name: aegis-embed
    results:
      - task:
          type: STS
        dataset:
          name: STS Benchmark
          type: mteb/stsbenchmark-sts
        metrics:
          - type: spearman
            value: 80.5
---

# aegis-embed

`aegis-embed` is a **multilingual long-context embedding model purpose-built for agent-native retrieval, memory, and decision workflows**.

It is designed for systems where embeddings sit on the semantic hot path rather than at the edge of the stack: **memory lookup, knowledge retrieval, tool matching, task routing, long-horizon recall, clustering, and multilingual indexing**. Its value is not just a benchmark score, but a practical operating profile that fits real agent runtimes: **32K context**, **2D Matryoshka adaptability across dimensions and layers**, a **307M-parameter footprint**, and **1.6 to 3.1× faster inference than BGE-M3 on longer sequences and larger batches**.

In short, `aegis-embed` is built for teams that want one embedding space to support **fast routing, scalable retrieval, and high-confidence semantic matching** without paying the operational cost of a much larger model.

## Why it fits agentic workloads

Agentic systems do not call embeddings once. They call them **everywhere**: before retrieval, during routing, when matching tools, when searching memory, and while compressing or re-ranking state. A useful agent embedding model must therefore be more than accurate: it must also stay flexible under tight runtime budgets.

`aegis-embed` is designed around that reality.

### 1. One model, many budget tiers

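This model supports **Matryoshka embeddings**: you can encode once at the full 768 dimensions, keep only the leading dimensions, and re-normalize, with limited quality loss. On STS Benchmark, the 256d and 64d truncations retain about 99% and 98% of the full-size score.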

That is especially useful for agent systems because different stages of the stack often need different budgets:

- **64d** for very cheap candidate generation, broad routing, or huge memory banks
- **256d** for balanced retrieval over large corpora
- **768d** for highest-quality retrieval, offline indexing, or final-stage matching

Instead of managing separate embedding models for each tier, you can keep **one semantic space** and choose the dimensional budget that matches the task.
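
To make the tiers concrete, here is a rough sketch of raw vector storage per tier. It assumes float32 values (4 bytes each) and a hypothetical store of 10 million vectors, and it ignores any index overhead. `index_size_gb` is an illustrative helper, not part of any library.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for `num_vectors` embeddings of width `dim`, in gigabytes.

    Assumes float32 storage by default; quantized indexes would be smaller.
    """
    return num_vectors * dim * bytes_per_value / 1e9


# Hypothetical memory bank of 10 million entries
for dim in (64, 256, 768):
    print(f"{dim}d: {index_size_gb(10_000_000, dim):.2f} GB")
# 64d: 2.56 GB
# 256d: 10.24 GB
# 768d: 30.72 GB
```

Under these assumptions the 64d tier needs one twelfth of the storage of the 768d tier.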

### 2. 2D Matryoshka gives runtime flexibility, not just storage savings

The model is trained with **2D Matryoshka** behavior:

- **dimension reduction** for smaller vectors and lower storage / bandwidth cost
- **layer reduction** (running only the first 11, 6, or 3 of the 22 layers) for lower-latency inference paths in custom runtimes, at a substantial cost in quality

This matters for agents because the same system often mixes:

- latency-sensitive routing decisions
- high-volume memory scans
- higher-quality retrieval for final evidence gathering

A single model that can serve multiple latency / quality profiles is much easier to operate than a stack of unrelated specialized encoders.

### 3. Long context helps when agent state is not naturally short

Many agent workloads are not short isolated queries. They involve:

- tool descriptions
- execution traces
- long notes
- merged memory summaries
- multi-hop research snippets
- large document chunks

With **32,768 tokens** of context length, `aegis-embed` can represent larger semantic units before you are forced into aggressive chunking. That helps preserve cross-section meaning in long documents and richer memory entries.

### 4. Small enough to be operationally practical

At roughly **307M parameters**, this model sits in a useful middle ground:

- substantially lighter than large embedding models in the 600M+ or multi-billion range
- still expressive enough for multilingual retrieval and similarity work
- easier to host in systems where embedding is part of a hot path rather than an occasional offline batch

For agentic platforms, that usually means better economics and simpler scaling.

### 5. One embedding space across the stack

Agent systems are easier to operate when **routing, retrieval, memory search, and semantic matching** all live in the same vector space.

`aegis-embed` is well suited to that pattern:

- **64d** can serve broad routing and large-memory scanning
- **256d** can cover the main retrieval tier
- **768d** can stay reserved for the highest-fidelity matching paths

That means one model can cover multiple semantic stages without forcing the system to juggle incompatible encoders, duplicated indexes, or divergent retrieval behavior.

## Model at a glance

| Feature | Value |
|---------|-------|
| **Parameters** | 307M |
| **Architecture** | ModernBERT encoder with YaRN scaling |
| **Hidden Size** | 768 |
| **Layers** | 22 |
| **Context Length** | 32,768 tokens |
| **Pooling** | Mean pooling |
| **Similarity** | Cosine |
| **Languages** | Multilingual |
| **Matryoshka Dimensions** | 768, 512, 256, 128, 64 |

## Headline results

| Metric | Score |
|--------|-------|
| **MTEB Mean (24 tasks)** | **61.4** |
| **STS Benchmark** | **80.5** |
| **Dimension Retention (STS, relative to 768d)** | **99% @ 256d**, **98% @ 64d** |
| **Layer Speedup** | **3.3× @ 6L**, **5.8× @ 3L** |
| **Latency vs BGE-M3** | **1.6-3.1× faster** on longer sequences / larger batches |

These numbers make the model particularly attractive for systems that must balance **quality, latency, vector size, and deployment simplicity** instead of optimizing only for leaderboard peak score.

## Usage

### Basic usage with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "检索和用户历史偏好相关的记忆。",
    "Retrieve notes about deployment failures in staging.",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

### Matryoshka truncation for smaller vectors

```python
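import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "Retrieve notes about deployment failures in staging.",
]

# Encode once at the full 768 dimensions
embeddings = model.encode(texts, convert_to_tensor=True)

# Balanced retrieval tier: keep the first 256 dims, then re-normalize
embeddings_256d = F.normalize(embeddings[:, :256], p=2, dim=1)

# Ultra-cheap routing / large memory-bank tier
embeddings_64d = F.normalize(embeddings[:, :64], p=2, dim=1)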
```

### Long-context encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")
model.max_seq_length = 8192  # can be increased up to 32768

long_note = "..."
embedding = model.encode(long_note)
```

## Why Matryoshka matters for agents

A common agent stack has several retrieval-like stages:

1. **broad candidate fetch** over a very large store
2. **narrower semantic lookup** over a smaller candidate set
3. **high-confidence final matching** before action or answer synthesis

Matryoshka lets one model support all three stages:

| Stage | Suggested Dim | Why |
|------|---------------|-----|
| Broad routing / candidate generation | 64d | Maximize speed and minimize storage |
| Main retrieval | 256d | Strong balance of quality and cost |
| Final matching / offline indexing | 768d | Best semantic fidelity |

That is often a better operational story than mixing several incompatible embedding models across the same pipeline.
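
The three stages can be sketched as a coarse-to-fine funnel over a single set of full-size vectors. This is a minimal illustration, not the model's API: the random vectors stand in for real `aegis-embed` outputs, and `truncate_normalize`, `top_k`, `funnel_search`, and the candidate counts (50, 10, 3) are hypothetical choices.

```python
import math
import random


def truncate_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def top_k(query, corpus, ids, dim, k):
    """Rank the candidate `ids` by cosine similarity at `dim` dimensions."""
    q = truncate_normalize(query, dim)
    scored = []
    for i in ids:
        d = truncate_normalize(corpus[i], dim)
        scored.append((sum(a * b for a, b in zip(q, d)), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


def funnel_search(query, corpus, k64=50, k256=10, k768=3):
    """64d broad fetch -> 256d narrowing -> 768d final matching."""
    ids = top_k(query, corpus, range(len(corpus)), 64, k64)
    ids = top_k(query, corpus, ids, 256, k256)
    return top_k(query, corpus, ids, 768, k768)


# Synthetic stand-ins for 768d embeddings (assumption: real vectors would
# come from model.encode and be truncated once at indexing time).
rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(768)] for _ in range(300)]

# The query is a slightly noisy copy of document 42
query = [x + 0.1 * rng.gauss(0, 1) for x in corpus[42]]

results = funnel_search(query, corpus)
print(results[0])  # 42
```

In a real deployment the 64d and 256d vectors would typically live in their own precomputed indexes rather than being truncated per query; the sketch only shows that every stage reads from the same embedding space.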

## Evaluation details

### MTEB benchmark (24 tasks)

| Category | Score |
|----------|-------|
| STS (7 tasks) | **79.3** |
| Classification (6) | 62.4 |
| Pair Classification (2) | 76.2 |
| Reranking (2) | 64.4 |
| Clustering (4) | 36.9 |
| Retrieval (3) | 38.2 |
| **Overall Mean** | **61.4** |
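
As a quick sanity check, the overall mean can be reproduced from the category rows. The sketch assumes the overall figure is the per-task average, so each category score is weighted by its task count; because the category scores are already rounded, the result is approximate.

```python
# (number of tasks, category score) copied from the table above
categories = {
    "STS": (7, 79.3),
    "Classification": (6, 62.4),
    "Pair Classification": (2, 76.2),
    "Reranking": (2, 64.4),
    "Clustering": (4, 36.9),
    "Retrieval": (3, 38.2),
}

total_tasks = sum(n for n, _ in categories.values())
overall = sum(n * score for n, score in categories.values()) / total_tasks

print(total_tasks, round(overall, 1))  # 24 61.4
```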

### STS benchmark comparison

| Model | Parameters | STS Score |
|-------|------------|-----------|
| Qwen3-Embedding-0.6B | 600M | 76.17 |
| **aegis-embed** | **307M** | **80.5** |
| Qwen3-Embedding-8B | 8B | 81.08 |

### 2D Matryoshka quality matrix (STS)

| Layers | 768d | 256d | 64d |
|--------|------|------|-----|
| 22L | **80.5** | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
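
The matrix is easier to read as retention relative to the full 22-layer, 768d configuration. The sketch below only re-expresses the scores in the table; `retention` is an illustrative helper.

```python
# STS Benchmark scores copied from the quality matrix: layers -> dim -> score
STS = {
    22: {768: 80.5, 256: 79.9, 64: 78.5},
    11: {768: 53.7, 256: 48.0, 64: 44.4},
    6: {768: 45.2, 256: 45.2, 64: 43.5},
    3: {768: 44.0, 256: 44.1, 64: 41.8},
}


def retention(layers, dim):
    """Score at (layers, dim) as a percentage of the full 22L / 768d score."""
    return 100.0 * STS[layers][dim] / STS[22][768]


for layers, row in STS.items():
    cells = ", ".join(f"{dim}d: {retention(layers, dim):.0f}%" for dim in row)
    print(f"{layers}L -> {cells}")
```

At 22 layers this reproduces the headline retention figures (about 99% at 256d and 98% at 64d), while every reduced-layer configuration retains roughly 52% to 67% of the full score.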

### Long-context retrieval (4K tokens)

| Metric | Score |
|--------|-------|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
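
For readers unfamiliar with these metrics, here is a minimal sketch of how R@k and MRR are commonly computed. It assumes one relevant document per query and takes the 1-based rank at which that document was retrieved; the card does not describe the exact evaluation protocol, so treat this as the standard definition rather than the script behind the numbers.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose relevant document appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; a query whose document was not retrieved scores 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Toy example: rank of the relevant document for four queries
# (None means it was not retrieved at all)
ranks = [1, 1, 3, None]

print(recall_at_k(ranks, 1))    # 0.5
print(recall_at_k(ranks, 10))   # 0.75
print(round(mean_reciprocal_rank(ranks), 4))  # 0.5833
```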

### Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|--------|------------|---------|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |
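
Read together with the quality matrix, the throughput table defines a latency / quality trade-off per layer count. The sketch below pairs the two sets of published figures and picks the fastest layer count that still meets a quality floor. `fastest_layers` and the floor values are hypothetical, and the pairing assumes the throughput and STS numbers were measured under comparable settings.

```python
# Figures copied from the tables in this card
THROUGHPUT = {22: 477, 11: 916, 6: 1573, 3: 2761}   # per second, AMD MI300X
STS_768D = {22: 80.5, 11: 53.7, 6: 45.2, 3: 44.0}   # STS Benchmark at 768d


def speedup(layers):
    """Throughput relative to the full 22-layer model."""
    return THROUGHPUT[layers] / THROUGHPUT[22]


def fastest_layers(min_sts):
    """Highest-throughput layer count whose 768d STS score meets the floor."""
    eligible = [n for n in THROUGHPUT if STS_768D[n] >= min_sts]
    return max(eligible, key=THROUGHPUT.get) if eligible else None


for n in THROUGHPUT:
    print(f"{n}L: {speedup(n):.1f}x throughput, STS {STS_768D[n]}")

print(fastest_layers(75.0))  # 22
print(fastest_layers(50.0))  # 11
```

With these numbers, any quality floor above 53.7 STS leaves the full 22-layer path as the only option, so reduced-layer paths are best reserved for coarse, latency-critical decisions.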

## Training

### Data

Trained on [BAAI/bge-m3-data](https://huggingface.co/datasets/BAAI/bge-m3-data) with multilingual triplets across diverse domains.

### Configuration

- **Base model**: [llm-semantic-router/mmbert-32k-yarn](https://huggingface.co/llm-semantic-router/mmbert-32k-yarn)
- **Loss**: `Matryoshka2dLoss` (combines adaptive layer loss and Matryoshka loss)
- **Matryoshka dimensions**: `[768, 512, 256, 128, 64]`
- **Max sequence length**: `32768`
- **Batch size**: `16` (effective `32` with gradient accumulation)
- **Learning rate**: `2e-5`
- **Hardware**: AMD Instinct MI300X

## Recommended use cases

`aegis-embed` is especially well suited for:

- **Agent memory retrieval** across long, mixed-format notes or histories
- **Tool and skill selection** where descriptions need semantic matching
- **Knowledge-base retrieval** for assistants and RAG systems
- **Multilingual search** across mixed-language corpora
- **Large memory banks** that benefit from 64d / 256d vector tiers
- **Long-document semantic indexing** where short-context encoders lose structure

## Model lineage and packaging

`aegis-embed` is derived from `llm-semantic-router/mmbert-embed-32k-2d-matryoshka` and distributed here as a lean Sentence Transformers / PyTorch package.

This build intentionally omits bundled ONNX artifacts so the model remains smaller and easier to move, mirror, cache, and deploy in environments that primarily rely on native Transformers runtimes.

## Limitations

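- Full-quality mode (all 22 layers) is still the best default for important retrieval decisions; layer reduction trades away substantial quality. On STS Benchmark at 768d, the 11-layer path scores 53.7 versus 80.5 for the full model.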
- Although the model supports up to 32K tokens, very long inputs still increase compute and memory cost.
- The model is optimized for retrieval and semantic similarity; some downstream tasks may benefit from task-specific fine-tuning.
- If your deployment stack requires ONNX out of the box, you will need to export that separately.

## Citation

If you use this model, please cite the upstream work it is derived from:

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```

## License

Apache 2.0