---
language:
- am
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:245876
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: rasyosef/roberta-base-amharic
widget:
- source_sentence: በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ ስልጠና የወሰዱ ዕጩ ዲፕሎማቶች ተመረቁ
  sentences:
  - የውጭ ጉዳይ ሚኒስቴር ከሜጀር ጄነራል ሀየሎም አርአያ ወታደራዊ አካዳሚ ጋር በመተባበር በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ
    ስልጠና የወሰዱ ዲፕሎማቶችን  አስመረቀ፡፡በወታደራዊ አካዳሚው ትላንት በተካሄደ የምርቃት ሥነ- ስርዓት ስልጠናውን ላገኙ 89
    ዕጩ ድፕሎማቶች የምስክር ወረቀት ተበረክቷል።
  - አዲስ አበባ፣ የካቲት 19፣ 2012 (ኤፍ.ቢ.ሲ) የኢፌዴሪ አየር ኃይል ለከፍተኛ መኮንኖች የማዕረግ እድገት ሰጥቷል።አየር
    ኃይሉ በዛሬው እለት በቢሾፍቱ በሚገኘው የኢፌዴሪ አየር ኃይል ጠቅላይ መምሪያ ባካሄደው ስነ ስርዓት ላይ የኢፌዴሪ ጦር ኃይሎች
    ምክተል ኤታማዦር ሹም ጄኔራል ብርሃኑ ጁላ እና የኢፌዴሪ አየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳን ጨምሮ ከፍተኛ
    አመራሮች ተገኝተዋል።በስነ ስርዓቱ ላይ 106 ለሚሆኑ መኮንኖች በአየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳ የተለያዩ
    የማዕረግ እድገቶችን ሰጥተዋል።
- source_sentence: ኢትዮጵያ ኢንተርኔትን በመዝጋቷ ከ130 ሚሊዮን ዶላር በላይ አጣች
  sentences:
  - የአሜሪካ ድምፅ ባለፉት ሰባ አምስት ዓመታት ውስጥ በዓለም ዙሪያ ያሉ የተለያዩ አድማጮችና ተመልካቾች ከሌሎች ምንጮች ሊያገኟቸው
    የማይችሏቸውን መረጃዎች ለዓለም ሲያደርስ መቆየቱን ዋና ዳይሬክተሯ አማንዳ ቤኔት ገልፀዋል።
  - የተቋሙ ጥናት የኢንተርኔን መዘጋት በሃገራት ምጣኔ ሐብት ላይ ያደረሰውን ጉዳት በተለያዩ መለኪያዎች የገመተ ሲሆን፤ በዚህም
    መሰረት ኢትዮጵያ ለ36 ቀናት ያህል ኢንተርኔትን በዘጋችበት እንዲሁም ለሰባት ቀናት ያህል በነበረው የማኅበራዊ ሚዲያ መናወጥ\
    ወቅት በጥቅሉ ከ130 ሚሊዮን ዶላር በላይ አጥታለች ይላል።

pipeline_tag: text-retrieval
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: RoBERTa Amharic Embed Base
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 768
      type: dim_768
    metrics:
    - type: cosine_recall@5
      value: 0.869800820152314
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9050966608084359
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.8036666074756674
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.7707977655033881
      name: Cosine Mrr@10
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 256
      type: dim_256
    metrics:
    - type: cosine_recall@5
      value: 0.8646748681898067
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9020210896309314
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.7977610383416281
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.764035577128722
      name: Cosine Mrr@10
datasets:
- rasyosef/Amharic-Passage-Retrieval-Dataset-V2
---

# Embedding-Amharic-Base

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [rasyosef/roberta-base-amharic](https://huggingface.co/rasyosef/roberta-base-amharic). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

It was introduced in the paper [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556).

- **Code:** [GitHub Repository](https://github.com/rasyosef/amharic-neural-ir)
- **Paper:** [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556)

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [rasyosef/roberta-base-amharic](https://huggingface.co/rasyosef/roberta-base-amharic) <!-- at revision b1a3d2c267262e2b82c83be9d4e59db762a5e931 -->
- **Maximum Sequence Length:** 510 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
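- **Training Dataset:** [rasyosef/Amharic-Passage-Retrieval-Dataset-V2](https://huggingface.co/datasets/rasyosef/Amharic-Passage-Retrieval-Dataset-V2)
- **Language:** Amharic (am)
- **License:** MIT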

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
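
The `Pooling` and `Normalize` modules above amount to a masked mean over token vectors followed by L2 normalization. The pure-Python sketch below illustrates this on toy 3-dimensional vectors standing in for the 768-dimensional token embeddings. The function names `mean_pool` and `l2_normalize` are illustrative and not part of the library.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Toy input: three token vectors, the last one is padding and is masked out.
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 4.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # [2.0, 2.0, 3.0]
embedding = l2_normalize(pooled)   # unit-length sentence embedding
print(pooled)
print(embedding)
```

Because the final embeddings have unit length, cosine similarity and dot product give the same ranking.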

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

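Then load the model and run inference. The `encode_query` and `encode_document` methods used here require Sentence Transformers v5.0 or later; on older versions, call `model.encode` instead.
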
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rasyosef/embedding-amharic-base")

# What is the capital of Ethiopia? / France
queries = ['የኢትዮጵያ ዋና ከተማ ማናት?', 'የፈረንሳይ ዋና ከተማ ማናት?']

# Addis Ababa, Gondar, Paris, London, Washington D.C.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ']

# Compute embeddings
query_embeddings = model.encode_query(queries) # [2, 768]
document_embeddings = model.encode_document(documents) # [5, 768]

# Calculate semantic similarity
similarities = model.similarity(
    query_embeddings,
    document_embeddings
)

print(similarities)
# tensor([[0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
#         [0.1777, 0.0770, 0.5714, 0.2596, 0.1076]])
```
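
For retrieval, each row of the similarity matrix is typically sorted to rank the documents for that query. The sketch below uses the scores printed above, copied into a plain list so that it runs without downloading the model. The helper `top_k` is a hypothetical name, not part of Sentence Transformers, and the document strings are English glosses of the Amharic originals.

```python
def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Similarity scores copied from the example output above (one row per query).
similarities = [
    [0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
    [0.1777, 0.0770, 0.5714, 0.2596, 0.1076],
]
# English glosses of the five documents, in the same order as `documents`.
documents = ["Addis Ababa", "Gondar", "Paris", "London", "Washington D.C."]

for row in similarities:
    best = [(documents[i], score) for i, score in top_k(row, k=2)]
    print(best)
# [('Addis Ababa', 0.5075), ('Gondar', 0.3114)]
# [('Paris', 0.5714), ('London', 0.2596)]
```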

## Evaluation

### Metrics

#### Information Retrieval

* Dataset: `dim_768`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 768
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_recall@5     | 0.8698     |
| cosine_recall@10    | 0.9051     |
| **cosine_ndcg@10**  | **0.8037** |
| cosine_mrr@10       | 0.7708     |
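
As a reminder of what these numbers measure, the sketch below computes recall@k, MRR@k, and nDCG@k for a single query with binary relevance. It is a simplified illustration rather than the `InformationRetrievalEvaluator` implementation, and the document IDs are made up. The reported values are averages of such per-query scores.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Discounted gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Hypothetical ranking in which the single relevant passage appears at rank 2.
ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d2"}

print(recall_at_k(ranked, relevant, k=1))   # 0.0
print(recall_at_k(ranked, relevant, k=5))   # 1.0
print(mrr_at_k(ranked, relevant, k=10))     # 0.5
print(ndcg_at_k(ranked, relevant, k=10))    # 1 / log2(3), about 0.63
```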

#### Information Retrieval

* Dataset: `dim_256`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 256
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_recall@5     | 0.8647     |
| cosine_recall@10    | 0.9020     |
| **cosine_ndcg@10**  | **0.7978** |
| cosine_mrr@10       | 0.7640     |
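
The `dim_256` results correspond to keeping only the first 256 components of each embedding, which the Matryoshka training objective listed in the card metadata is designed to support. In Sentence Transformers this is usually done by passing `truncate_dim=256` when constructing `SentenceTransformer`. The pure-Python sketch below shows the underlying operation on a toy vector; re-normalizing after truncation is an assumption added here so that dot products remain cosine similarities, and the helper name `truncate_embedding` is illustrative.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


# Toy 4-dimensional "full" embedding truncated to its first 2 dimensions.
full = [0.3, 0.4, 0.5, 0.1]
short = truncate_embedding(full, dim=2)

print(len(short))  # 2
print(short)       # approximately [0.6, 0.8]
```

Truncating to 256 dimensions cuts index storage to a third, at a cost of about 0.006 nDCG@10 according to the tables above.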

## Training Details

<details>

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `gradient_accumulation_steps`: 2
- `learning_rate`: 6e-05
- `num_train_epochs`: 6
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.025
- `fp16`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates
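
These settings are consistent with the step counts in the training logs. As a quick check, assuming a single training device (the card does not state the device count) and the `dataset_size:245876` value from the card metadata, the effective batch size and optimizer steps per epoch work out as follows:

```python
import math

dataset_size = 245_876             # from the `dataset_size` tag in the metadata
per_device_train_batch_size = 64
gradient_accumulation_steps = 2
num_devices = 1                    # assumption: not stated in the card

# Number of training pairs consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Optimizer updates needed to see the whole dataset once.
steps_per_epoch = math.ceil(dataset_size / effective_batch_size)

print(effective_batch_size)  # 128
print(steps_per_epoch)       # 1921
```

The result of 1921 steps matches the step count logged at epoch 1.0.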

### Training Logs
| Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_256_cosine_ndcg@10 |
|:-----:|:----:|:-------------:|:----------------------:|:----------------------:|
| -1    | -1   | -             | 0.0735                 | 0.0582                 |
| 1.0   | 1921 | 0.6769        | 0.7826                 | 0.7751                 |
| 2.0   | 3842 | 0.0700        | 0.7894                 | 0.7829                 |
| 3.0   | 5763 | 0.0254        | 0.8030                 | 0.7953                 |
| 4.0   | 7684 | 0.0139        | 0.8037                 | 0.7978                 |


### Framework Versions
- Python: 3.11.13
- Sentence Transformers: 4.1.0
- Transformers: 4.52.4
- PyTorch: 2.7.1+cu126
- Accelerate: 1.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1

</details>

## Citation

```bibtex
@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}
```