---
language:
- am
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:245876
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: rasyosef/roberta-base-amharic
widget:
- source_sentence: በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ ስልጠና የወሰዱ ዕጩ ዲፕሎማቶች ተመረቁ
  sentences:
  - የውጭ ጉዳይ ሚኒስቴር ከሜጀር ጄነራል ሀየሎም አርአያ ወታደራዊ አካዳሚ ጋር በመተባበር በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ
    ስልጠና የወሰዱ ዲፕሎማቶችን አስመረቀ፡፡በወታደራዊ አካዳሚው ትላንት በተካሄደ የምርቃት ሥነ- ስርዓት ስልጠናውን ላገኙ 89
    ዕጩ ድፕሎማቶች የምስክር ወረቀት ተበረክቷል።
  - አዲስ አበባ፣ የካቲት 19፣ 2012 (ኤፍ.ቢ.ሲ) የኢፌዴሪ አየር ኃይል ለከፍተኛ መኮንኖች የማዕረግ እድገት ሰጥቷል።አየር
    ኃይሉ በዛሬው እለት በቢሾፍቱ በሚገኘው የኢፌዴሪ አየር ኃይል ጠቅላይ መምሪያ ባካሄደው ስነ ስርዓት ላይ የኢፌዴሪ ጦር ኃይሎች
    ምክተል ኤታማዦር ሹም ጄኔራል ብርሃኑ ጁላ እና የኢፌዴሪ አየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳን ጨምሮ ከፍተኛ
    አመራሮች ተገኝተዋል።በስነ ስርዓቱ ላይ 106 ለሚሆኑ መኮንኖች በአየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳ የተለያዩ
    የማዕረግ እድገቶችን ሰጥተዋል።
- source_sentence: ኢትዮጵያ ኢንተርኔትን በመዝጋቷ ከ130 ሚሊዮን ዶላር በላይ አጣች
  sentences:
  - የአሜሪካ ድምፅ ባለፉት ሰባ አምስት ዓመታት ውስጥ በዓለም ዙሪያ ያሉ የተለያዩ አድማጮችና ተመልካቾች ከሌሎች ምንጮች ሊያገኟቸው
    የማይችሏቸውን መረጃዎች ለዓለም ሲያደርስ መቆየቱን ዋና ዳይሬክተሯ አማንዳ ቤኔት ገልፀዋል።
  - የተቋሙ ጥናት የኢንተርኔን መዘጋት በሃገራት ምጣኔ ሐብት ላይ ያደረሰውን ጉዳት በተለያዩ መለኪያዎች የገመተ ሲሆን፤ በዚህም
    መሰረት ኢትዮጵያ ለ36 ቀናት ያህል ኢንተርኔትን በዘጋችበት እንዲሁም ለሰባት ቀናት ያህል በነበረው የማኅበራዊ ሚዲያ መናወጥ\
    ወቅት በጥቅሉ ከ130 ሚሊዮን ዶላር በላይ አጥታለች ይላል።
pipeline_tag: text-retrieval
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: RoBERTa Amharic Embed Base
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 768
      type: dim_768
    metrics:
    - type: cosine_recall@5
      value: 0.869800820152314
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9050966608084359
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.8036666074756674
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.7707977655033881
      name: Cosine Mrr@10
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 256
      type: dim_256
    metrics:
    - type: cosine_recall@5
      value: 0.8646748681898067
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9020210896309314
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.7977610383416281
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.764035577128722
      name: Cosine Mrr@10
datasets:
- rasyosef/Amharic-Passage-Retrieval-Dataset-V2
---
# Embedding-Amharic-Base
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [rasyosef/roberta-base-amharic](https://huggingface.co/rasyosef/roberta-base-amharic). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
It was introduced in the paper [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556).
- **Code:** [GitHub Repository](https://github.com/rasyosef/amharic-neural-ir)
- **Paper:** [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556)
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [rasyosef/roberta-base-amharic](https://huggingface.co/rasyosef/roberta-base-amharic)
- **Maximum Sequence Length:** 510 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Amharic (am)
- **License:** MIT
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
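The last two modules above (mean pooling over tokens, then normalization) can be illustrated without the library. The following is a minimal sketch in plain Python, not the library's implementation: the helper names `mean_pool` and `l2_normalize` are hypothetical, and the toy input uses 4 dimensions instead of 768.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, so padding is ignored."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input (hypothetical values): 3 token positions, 4 dimensions.
# The last position is padding and must not influence the result.
token_embeddings = [
    [1.0, 2.0, 0.0, 2.0],
    [3.0, 0.0, 0.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)  # [2.0, 1.0, 0.0, 2.0]
sentence_embedding = l2_normalize(pooled)             # norm of pooled is 3.0
print(sentence_embedding)
```

Because the output has unit length, the cosine similarity between two embeddings reduces to their dot product.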
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("rasyosef/embedding-amharic-base")
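# Queries: "What is the capital of Ethiopia?" and "What is the capital of France?"
queries = ['የኢትዮጵያ ዋና ከተማ ማናት?', 'የፈረንሳይ ዋና ከተማ ማናት?']
# Documents: Addis Ababa, Gondar, Paris, London, Washington D.C.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ']
# Compute embeddings
# Note: encode_query / encode_document are available in Sentence Transformers v5.0+;
# on earlier releases, model.encode() can be used for both queries and documents.
query_embeddings = model.encode_query(queries) # [2, 768]
document_embeddings = model.encode_document(documents) # [5, 768]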
# Calculate semantic similarity
similarities = model.similarity(
    query_embeddings,
    document_embeddings
)
print(similarities)
# tensor([[0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
# [0.1777, 0.0770, 0.5714, 0.2596, 0.1076]])
```
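To turn the similarity matrix into retrieval results, each row can be sorted by score. The sketch below reuses the scores printed above as plain Python lists, so it runs without downloading the model; `top_k` is a hypothetical helper, not part of Sentence Transformers.

```python
# Same documents as in the example above: Addis Ababa, Gondar, Paris, London, Washington D.C.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ']

# Scores copied from the printed output above (one row per query).
similarities = [
    [0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
    [0.1777, 0.0770, 0.5714, 0.2596, 0.1076],
]

def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

rankings = [top_k(row, 2) for row in similarities]
for row, ranked in zip(similarities, rankings):
    print([(documents[i], row[i]) for i in ranked])
```

For the first query (capital of Ethiopia) the top hit is Addis Ababa, and for the second (capital of France) it is Paris.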
## Evaluation
### Metrics
#### Information Retrieval
* Dataset: `dim_768`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
```json
{
    "truncate_dim": 768
}
```
| Metric | Value |
|:--------------------|:-----------|
| cosine_recall@5 | 0.8698 |
| cosine_recall@10 | 0.9051 |
| **cosine_ndcg@10** | **0.8037** |
| cosine_mrr@10 | 0.7708 |
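For readers unfamiliar with these metrics, the sketch below computes Recall@k, MRR@k, and NDCG@k on toy rankings. It assumes exactly one relevant passage per query and binary relevance, which is an assumption made for illustration and not something this card states. The document ids and rankings are hypothetical, and the functions are not the evaluator's own code.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, relevant_id, k):
    """Reciprocal of the rank of the relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k):
    """With a single relevant document the ideal DCG is 1, so NDCG equals DCG."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Hypothetical results: (ranked document ids, id of the relevant document).
runs = [
    (["d1", "d2", "d3"], "d1"),  # relevant document at rank 1
    (["d4", "d5", "d6"], "d6"),  # relevant document at rank 3
    (["d7", "d8", "d9"], "d0"),  # relevant document not retrieved
]

recall_at_5 = mean([recall_at_k(r, rel, 5) for r, rel in runs])  # 2/3
mrr_at_10 = mean([mrr_at_k(r, rel, 10) for r, rel in runs])      # (1 + 1/3 + 0) / 3
ndcg_at_10 = mean([ndcg_at_k(r, rel, 10) for r, rel in runs])    # (1 + 1/2 + 0) / 3
print(recall_at_5, mrr_at_10, ndcg_at_10)
```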
#### Information Retrieval
* Dataset: `dim_256`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
```json
{
    "truncate_dim": 256
}
```
| Metric | Value |
|:--------------------|:-----------|
| cosine_recall@5 | 0.8647 |
| cosine_recall@10    | 0.9020     |
| **cosine_ndcg@10**  | **0.7978** |
| cosine_mrr@10       | 0.7640     |
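The `dim_256` results are obtained by keeping only the first 256 dimensions of each embedding. In Sentence Transformers this is typically done by passing `truncate_dim=256` when constructing `SentenceTransformer`. The sketch below illustrates the idea in plain Python with toy 4-dimensional vectors cut to 2 dimensions. The vectors and the helper `truncate_and_normalize` are hypothetical. Re-normalizing does not change cosine similarity; it only allows a plain dot product to be used afterwards.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unit-length embeddings (4 dimensions standing in for 768).
query = [0.6, 0.0, 0.8, 0.0]
doc = [0.8, 0.6, 0.0, 0.0]

full_score = dot(query, doc)            # similarity using all dimensions
query_small = truncate_and_normalize(query, 2)
doc_small = truncate_and_normalize(doc, 2)
small_score = dot(query_small, doc_small)  # similarity using the first 2 dimensions
print(full_score, small_score)

# Share of the 768-dimensional NDCG@10 retained at 256 dimensions,
# using the values reported in the tables above.
retention = 0.7978 / 0.8037
print(f"{retention:.1%}")
```

With the reported scores, the 256-dimensional embeddings retain roughly 99% of the full-size NDCG@10 while using a third of the storage per vector.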
## Training Details
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: epoch
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `gradient_accumulation_steps`: 2
- `learning_rate`: 6e-05
- `num_train_epochs`: 6
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.025
- `fp16`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates
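The hyperparameters above determine the number of optimizer steps. The sketch below works through that arithmetic, assuming a single training device (an assumption; the card does not state the device count). The dataset size comes from the `dataset_size:245876` tag. The `lr_at` function is a hypothetical illustration of the usual linear-warmup-plus-cosine-decay shape, not code taken from the trainer.

```python
import math

dataset_size = 245_876              # from the `dataset_size` tag in the card metadata
per_device_train_batch_size = 64
gradient_accumulation_steps = 2
num_devices = 1                     # assumption: single device
num_train_epochs = 6
learning_rate = 6e-05
warmup_ratio = 0.025

# Samples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)  # 128

steps_per_epoch = math.ceil(dataset_size / effective_batch_size)  # 1921
total_steps = steps_per_epoch * num_train_epochs                  # 11526
warmup_steps = math.ceil(total_steps * warmup_ratio)              # 289

def lr_at(step):
    """Linear warmup to the peak rate, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under this assumption the computed 1921 steps per epoch agrees with the step count logged at epoch 1.0 in the training logs.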
### Training Logs
| Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_256_cosine_ndcg@10 |
|:-----:|:----:|:-------------:|:----------------------:|:----------------------:|
| -1 | -1 | - | 0.0735 | 0.0582 |
| 1.0 | 1921 | 0.6769 | 0.7826 | 0.7751 |
| 2.0 | 3842 | 0.07 | 0.7894 | 0.7829 |
| 3.0 | 5763 | 0.0254 | 0.8030 | 0.7953 |
| 4.0 | 7684 | 0.0139 | 0.8037 | 0.7978 |
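Since `load_best_model_at_end` is enabled, the released weights presumably correspond to the best evaluated checkpoint. The sketch below picks that row from the log above, assuming the selection metric is `dim_768_cosine_ndcg@10` (an assumption; the card does not name the metric). The dictionary keys are hypothetical shorthand for the table columns.

```python
# Rows copied from the training log above (the pre-training row is omitted).
training_log = [
    {"epoch": 1.0, "step": 1921, "ndcg10_768": 0.7826, "ndcg10_256": 0.7751},
    {"epoch": 2.0, "step": 3842, "ndcg10_768": 0.7894, "ndcg10_256": 0.7829},
    {"epoch": 3.0, "step": 5763, "ndcg10_768": 0.8030, "ndcg10_256": 0.7953},
    {"epoch": 4.0, "step": 7684, "ndcg10_768": 0.8037, "ndcg10_256": 0.7978},
]

# Select the checkpoint with the highest 768-dimensional NDCG@10.
best = max(training_log, key=lambda row: row["ndcg10_768"])
print(best["epoch"], best["step"], best["ndcg10_768"], best["ndcg10_256"])
```

The selected row (epoch 4.0, step 7684) matches the NDCG@10 values reported in the Evaluation section for both dimensions.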
### Framework Versions
- Python: 3.11.13
- Sentence Transformers: 4.1.0
- Transformers: 4.52.4
- PyTorch: 2.7.1+cu126
- Accelerate: 1.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1
## Citation
```bibtex
@inproceedings{alemneh2026amharicir,
title = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
author = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
year = {2026},
}
```