Model: rasyosef/embedding-amharic-base
language: am
license: mit
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:245876
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: rasyosef/roberta-base-amharic
widget:
  - source_sentence: በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ ስልጠና የወሰዱ ዕጩ ዲፕሎማቶች ተመረቁ
    sentences:
      - የውጭ ጉዳይ ሚኒስቴር ከሜጀር ጄነራል ሀየሎም አርአያ ወታደራዊ አካዳሚ ጋር በመተባበር በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ ስልጠና የወሰዱ ዲፕሎማቶችን  አስመረቀ፡፡በወታደራዊ አካዳሚው ትላንት በተካሄደ የምርቃት ሥነ- ስርዓት ስልጠናውን ላገኙ 89 ዕጩ ድፕሎማቶች የምስክር ወረቀት ተበረክቷል።
      - አዲስ አበባ፣ የካቲት 19፣ 2012 (ኤፍ.ቢ.ሲ) የኢፌዴሪ አየር ኃይል ለከፍተኛ መኮንኖች የማዕረግ እድገት ሰጥቷል።አየር ኃይሉ በዛሬው እለት በቢሾፍቱ በሚገኘው የኢፌዴሪ አየር ኃይል ጠቅላይ መምሪያ ባካሄደው ስነ ስርዓት ላይ የኢፌዴሪ ጦር ኃይሎች ምክተል ኤታማዦር ሹም ጄኔራል ብርሃኑ ጁላ እና የኢፌዴሪ አየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳን ጨምሮ ከፍተኛ አመራሮች ተገኝተዋል።በስነ ስርዓቱ ላይ 106 ለሚሆኑ መኮንኖች በአየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳ የተለያዩ የማዕረግ እድገቶችን ሰጥተዋል።
  - source_sentence: ኢትዮጵያ ኢንተርኔትን በመዝጋቷ ከ130 ሚሊዮን ዶላር በላይ አጣች
    sentences:
      - የአሜሪካ ድምፅ ባለፉት ሰባ አምስት ዓመታት ውስጥ በዓለም ዙሪያ ያሉ የተለያዩ አድማጮችና ተመልካቾች ከሌሎች ምንጮች ሊያገኟቸው የማይችሏቸውን መረጃዎች ለዓለም ሲያደርስ መቆየቱን ዋና ዳይሬክተሯ አማንዳ ቤኔት ገልፀዋል።
      - የተቋሙ ጥናት የኢንተርኔን መዘጋት በሃገራት ምጣኔ ሐብት ላይ ያደረሰውን ጉዳት በተለያዩ መለኪያዎች የገመተ ሲሆን፤ በዚህም መሰረት ኢትዮጵያ ለ36 ቀናት ያህል ኢንተርኔትን በዘጋችበት እንዲሁም ለሰባት ቀናት ያህል በነበረው የማኅበራዊ ሚዲያ መናወጥ ወቅት በጥቅሉ ከ130 ሚሊዮን ዶላር በላይ አጥታለች ይላል።
pipeline_tag: text-retrieval
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: RoBERTa Amharic Embed Base
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 768
          type: dim_768
        metrics:
          - type: cosine_recall@5
            value: 0.869800820152314
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9050966608084359
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8036666074756674
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7707977655033881
            name: Cosine Mrr@10
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: cosine_recall@5
            value: 0.8646748681898067
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9020210896309314
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7977610383416281
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.764035577128722
            name: Cosine Mrr@10
datasets:
  - rasyosef/Amharic-Passage-Retrieval-Dataset-V2

Embedding-Amharic-Base

This is a sentence-transformers model fine-tuned from rasyosef/roberta-base-amharic. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

It was introduced in the paper "The Multilingual Curse at the Retrieval Layer: Evidence from Amharic".

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: rasyosef/roberta-base-amharic
  • Maximum Sequence Length: 510 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Amharic (am)
  • License: MIT

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
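
To make the Pooling and Normalize stages concrete, here is a minimal NumPy sketch of mean pooling over non-padding tokens followed by L2 normalization. It runs on random toy data and is an illustration of the idea, not the library's implementation; the function name mean_pool_and_normalize and the toy shapes are assumptions made for this example.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average the non-padding token vectors, then scale each result to unit L2 norm."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # [batch, seq, 1]
    summed = (token_embeddings * mask).sum(axis=1)                     # [batch, dim]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                     # [batch, 1]
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy input (hypothetical): 2 sequences, 4 token positions, 768-dimensional token vectors.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 768))
attention_mask = np.array([[1, 1, 1, 0],   # last position is padding
                           [1, 1, 0, 0]])  # last two positions are padding

sentence_embeddings = mean_pool_and_normalize(token_embeddings, attention_mask)
print(sentence_embeddings.shape)                    # (2, 768)
print(np.linalg.norm(sentence_embeddings, axis=1))  # each norm is 1.0 up to float error
```

Because the final vectors have unit length, their dot product equals their cosine similarity.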

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

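Then load the model and run inference. The encode_query and encode_document methods used below were added in Sentence Transformers v5.0; on older releases, model.encode can be used instead.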

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rasyosef/embedding-amharic-base")

# What is the capital of Ethiopia? / France
queries = ['የኢትዮጵያ ዋና ከተማ ማናት?', 'የፈረንሳይ ዋና ከተማ ማናት?'] 

# Addis Ababa, Gondar, Paris, London, Washington D.C.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ'] 

# Compute embeddings
query_embeddings = model.encode_query(queries) # [2, 768]
document_embeddings = model.encode_document(documents) # [5, 768]

# Calculate semantic similarity
similarities = model.similarity(
    query_embeddings, 
    document_embeddings
)

print(similarities)
# tensor([[0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
#         [0.1777, 0.0770, 0.5714, 0.2596, 0.1076]])
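
A similarity matrix like the one printed above is usually turned into a ranking per query. The sketch below does this with NumPy, using the scores copied from that printed output so it runs without downloading the model; the variable names are illustrative. With the torch tensor returned by model.similarity, torch.argsort(similarities, dim=1, descending=True) should give the same ordering.

```python
import numpy as np

documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ']

# Scores copied from the printed tensor (rows: queries, columns: documents).
similarities = np.array([
    [0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
    [0.1777, 0.0770, 0.5714, 0.2596, 0.1076],
])

# Sort each row by descending similarity to get document indices in rank order.
ranking = np.argsort(-similarities, axis=1)

for query_idx, order in enumerate(ranking):
    best = order[0]
    print(query_idx, documents[best], similarities[query_idx, best])
# 0 አዲስ አበባ 0.5075
# 1 ፓሪስ 0.5714
```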

Evaluation

Metrics

Information Retrieval (dataset: dim_768)

| Metric | Value |
| --- | --- |
| cosine_recall@5 | 0.8698 |
| cosine_recall@10 | 0.9051 |
| cosine_ndcg@10 | 0.8037 |
| cosine_mrr@10 | 0.7708 |

Information Retrieval (dataset: dim_256)

| Metric | Value |
| --- | --- |
| cosine_recall@5 | 0.8647 |
| cosine_recall@10 | 0.9020 |
| cosine_ndcg@10 | 0.7978 |
| cosine_mrr@10 | 0.7640 |
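
For readers unfamiliar with the metric names, the sketch below computes recall@k, MRR@k, and nDCG@k with binary relevance on a made-up toy ranking. The document IDs, the toy rankings, and the helper names are all hypothetical, and the evaluator that produced the reported numbers may differ in details.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance DCG in the top k, divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# Toy data: (ranked document IDs, set of relevant IDs) for three queries.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant document at rank 2
    (["d2", "d9", "d4"], {"d2"}),  # relevant document at rank 1
    (["d5", "d6", "d8"], {"d0"}),  # relevant document not retrieved
]

mean = lambda values: sum(values) / len(values)
recall_5 = mean([recall_at_k(r, rel, 5) for r, rel in runs])
mrr_10 = mean([mrr_at_k(r, rel, 10) for r, rel in runs])
ndcg_10 = mean([ndcg_at_k(r, rel, 10) for r, rel in runs])
print(recall_5, mrr_10, ndcg_10)
```

Each reported value is the mean of the per-query score, as in the last three lines of the sketch.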

Training Details

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • gradient_accumulation_steps: 2
  • learning_rate: 6e-05
  • num_train_epochs: 6
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.025
  • fp16: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
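
The model card metadata tags the training run with loss:MatryoshkaLoss and loss:MultipleNegativesRankingLoss. As a rough illustration of how those two ideas combine, the NumPy sketch below computes an in-batch-negatives cross-entropy loss on embeddings truncated to several prefix lengths and sums the results. The dimensions (768, 256), the equal weighting, the scale of 20, and the toy data are assumptions for this example, not settings confirmed by the card.

```python
import numpy as np

def in_batch_negatives_loss(queries, passages, scale=20.0):
    """Cross-entropy over scaled cosine similarities; passage i is the positive for query i."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                            # [batch, batch]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_loss(queries, passages, dims=(768, 256)):
    """Sum the base loss over embeddings truncated to each prefix length (assumed dims)."""
    return sum(in_batch_negatives_loss(queries[:, :d], passages[:, :d]) for d in dims)

# Toy batch (hypothetical): each passage is a slightly noisy copy of its query.
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 768))
passages = queries + 0.1 * rng.normal(size=(8, 768))

aligned = matryoshka_loss(queries, passages)
shuffled = matryoshka_loss(queries, np.roll(passages, 1, axis=0))  # positives mismatched
print(aligned < shuffled)  # True
```

Since every other passage in the batch acts as a negative, a batch sampler that avoids duplicate texts (no_duplicates above) keeps a query's true positive from also appearing as a negative.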

Training Logs

| Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_256_cosine_ndcg@10 |
| --- | --- | --- | --- | --- |
| -1 | -1 | - | 0.0735 | 0.0582 |
| 1.0 | 1921 | 0.6769 | 0.7826 | 0.7751 |
| 2.0 | 3842 | 0.07 | 0.7894 | 0.7829 |
| 3.0 | 5763 | 0.0254 | 0.8030 | 0.7953 |
| 4.0 | 7684 | 0.0139 | 0.8037 | 0.7978 |
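
The Step column advances by 1921 per epoch, which is consistent with the listed batch size, gradient accumulation, and the dataset_size tag in the metadata. The arithmetic below is a sketch that assumes a single training device (the card does not state the device count) and that the last partial batch is kept; the warmup figure assumes the warmup ratio is applied to the total step count and rounded up.

```python
import math

dataset_size = 245_876   # from the dataset_size tag in the model card metadata
per_device_batch = 64    # per_device_train_batch_size
grad_accum = 2           # gradient_accumulation_steps
num_devices = 1          # assumption: not stated in the card
epochs = 6               # num_train_epochs
warmup_ratio = 0.025

batches_per_epoch = math.ceil(dataset_size / (per_device_batch * num_devices))
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)   # optimizer updates per epoch
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
# 1921 11526 289
```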

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.1+cu126
  • Accelerate: 1.7.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}