---
language:
- am
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:245876
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: rasyosef/roberta-base-amharic
widget:
- source_sentence: በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ ስልጠና የወሰዱ ዕጩ ዲፕሎማቶች ተመረቁ
  sentences:
  - የውጭ ጉዳይ ሚኒስቴር ከሜጀር ጄነራል ሀየሎም አርአያ ወታደራዊ አካዳሚ ጋር በመተባበር በኢትዮጵያ ለመጀመሪያ ጊዜ ወታደራዊ ስልጠና የወሰዱ ዲፕሎማቶችን  አስመረቀ፡፡በወታደራዊ አካዳሚው ትላንት በተካሄደ የምርቃት ሥነ- ስርዓት ስልጠናውን ላገኙ 89 ዕጩ ድፕሎማቶች የምስክር ወረቀት ተበረክቷል።
  - አዲስ አበባ፣ የካቲት 19፣ 2012 (ኤፍ.ቢ.ሲ) የኢፌዴሪ አየር ኃይል ለከፍተኛ መኮንኖች የማዕረግ እድገት ሰጥቷል።አየር ኃይሉ በዛሬው እለት በቢሾፍቱ በሚገኘው የኢፌዴሪ አየር ኃይል ጠቅላይ መምሪያ ባካሄደው ስነ ስርዓት ላይ የኢፌዴሪ ጦር ኃይሎች ምክተል ኤታማዦር ሹም ጄኔራል ብርሃኑ ጁላ እና የኢፌዴሪ አየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳን ጨምሮ ከፍተኛ አመራሮች ተገኝተዋል።በስነ ስርዓቱ ላይ 106 ለሚሆኑ መኮንኖች በአየር ኃይል ዋና አዛዥ ሜጀር ጄኔራል ይልማ መርዳሳ የተለያዩ የማዕረግ እድገቶችን ሰጥተዋል።
- source_sentence: ኢትዮጵያ ኢንተርኔትን በመዝጋቷ ከ130 ሚሊዮን ዶላር በላይ አጣች
  sentences:
  - የአሜሪካ ድምፅ ባለፉት ሰባ አምስት ዓመታት ውስጥ በዓለም ዙሪያ ያሉ የተለያዩ አድማጮችና ተመልካቾች ከሌሎች ምንጮች ሊያገኟቸው የማይችሏቸውን መረጃዎች ለዓለም ሲያደርስ መቆየቱን ዋና ዳይሬክተሯ አማንዳ ቤኔት ገልፀዋል።
  - የተቋሙ ጥናት የኢንተርኔን መዘጋት በሃገራት ምጣኔ ሐብት ላይ ያደረሰውን ጉዳት በተለያዩ መለኪያዎች የገመተ ሲሆን፤ በዚህም መሰረት ኢትዮጵያ ለ36 ቀናት ያህል ኢንተርኔትን በዘጋችበት እንዲሁም ለሰባት ቀናት ያህል በነበረው የማኅበራዊ ሚዲያ መናወጥ ወቅት በጥቅሉ ከ130 ሚሊዮን ዶላር በላይ አጥታለች ይላል።
pipeline_tag: text-retrieval
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: RoBERTa Amharic Embed Base
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 768
      type: dim_768
    metrics:
    - type: cosine_recall@5
      value: 0.869800820152314
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9050966608084359
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.8036666074756674
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.7707977655033881
      name: Cosine Mrr@10
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 256
      type: dim_256
    metrics:
    - type: cosine_recall@5
      value: 0.8646748681898067
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9020210896309314
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.7977610383416281
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.764035577128722
      name: Cosine Mrr@10
datasets:
- rasyosef/Amharic-Passage-Retrieval-Dataset-V2
---

# Embedding-Amharic-Base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [rasyosef/roberta-base-amharic](https://huggingface.co/rasyosef/roberta-base-amharic). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

It was introduced in the paper [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556).
- **Code:** [GitHub Repository](https://github.com/rasyosef/amharic-neural-ir)
- **Paper:** [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556)

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [rasyosef/roberta-base-amharic](https://huggingface.co/rasyosef/roberta-base-amharic)
- **Maximum Sequence Length:** 510 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** am
- **License:** mit

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 510, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rasyosef/embedding-amharic-base")

# What is the capital of Ethiopia? / France
queries = ['የኢትዮጵያ ዋና ከተማ ማናት?', 'የፈረንሳይ ዋና ከተማ ማናት?']

# Addis Ababa, Gondar, Paris, London, Washington D.C.
documents = ['አዲስ አበባ', 'ጎንደር', 'ፓሪስ', 'ለንደን', 'ዋሽንግተን ዲሲ']

# Compute embeddings
query_embeddings = model.encode_query(queries)  # [2, 768]
document_embeddings = model.encode_document(documents)  # [5, 768]

# Calculate semantic similarity
similarities = model.similarity(
    query_embeddings,
    document_embeddings
)
print(similarities)
# tensor([[0.5075, 0.3114, 0.0798, 0.1967, 0.1340],
#         [0.1777, 0.0770, 0.5714, 0.2596, 0.1076]])
```

## Evaluation

### Metrics

#### Information Retrieval

* Dataset: `dim_768`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 768
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_recall@5     | 0.8698     |
| cosine_recall@10    | 0.9051     |
| **cosine_ndcg@10**  | **0.8037** |
| cosine_mrr@10       | 0.7708     |

#### Information Retrieval

* Dataset: `dim_256`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 256
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_recall@5     | 0.8647     |
| cosine_recall@10    | 0.9020     |
| **cosine_ndcg@10**  | **0.7978** |
| cosine_mrr@10       | 0.7640     |

## Training Details
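The card metadata tags the model with `loss:MatryoshkaLoss` and `loss:MultipleNegativesRankingLoss`, and the evaluation reports results at 768 and 256 dimensions. The sketch below illustrates how these two losses combine in principle, using plain Python and toy 4-dimensional vectors. It is not the training code: the dimension list, the equal per-dimension weighting, and the scale of 20 (the Sentence Transformers default for `MultipleNegativesRankingLoss`) are assumptions, and the helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(queries, passages, scale=20.0):
    """In-batch negatives ranking loss.

    passages[i] is the positive for queries[i]; every other passage in the
    batch is treated as a negative. The loss is the mean cross-entropy over
    the scaled cosine similarities.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, passage) for passage in passages]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


def matryoshka_loss(queries, passages, dims, scale=20.0):
    """Sum the ranking loss over embeddings truncated to each prefix length."""
    return sum(
        mnr_loss([q[:d] for q in queries], [p[:d] for p in passages], scale)
        for d in dims
    )


# Toy embeddings: 4 dimensions stand in for 768, and 2 for 256 (assumption).
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
passages = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

loss = matryoshka_loss(queries, passages, dims=[4, 2])
print(f"matryoshka loss: {loss:.6f}")
```

Because the loss is applied to every prefix length, the leading dimensions are trained to work on their own. That is why truncating an embedding to its first 256 dimensions (the `truncate_dim` setting used in the evaluation) costs little retrieval quality.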
### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `gradient_accumulation_steps`: 2
- `learning_rate`: 6e-05
- `num_train_epochs`: 6
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.025
- `fp16`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

### Training Logs

| Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_256_cosine_ndcg@10 |
|:-----:|:----:|:-------------:|:----------------------:|:----------------------:|
| -1    | -1   | -             | 0.0735                 | 0.0582                 |
| 1.0   | 1921 | 0.6769        | 0.7826                 | 0.7751                 |
| 2.0   | 3842 | 0.07          | 0.7894                 | 0.7829                 |
| 3.0   | 5763 | 0.0254        | 0.8030                 | 0.7953                 |
| 4.0   | 7684 | 0.0139        | 0.8037                 | 0.7978                 |

### Framework Versions

- Python: 3.11.13
- Sentence Transformers: 4.1.0
- Transformers: 4.52.4
- PyTorch: 2.7.1+cu126
- Accelerate: 1.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1
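As a sanity check, the step counts in the training logs can be reproduced from the hyperparameters and the `dataset_size:245876` tag in the card metadata. The short calculation below assumes single-device training, which the card does not state. It also assumes the `no_duplicates` batch sampler leaves the batch count unchanged.

```python
import math

dataset_size = 245_876        # from the `dataset_size` tag in the card metadata
per_device_batch_size = 64    # per_device_train_batch_size
grad_accum_steps = 2          # gradient_accumulation_steps
num_devices = 1               # assumption: single device, not stated in the card

# Samples consumed per optimizer update
effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps

# Mini-batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(dataset_size / (per_device_batch_size * num_devices))
steps_per_epoch = batches_per_epoch // grad_accum_steps

# Cumulative optimizer steps at the end of epochs 1 to 4
logged_steps = [steps_per_epoch * epoch for epoch in range(1, 5)]

print(effective_batch_size)  # 128
print(steps_per_epoch)       # 1921
print(logged_steps)          # [1921, 3842, 5763, 7684]
```

Under these assumptions the computed values match the Step column of the training logs, which suggests an effective batch size of 128 training pairs per update.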
## Citation

```bibtex
@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}
```