---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- Algerian AI
- Algerian
- algeria
- darja
- darija
- algerian darija
- algerian dialect
- rag
- ar
- multilingual-e5
- generated_from_trainer
- loss:MultipleNegativesRankingLoss
base_model: intfloat/multilingual-e5-base
widget:
- source_sentence: 'query: Renault Kangoo 2019'
  sentences:
  - >-
    passage: سيارة Renault Kangoo 2019 Confort · مازوت · يدوية · 1.5 DCI 90ch ·
    المسافة: 199,000 كم · السعر: 420 مليون دج · سيسبونسيو 10/10

    موتور 10/10

    سبيغة 0

    كلشي معاود فيها جديد
  - >-
    passage: سيارة Dfsq Dfsq 2013 · بنزين · يدوية · 1.1 · المسافة: 280 كم ·
    السعر: 140 مليون دج · باتنة · مفيهش معاود

    موتور محطوط جديد
  - >-
    passage: بيع فيلا تيبازة بوسماعيل · فيلا · السعر: 8 مليون دج · تيبازة ·
    agence immobilier LABID agrée par l'état met en vente trés bel villa r+2 de
    sup 250 m² deux facade dans un résidence clôturé et gardée jour et nuit
    libre de suite l'villa avec toute commanditée :

    - rdc : deux garage pour 7 véhicule + studio + jardain

    - 1ére étage : salon de chambre + cuisine + salle de bain + sanitaire

    - 2éme étage : salon +3 chambre + sanitaire + Hammam

    - 3éme étage : grand salon + deux terrasse

    - chauffage centrale

    - climatisation

    - caméra de surveillance

    - bâché d'eau

    - acte et livret foncier

    - les prix : 8 milliards nég lég

    - pour plus d'informations consultéz agence labid au :

    -
- source_sentence: 'query: location terrain Oran'
  sentences:
  - >-
    passage: كراء عمارة وهران وهران · ارض · 90 م² · السعر: 6 مليون دج · وهران ·
    location plusieurs appartements dans un immeuble de 5 étages et avec
    ascenseur

    les appartements sont neuf jamais habité

    merci de nous contacter pour savoir plus de détails .
  - >-
    passage: سيارة Kia Seltos 2025 LUXuRY · بنزين · اوتوماتيك · 1.5 · السعر: 545
    مليون دج · الوادي
  - >-
    passage: سيارة Peugeot 308 2015 Active · مازوت · يدوية · 1.6 e HDI 112ch ·
    المسافة: 375,000 كم · وهران · Je vente 308 jdida machya 375000
- source_sentence: 'query: villa Alger avec jardin'
  sentences:
  - >-
    passage: بيع شقة 3 غرف الجزائر العاشور · شقة · 3 غرف · السعر: 3 مليون دج ·
    الجزائر ·vente une appartement a el3achour Hawch chawech De 96m F3 en 3 em
    etg avec la scenseur tout comoditie chauffage central climatisation cuisine
    équipée boxe pour stationnement les caméras de surveillance avec act et
    livret foncièr
  - >-
    passage: كراء شقة دوبلكس 4 غرف الجزائر العاشور · شقة · 4 غرف · مطبخ مجهز ·
    تدفئة مركزية · تكييف · تيراس · مفروش · جناح غرفة النوم · السعر: 29 مليون دج
    · الجزائر · El Achour Location d’un Duplex F4 meublé de haut standing
    superficie 213 m²



    Le Duplex se compose :


    Niveau 1: une entrée, un joli séjour avec une salle à manger, une cuisine
    équipée haute gamme, sanitaire + hammame, terrasse.


    Niveau 2 : 3 chambres dont une master bed room, une salle de bain avec
    jacuzzi, espace bureau, 2 balcons.


    Équipements : climatisation, chauffage central, double vitrage, stores
    électriques, visiophone, 1 place de parking.


    Commodités de la résidence : ascenseur, parking, gardiennage 24h/24, aire de
    jeux pour enfants, espaces verts pour vos moments de détente.
  - >-
    passage: كراء شقة 5 غرف البليدة البليدة · شقة · 5 غرف · السعر: 5 مليون دج ·
    البليدة · 203m plus ascenseur
- source_sentence: 'query: Cuxi Cuxi 2025'
  sentences:
  - >-
    passage: سيارة Volkswagen Golf 7 2016 Trendline + · مازوت · يدوية · 2.0 TDI
    110ch · المسافة: 280,000 كم
  - >-
    passage: سيارة Opel Corsa 2001 Corsa · مازوت · يدوية · 1.7 D 60ch · المسافة:
    350,000 كم · السعر: 65 مليون دج · موتور نعاود يدور شهرة السبيغة فيها سوباسمو
  - >-
    passage: سيارة Cuxi Cuxi 2025 · بنزين · اوتوماتيك · Yamaha 110 · المسافة:
    9,250 كم · السعر: 28 مليون دج · قسنطينة · Cuxi 2025 jdida état 10/10
- source_sentence: 'query: Rani nhawes 3la tonobil Hyundai i10'
  sentences:
  - 'passage: بيع شقة غرفتين 3 غرف 4 غرف وهران بئر الجير · شقة · 3 غرف · وهران'
  - >-
    passage: سيارة Kia Cerato 2008 · مازوت · يدوية · المسافة: 230,000 كم ·
    السعر: 135 مليون دج · سوق اهراس · Problem də terage
  - >-
    passage: سيارة Hyundai i10 2014 GLS · بنزين · يدوية · 1.1 · المسافة: 300,000
    كم · عين تموشنت · Fiha bantoura
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: mit
language:
- ar
- fr
---

# AlgerianME5

**algerianME5** is a specialized **Sentence Transformer** model that maps Algerian search queries and listings to a 768-dimensional dense vector space. It is fine-tuned to capture the vocabulary and nuances of the Algerian car and real estate markets, where listings often mix Arabic, French, and Darja in both Arabic and Latin script.

Note: For more details about the methodology, data synthesis, and evaluation, [see the full Medium story](https://medium.com/@ayoubhimeur/building-a-semantic-search-engine-for-algerian-marketplaces-cc04a0008346).

## Key Features

- **Domain-specific**: Optimized for Algerian real estate and automotive vocabulary such as "sbigha", "f3", and "livret foncier".
- **Cross-lingual retrieval**: Maps informal Latin-script queries such as "tonobil mliha" to formal Arabic or French listing descriptions.
- **Robust embeddings**: Built on the intfloat/multilingual-e5-base architecture.

## Use Cases

- **Semantic search**: Find relevant listings even when the keywords don't match exactly (use it as a second retrieval layer).
- **Textual similarity**: Compare two listings to find duplicates or similar items.
- **Clustering**: Group listings by sub-market or by vehicle/property type.

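
As a rough illustration of the textual-similarity use case, the sketch below flags near-duplicate listings from embeddings that have already been computed. It assumes the vectors are L2-normalized, so cosine similarity reduces to a dot product. The function name `find_near_duplicates`, the toy 3-dimensional vectors, and the 0.9 threshold are illustrative assumptions rather than values from this model card; a real threshold would need tuning on your own listings.

```python
from itertools import combinations


def find_near_duplicates(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair whose dot product reaches the threshold.

    Assumes each vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        score = sum(x * y for x, y in zip(a, b))
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy unit-length vectors standing in for real 768-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
    [0.96, 0.28, 0.0],
]

duplicates = find_near_duplicates(toy_embeddings, threshold=0.9)
for i, j, score in duplicates:
    print(f"listings {i} and {j} look similar (score {score:.2f})")
```

With real data, the toy vectors would be replaced by the output of `model.encode(...)` on the listing texts.
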
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) <!-- at revision 835193815a3936a24a0ee7dc9e3d48c1fbb19c55 -->
- **Maximum Sequence Length:** 256 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

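
To make the last two stages of the architecture concrete, here is a minimal pure-Python sketch of mean pooling over non-padding tokens followed by L2 normalization. It only illustrates the idea behind `pooling_mode_mean_tokens` and `Normalize()`; it is not the library's implementation, and the function names and the tiny 2-dimensional token vectors are made up for the example.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Three toy token vectors; the last one is padding and is masked out.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
print(pooled, sentence_embedding)
```

Because the output is unit length, the dot product of two sentence embeddings equals their cosine similarity.
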
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference. Prefix queries with `query: ` and listings with `passage: `, as in the examples below.

```python
from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("81melody/algerianME5")

# One query followed by two candidate listings
sentences = [
    'query: Rani nhawes 3la tonobil Hyundai i10',
    'passage: سيارة Hyundai i10 2014 GLS · بنزين · يدوية · 1.1 · المسافة: 300,000 كم · عين تموشنت · Fiha bantoura',
    'passage: سيارة Kia Cerato 2008 · مازوت · يدوية · المسافة: 230,000 كم · السعر: 135 مليون دج · سوق اهراس · Problem də terage',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```

**OR**, to run a small semantic search over a set of listings:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("81melody/algerianME5")

listings = [
    # REAL ESTATE
    "بيع شقة 4 غرف الجزائر شراقة · شقة · 4 غرف · السعر: 4 مليون دج · Appartement Composé De 1 Suite Parentale... Résidence sécurisée",
    "كراء شقة 4 غرف وهران وهران · شقة · 4 غرف · Location appartement par jour pour familles",
    "بيع ارض تلمسان مغنية · ارض · الجزائر · بلان فالسانك مليح",
    "كراء محل الجزائر الابيار · محل تجاري · 105 م² · Local avec Deux rideaux",

    # CARS
    "سيارة MG Zs Ev 2024 Comfort · بنزين · يدوية · 1.5 VTi-Tech 106ch · المسافة: 67,000 كم · Très beau SUV comme neuf",
    "سيارة Hyundai Grand i10 2018 Restylée DZ · بنزين · يدوية · 1.2 ess 87ch · السعر: 265 مليون دج · صبيغة فيها لال و لامان",
    "سيارة Renault Clio 4 2018 GT Line + · مازوت · يدوية · 1.5 DCI 85ch · السعر: 330 مليون دج",
]

queries = [
    "شقة 4 غرف الجزائر",
    "dar lel bi3 fi Alger centre",
    "ard lel bi3 telemcan",
    "chhal souma MG Zs Ev",
    "Grand I10 2018 Restylée DZ",
    "tonobil mliha fiha sbigha shwia",
]

# Queries and listings are encoded with different prefixes
q_prefix = "query: "
p_prefix = "passage: "

# Encode all listings once, then reuse the embeddings for every query
encoded_listings = model.encode(
    [f"{p_prefix}{listing}" for listing in listings],
    convert_to_tensor=True,
    show_progress_bar=False,
)

for query in queries:
    print(f"\nQuery: '{query}'")

    query_emb = model.encode(f"{q_prefix}{query}", convert_to_tensor=True)
    hits = util.semantic_search(query_emb, encoded_listings, top_k=3)[0]

    for hit in hits:
        score = hit["score"]
        text = listings[hit["corpus_id"]]
        # Truncate long listings for display
        display_text = text[:100] + "..." if len(text) > 100 else text
        print(f"[Score: {score:.3f}] {display_text}")
```

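
Both snippets above rely on the `query: ` and `passage: ` prefixes. One way to apply them consistently is a small helper like the sketch below. The helper names are hypothetical and not part of Sentence Transformers; the only thing taken from this card is the two prefix strings.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(text, prefix):
    """Prepend the prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


def prepare_queries(queries):
    """Format raw user queries for encoding."""
    return [with_prefix(q, QUERY_PREFIX) for q in queries]


def prepare_passages(passages):
    """Format raw listing texts for encoding."""
    return [with_prefix(p, PASSAGE_PREFIX) for p in passages]


prepared_queries = prepare_queries(["chhal souma MG Zs Ev", "query: Renault Kangoo 2019"])
prepared_passages = prepare_passages(["  Local avec Deux rideaux "])
print(prepared_queries)
print(prepared_passages)
```

The prefix check keeps already-formatted text from being prefixed twice, which is easy to do by accident when queries pass through several layers of an application.
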
## Training Details

### Training Dataset

* Size: 100,000 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence_0                                                                        | sentence_1                                                                         |
  |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                             |
  | details | <ul><li>min: 7 tokens</li><li>mean: 11.07 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 17 tokens</li><li>mean: 82.2 tokens</li><li>max: 256 tokens</li></ul> |
* Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | <code>query: بيع محل وهران بئر</code> | <code>passage: بيع محل وهران بئر الجير · محل تجاري · 750 م² · السعر: 20 مليار دج · وهران · On vous propose en vente un local de 750 m² (550 m² en rez-de-chaussée et 200 m² sous pente) , avec deux rideaux électriques , pour le prix de : 20 Milliards fixe .<br><br>Pour plus de détails veuillez nous contacter</code> |
  | <code>query: شقة الجزائر برج</code> | <code>passage: بيع شقة الجزائر برج الكيفان · شقة · 1 غرف · 64 م² · وثائق: دفتر عقاري · عقد موثق · الجزائر · 🔔OPPORTUNITÉ EN OR 🔔<br>– T2 à vendre +paiement par tranche dans 24mois<br><br>❄️À seulement quelques pas de la piscine, dans une site sécurisée et bien située, ce T2 en semi-finis une valeur sûre pour tout investisseur avisé.<br><br>Pourquoi ce bien est exceptionnel ?<br>✅️Localisation stratégique, très demandée<br>✅️Retour sur investissement rapide<br>✅️Prêt à être exploité dès l’achat !<br>✅️Un petit prix pour un grand potentiel.<br>✅️Les bonnes affaires ne durent jamais longtemps…<br>Saisissez cette opportunité maintenant !</code> |
  | <code>query: GX3 PRO 2025 X3 Pro</code> | <code>passage: سيارة Geely GX3 PRO 2025 X3 pro livane · بنزين · اوتوماتيك · 1.5 · المسافة: جديدة · بجاية · Vent une livane x3pro neuf carte grise Safia</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false,
      "directions": [
          "query_to_doc"
      ],
      "partition_mode": "joint",
      "hardness_mode": null,
      "hardness_strength": 0.0
  }
  ```

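
For intuition about what this loss optimizes, the sketch below computes an in-batch ranking loss in plain Python: each query is scored against every passage in the batch with scaled cosine similarity, and cross-entropy pushes the matching passage (same index) above the others. It is a simplified illustration that uses only the `scale` and `cos_sim` settings listed above, covers only the query-to-document direction, and ignores the remaining options; the function names and toy vectors are assumptions for the example, not the library's code.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(query_embs, passage_embs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and every other passage in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cos_sim(query, passage) for passage in passage_embs]
        max_logit = max(logits)  # subtracted for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.3f}")
```

The loss is near zero when each query is closest to its own passage and grows to roughly the scale value when the pairs are swapped.
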
### Training Hyperparameters

#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `fp16`: True
- `multi_dataset_batch_sampler`: round_robin

### Training Logs

| Epoch  | Step  | Training Loss |
|:------:|:-----:|:-------------:|
| ...    | ...   | ...           |
| 2.32   | 14500 | 0.2827        |
| 2.4    | 15000 | 0.3062        |
| 2.48   | 15500 | 0.3045        |
| 2.56   | 16000 | 0.2841        |

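
The Epoch column can be cross-checked against the dataset size and batch size reported above. The sketch below assumes a single device and no gradient accumulation, neither of which is stated in this card; under that assumption, 100,000 samples at batch size 16 give 6,250 steps per epoch, which agrees with the logged rows.

```python
import math

num_samples = 100_000        # from "Training Dataset"
per_device_batch_size = 16   # from "Non-Default Hyperparameters"
num_devices = 1              # assumption: not stated in the card
grad_accumulation_steps = 1  # assumption: not stated in the card

effective_batch_size = per_device_batch_size * num_devices * grad_accumulation_steps
steps_per_epoch = math.ceil(num_samples / effective_batch_size)


def epoch_at(step):
    """Fractional epoch reached after the given optimizer step."""
    return step / steps_per_epoch


# Rows copied from the training log table
logged = {14500: 2.32, 15000: 2.4, 15500: 2.48, 16000: 2.56}
for step, logged_epoch in logged.items():
    print(step, round(epoch_at(step), 2), logged_epoch)
```
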
### Framework Versions

- Python: 3.12.13
- Sentence Transformers: 5.3.0
- Transformers: 5.0.0
- PyTorch: 2.10.0+cu128
- Accelerate: 1.13.0
- Datasets: 4.0.0
- Tokenizers: 0.22.2

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{oord2019representationlearningcontrastivepredictive,
    title={Representation Learning with Contrastive Predictive Coding},
    author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
    year={2019},
    eprint={1807.03748},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/1807.03748},
}
```

## Contact

I am interested in any further related work. Contact me at mohamed.himeur@student.unamur.be