---
license: apache-2.0
language:
- ar
- arz
library_name: transformers
pipeline_tag: automatic-speech-recognition
datasets:
- YouTube
- rsalshalan/MGB3
- pain/MASC
- mozilla-foundation/common_voice_15_0
- halabi2016/arabic_speech_corpus
model-index:
- name: egyptian-arabic-wav2vec2-xlsr-53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_17_0
      type: mozilla-foundation/common_voice_17_0
      args: ar
    metrics:
    - name: Test WER
      type: wer
      value: 27.20
base_model:
- omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian
---

# 🐪🇪🇬 Egyptian Arabic ASR: wav2vec2-large-xlsr-53 Fine-tuned

This model is a fine-tuned version of [omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian](https://huggingface.co/omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian). It improves automatic speech recognition (ASR) for **Egyptian Arabic**, **Modern Standard Arabic (MSA)**, and **Gulf / Levantine Arabic**.

---

## 📚 Dataset

The model was trained on a combination of publicly available and custom-collected Arabic speech datasets, including:

- **📺 YouTube Egyptian Arabic Speech** *(custom-curated)*
- **🎧 MASC** *(Media Arabic Speech Corpus)*
- **🌍 Common Voice 15 - Arabic**
- **📻 MGB-3 Broadcast Speech**
- **🗂️ Arabic Speech Corpus**

---

## 🔥 Model Highlights

- 📌 Focused on real-life Egyptian Arabic speech (YouTube, spontaneous, conversational)
- 🚀 Supports MSA and other Arabic dialects
- ๐Ÿ”‰ Trained on both scripted and natural speech --- ## ๐Ÿ’ฌ Languages & Dialects | Dialect | Coverage | | ---------------------------- | ------------ | | Egyptian Arabic | โœ… Primary | | Modern Standard Arabic (MSA) | โœ… Supported | | Gulf / Levantine | โœ… Supported | --- ## ๐Ÿš€ Usage ```python from transformers import pipeline asr = pipeline("automatic-speech-recognition", model="IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53") asr("path/to/audio.wav") # Long-Form Transcription: https://huggingface.co/blog/asr-chunking asr = pipeline("automatic-speech-recognition", model="IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53", chunk_length_s=30) asr("path/to/audio.wav") ``` ```python from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor import torch import torchaudio model = Wav2Vec2ForCTC.from_pretrained("IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53") processor = Wav2Vec2Processor.from_pretrained("IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53") # Load audio (must be mono, 16kHz) waveform, sr = torchaudio.load("path/to/audio.wav") # Convert to mono if not already if waveform.shape[0] > 1: waveform = torch.mean(waveform, dim=0, keepdim=True) # Resample if needed to 16 kHz if sr != 16000: resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000) waveform = resampler(waveform) inputs = processor(waveform.squeeze(), sampling_rate=16000, return_tensors="pt") with torch.inference_mode(): logits = model(**inputs).logits predicted_ids = torch.argmax(logits, dim=-1) transcription = processor.batch_decode(predicted_ids) print(transcription) ``` --- ## ๐Ÿงช Evaluation ```python import torch import torchaudio import re from datasets import load_dataset from evaluate import load from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor # Device setup device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # ๐Ÿ”‘ Replace with your Hugging Face token and the desired Wav2Vec2-based model ID HF_TOKEN = "your_hf_token" MODEL_NAME = 
"your_model_name_or_path" # Load the Common Voice 17.0 Arabic test split test_dataset = load_dataset( "mozilla-foundation/common_voice_17_0", "ar", split="test", token=HF_TOKEN ) # Load WER metric wer = load("wer") # Load processor and model processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME, token=HF_TOKEN) model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME, token=HF_TOKEN).to(device) # Define regex for cleaning up unwanted characters CHARS_TO_IGNORE_REGEX = r'[\ุ›\โ€”\_get\ยซ\ยป\ู€\,\?\.\!\-\;\:"\โ€œ\%\โ€˜\โ€\๏ฟฝ\#\ุŒ\โ˜ญ,\ุŸ]' def preprocess(batch): """Removes unwanted characters and resamples audio to 16kHz.""" batch["sentence"] = re.sub(CHARS_TO_IGNORE_REGEX, "", batch["sentence"]) speech_array, sampling_rate = torchaudio.load(batch["path"]) resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000) batch["speech"] = resampler(speech_array).squeeze().numpy() return batch # Apply preprocessing test_dataset = test_dataset.map(preprocess) def predict(batch): """Runs inference and decodes predicted text.""" inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True) with torch.inference_mode(): logits = model( input_values=inputs["input_values"].to(device), attention_mask=inputs["attention_mask"].to(device) ).logits predicted_ids = torch.argmax(logits, dim=-1) batch["pred_strings"] = processor.batch_decode(predicted_ids) return batch # Run prediction result = test_dataset.map(predict, batched=True, batch_size=8) # Compute and print Word Error Rate wer_score = wer.compute(predictions=result["pred_strings"], references=result["sentence"]) print(f"WER: {wer_score * 100:.2f}%") ``` --- ## ๐Ÿ—ฃ๏ธ Model Comparison on Common Voice 17.0 Arabic Subset (Test Set) | **Model** | **WER (%)** | | -------------------------------------------------- | ----------: | | **`IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53`** | **27.20** | | `jonatasgrosman/wav2vec2-large-xlsr-53-arabic` | 45.55 | | 
`AndrewMcDowell/wav2vec2-xls-r-300m-arabic` | 47.22 | | `openai/whisper-large-v3`* | 52.36 | | `Ahmed107/hamsa-v0.6Q`* | 53.27 | | `nadsoft/hamsa-v0.1-beta`* | 65.60 | | `openai/whisper-medium`* | 67.75 | | `openai/whisper-small`* | 74.16 | | `omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian` | 91.82 | | `arbml/wav2vec2-large-xlsr-53-arabic-egyptian` | 93.92 | | `mboushaba/whisper-large-v3-turbo-arabic`* | 96.90 | \*: *Whisper models were decoded using beam search (`beam_size = 5`) and evaluated using `BasicTextNormalizer` with `remove_diacritics=False` and `split_letters=False`, applied to both predictions and reference text.* --- ## โœจ Citation If you want to cite this model you can use this: ```bibtex @misc{amin2025egyptianasr, title={Egyptian Arabic ASR with wav2vec2 XLSR 53}, author={Ibrahim Amin}, year={2025}, howpublished={\url{https://huggingface.co/IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53}}, } ```
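
---

## 🧮 How WER Is Computed

The WER figures in the comparison table come from the `wer` metric loaded in the evaluation script. As an illustration only, and not the code that metric runs, the sketch below computes WER in the standard way: word-level edit distance (substitutions, deletions, insertions) summed over all utterances, divided by the total number of reference words. The function names `edit_distance` and `corpus_wer` and the toy sentences are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] = distance between the reference prefix seen so far and hyp_words[:j]
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words, pooled over utterances."""
    errors = 0
    ref_len = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        ref_len += len(ref_words)
    return errors / ref_len


# Toy data (hypothetical): 2 errors in the first pair, 1 in the second, 6 reference words
refs = ["a b c d", "e f"]
preds = ["a x c", "e f g"]
print(f"WER: {corpus_wer(refs, preds) * 100:.2f}%")  # WER: 50.00%
```

Because errors are counted against the reference length, a system that inserts many extra words can score above 100%, and scores are only comparable when predictions and references go through the same text cleaning.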