ModelHub XC 3470c2e74a Initialize project; model provided by the ModelHub XC community
Model: IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53
Source: Original Platform
2026-05-08 11:40:38 +08:00

---
license: apache-2.0
language:
- ar
- arz
library_name: transformers
pipeline_tag: automatic-speech-recognition
datasets:
- YouTube
- rsalshalan/MGB3
- pain/MASC
- mozilla-foundation/common_voice_15_0
- halabi2016/arabic_speech_corpus
model-index:
- name: egyptian-arabic-wav2vec2-xlsr-53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_17_0
      type: mozilla-foundation/common_voice_17_0
      args: ar
    metrics:
    - name: Test WER
      type: wer
      value: 27.20
base_model:
- omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian
---
# 🐪🇪🇬 Egyptian Arabic ASR — wav2vec2-large-xlsr-53 Fine-tuned
This model is a fine-tuned version of [omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian](https://huggingface.co/omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian)
that improves Automatic Speech Recognition for **Egyptian Arabic**, **Modern Standard Arabic (MSA)**, and **Gulf / Levantine Arabic**.

---
## 📚 Dataset
The model was trained on a combination of publicly available and custom-collected Arabic speech datasets:
- **📺 YouTube Egyptian Arabic Speech** *(custom-curated)*
- **🎧 MASC** *(Media Arabic Speech Corpus)*
- **🌍 Common Voice 15 - Arabic**
- **📻 MGB-3 Broadcast Speech**
- **🗂️ Arabic Speech Corpus**
---
## 🔥 Model Highlights
- 📌 Focused on real-life Egyptian Arabic speech (YouTube, spontaneous, conversational)
- 🚀 Also supports MSA and Gulf / Levantine Arabic
- 🔉 Trained on both scripted and natural speech
---
## 💬 Languages & Dialects
| Dialect | Coverage |
| ---------------------------- | ------------ |
| Egyptian Arabic | ✅ Primary |
| Modern Standard Arabic (MSA) | ✅ Supported |
| Gulf / Levantine | ✅ Supported |
---
## 🚀 Usage
```python
from transformers import pipeline
asr = pipeline("automatic-speech-recognition", model="IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53")
asr("path/to/audio.wav")
# Long-Form Transcription: https://huggingface.co/blog/asr-chunking
asr = pipeline("automatic-speech-recognition", model="IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53", chunk_length_s=30)
asr("path/to/audio.wav")
```
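For intuition about what `chunk_length_s=30` does, the sketch below lays out overlapping windows over a long recording. It is a simplified illustration, not the pipeline's actual code: the function name `chunk_windows` is hypothetical, and the stride of one sixth of the chunk length on each side is an assumption based on the pipeline's documented default.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=None):
    """Return (start, end) times in seconds of overlapping chunks.

    Consecutive chunks overlap by 2 * stride_s, so each chunk has
    some audio context on both sides.
    """
    if stride_s is None:
        stride_s = chunk_s / 6  # assumed default stride on each side
    step = chunk_s - 2 * stride_s  # how far each window advances
    if step <= 0:
        raise ValueError("stride_s must be smaller than chunk_s / 2")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording split into 30-second chunks with 5-second strides
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```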
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio
model = Wav2Vec2ForCTC.from_pretrained("IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53")
processor = Wav2Vec2Processor.from_pretrained("IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53")
# Load audio (must be mono, 16kHz)
waveform, sr = torchaudio.load("path/to/audio.wav")
# Convert to mono if not already
if waveform.shape[0] > 1:
    waveform = torch.mean(waveform, dim=0, keepdim=True)
# Resample if needed to 16 kHz
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    waveform = resampler(waveform)
inputs = processor(waveform.squeeze(), sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
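The `argmax` followed by `batch_decode` in the block above amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. A minimal sketch on a toy vocabulary follows; the vocabulary, the blank id of 0, and the helper name `greedy_ctc_decode` are illustrative assumptions, not the model's real tokenizer settings.

```python
def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary (hypothetical): 0 is the blank, "|" marks a word boundary
vocab = {0: "<blank>", 1: "a", 2: "b", 3: "|"}
# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would give for one utterance
frames = [1, 1, 0, 1, 2, 2, 3, 2, 0, 2]
text = greedy_ctc_decode(frames, vocab).replace("|", " ")
print(text)  # prints "aab bb"
```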
---
## 🧪 Evaluation
```python
import torch
import torchaudio
import re
from datasets import load_dataset
from evaluate import load
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# Device setup
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# 🔑 Replace with your Hugging Face token and the desired Wav2Vec2-based model ID
HF_TOKEN = "your_hf_token"
MODEL_NAME = "your_model_name_or_path"
# Load the Common Voice 17.0 Arabic test split
test_dataset = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "ar",
    split="test",
    token=HF_TOKEN,
)
# Load WER metric
wer = load("wer")
# Load processor and model
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME, token=HF_TOKEN)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME, token=HF_TOKEN).to(device)
# Define regex for cleaning up unwanted characters
CHARS_TO_IGNORE_REGEX = r'[\؛\—\_get\«\»\ـ\,\?\.\!\-\;\:"\“\%\\”\#\،\☭,\؟]'
def preprocess(batch):
    """Removes unwanted characters and resamples audio to 16kHz."""
    batch["sentence"] = re.sub(CHARS_TO_IGNORE_REGEX, "", batch["sentence"])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
# Apply preprocessing
test_dataset = test_dataset.map(preprocess)
def predict(batch):
    """Runs inference and decodes predicted text."""
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.inference_mode():
        logits = model(
            input_values=inputs["input_values"].to(device),
            attention_mask=inputs["attention_mask"].to(device)
        ).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(predicted_ids)
    return batch
# Run prediction
result = test_dataset.map(predict, batched=True, batch_size=8)
# Compute and print Word Error Rate
wer_score = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
print(f"WER: {wer_score * 100:.2f}%")
```
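The `wer` metric used above is the word-level edit distance between prediction and reference, divided by the number of reference words. As a rough illustration of that definition (not the `evaluate` implementation, which aggregates over the whole test set), the helper below, with the hypothetical name `word_error_rate`, computes it for a single sentence pair using a standard dynamic-programming edit distance.

```python
def word_error_rate(reference, prediction):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the first i-1 reference words
    # and the first j predicted words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of four reference words
score = word_error_rate("the cat sat down", "the cat sat up")
print(f"WER: {score * 100:.2f}%")  # prints "WER: 25.00%"
```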
---
## 🗣️ Model Comparison on Common Voice 17.0 Arabic Subset (Test Set)
| **Model** | **WER (%)** |
| -------------------------------------------------- | ----------: |
| **`IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53`** | **27.20** |
| `jonatasgrosman/wav2vec2-large-xlsr-53-arabic` | 45.55 |
| `AndrewMcDowell/wav2vec2-xls-r-300m-arabic` | 47.22 |
| `openai/whisper-large-v3`* | 52.36 |
| `Ahmed107/hamsa-v0.6Q`* | 53.27 |
| `nadsoft/hamsa-v0.1-beta`* | 65.60 |
| `openai/whisper-medium`* | 67.75 |
| `openai/whisper-small`* | 74.16 |
| `omarxadel/wav2vec2-large-xlsr-53-arabic-egyptian` | 91.82 |
| `arbml/wav2vec2-large-xlsr-53-arabic-egyptian` | 93.92 |
| `mboushaba/whisper-large-v3-turbo-arabic`* | 96.90 |

\*: *Whisper models were decoded using beam search (`beam_size = 5`) and evaluated using `BasicTextNormalizer` with `remove_diacritics=False` and `split_letters=False`, applied to both predictions and reference text.*

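
To put the gap between rows in perspective, relative WER reduction can be computed directly from the table. The snippet below is a small arithmetic sketch using values copied from the table; the helper name `relative_wer_reduction` is introduced here for illustration only. Because the Whisper rows use a different decoding and normalization setup, such comparisons are indicative rather than exact.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Percentage by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer * 100


# WER values (%) taken from the comparison table
this_model = 27.20
baselines = {
    "jonatasgrosman/wav2vec2-large-xlsr-53-arabic": 45.55,
    "openai/whisper-large-v3": 52.36,
}
for name, wer_value in baselines.items():
    reduction = relative_wer_reduction(wer_value, this_model)
    print(f"{name}: {reduction:.1f}% relative reduction")
```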
---
## ✨ Citation
To cite this model, use the following BibTeX entry:
```bibtex
@misc{amin2025egyptianasr,
  title={Egyptian Arabic ASR with wav2vec2 XLSR 53},
  author={Ibrahim Amin},
  year={2025},
  howpublished={\url{https://huggingface.co/IbrahimAmin/egyptian-arabic-wav2vec2-xlsr-53}},
}
```