初始化项目,由ModelHub XC社区提供模型

Model: TaloCreations/whisper-darija-finetuned
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-08 11:34:43 +08:00
commit 1b676f7889
12 changed files with 117285 additions and 0 deletions

35
.gitattributes vendored Normal file
View File

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

177
README.md Normal file
View File

@@ -0,0 +1,177 @@
---
library_name: transformers
tags:
- automatic-speech-recognition
- audio
- darija
- moroccan-arabic
- whisper
- fine-tuned
---
# Model Card for Whisper Darija (Fine-Tuned)
This is a fine-tuned [OpenAI Whisper small model](https://huggingface.co/openai/whisper-small) on Moroccan Darija speech transcription. It is trained to transcribe Moroccan dialectal Arabic from audio.
## Model Details
### Model Description
This model is a fine-tuned version of `giannitto/whisper-morocco-model` using a dataset of Moroccan Darija audio and transcriptions. The fine-tuning process aimed to improve the model's Word Error Rate (WER) for spoken Darija, which is underrepresented in many multilingual speech models.
- **Developed by:** Bentaleb Ali
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s):** Moroccan Darija (Arabic dialect)
- **License:** Apache 2.0
- **Finetuned from model:** giannitto/whisper-morocco-model
### Model Sources
- **Repository:** https://huggingface.co/TaloCreations/whisper-darija-finetuned
## Uses
### Direct Use
This model is intended for transcription of Moroccan Darija audio into text. It can be used in:
- Voice assistants
- Media subtitling
- Dialectal speech processing
- Linguistic research
### Out-of-Scope Use
- Translation tasks (this model is for transcription, not translation)
- Other Arabic dialects outside Moroccan Darija
## Bias, Risks, and Limitations
- The model may perform poorly on noisy or low-quality recordings.
- The model may not generalize well to other dialects of Arabic.
- Biases in the training data (e.g., gender, age, region) may affect transcription accuracy.
### Recommendations
Carefully evaluate outputs when using the model in sensitive applications. Avoid using it in high-risk domains without human verification.
## How to Get Started with the Model
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch, torchaudio
# Load model and processor
processor = AutoProcessor.from_pretrained("TaloCreations/whisper-darija-finetuned")
model = AutoModelForSpeechSeq2Seq.from_pretrained("TaloCreations/whisper-darija-finetuned")
model.eval()
speech, sr = torchaudio.load("path_to_record.wav")
if sr != 16000:
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
speech = resampler(speech)
# Preprocess and generate
inputs = processor(speech[0], sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("📢 Transcription:", transcription)
```
## Training Details
### Training Data
The model was trained on:
- [atlasia/DODa-audio-dataset Viewer](https://huggingface.co/datasets/atlasia/DODa-audio-dataset)
- [adiren7/darija_speech_to_text](https://huggingface.co/datasets/adiren7/darija_speech_to_text)
These datasets contain manually transcribed audio samples of Moroccan Darija.
### Training Procedure
#### Preprocessing
- All audio was resampled to 16kHz
- Mel spectrograms were padded to 3000 frames (30s max)
- Transcripts were tokenized and clipped to <=448 tokens
- Decoder prompts were injected to ensure language/task alignment
#### Training Hyperparameters
- Batch size: 8 (gradient accumulation = 2)
- Epochs: 10
- Learning rate: 2e-6
- Mixed precision: fp16
- Weight decay: 0.01
- Warmup steps: 500
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
A held-out subset (10%) of the training datasets.
#### Metrics
- Word Error Rate (WER)
### Results
### 📊 Training Progress
| Epoch | Training Loss | Validation Loss | Word Error Rate (WER) |
|-------|----------------|------------------|------------------------|
| 1 | 0.905000 | 0.831409 | 0.825147 |
| 2 | 0.773200 | 0.712022 | 0.732625 |
| 3 | 0.658900 | 0.652096 | 0.631158 |
| 4 | 0.609100 | 0.608619 | 0.578152 |
| 5 | 0.548400 | 0.579711 | 0.546444 |
| 6 | 0.509700 | 0.561768 | 0.524927 |
| 7 | 0.482000 | 0.551717 | 0.522067 |
| 8 | 0.459400 | 0.545695 | 0.526979 |
| 9 | 0.446500 | 0.543017 | 0.497141 |
| 10 | 0.443200 | 0.542152 | 0.504545 |
#### Summary
After 10 epochs, the model achieved a WER of ~50%, a significant improvement over baseline multilingual Whisper models on Moroccan Darija.
## Environmental Impact
Estimated based on training on a single A100 GPU for ~6.5 hours.
- **Hardware Type:** A100
- **Hours used:** ~6.5
- **Cloud Provider:** Google Cloud (Colab)
- **Compute Region:** Morocco
## Technical Specifications
### Model Architecture and Objective
- Whisper (small) encoder-decoder architecture
- Objective: sequence-to-sequence transcription
### Compute Infrastructure
- Google Colab Pro
- 1x A100 GPU
- PyTorch + Transformers 4.39
## Citation
```bibtex
title={Whisper Darija: Fine-tuned Whisper Model for Moroccan Arabic Speech},
author={Bentaleb, Ali},
year={2025},
}
```
## Model Card Authors
- Ali Bentaleb [@TaloCreations](https://huggingface.co/TaloCreations)
## Model Card Contact
- 📧 alitennis131800@gmail.com

1609
added_tokens.json Normal file

File diff suppressed because it is too large Load Diff

61
config.json Normal file
View File

@@ -0,0 +1,61 @@
{
"_name_or_path": "./whisper-darija-finetuned/checkpoint-1098",
"activation_dropout": 0.0,
"activation_function": "gelu",
"apply_spec_augment": false,
"architectures": [
"WhisperForConditionalGeneration"
],
"attention_dropout": 0.0,
"begin_suppress_tokens": null,
"bos_token_id": 50257,
"classifier_proj_size": 256,
"d_model": 768,
"decoder_attention_heads": 12,
"decoder_ffn_dim": 3072,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"decoder_start_token_id": 50258,
"dropout": 0.0,
"encoder_attention_heads": 12,
"encoder_ffn_dim": 3072,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 50257,
"forced_decoder_ids": [
[
1,
50259
],
[
2,
50359
],
[
3,
50363
]
],
"init_std": 0.02,
"is_encoder_decoder": true,
"mask_feature_length": 10,
"mask_feature_min_masks": 0,
"mask_feature_prob": 0.0,
"mask_time_length": 10,
"mask_time_min_masks": 2,
"mask_time_prob": 0.05,
"max_length": null,
"max_source_positions": 1500,
"max_target_positions": 448,
"median_filter_width": 7,
"model_type": "whisper",
"num_hidden_layers": 12,
"num_mel_bins": 80,
"pad_token_id": 50257,
"scale_embedding": false,
"torch_dtype": "float32",
"transformers_version": "4.49.0",
"use_cache": true,
"use_weighted_layer_sum": false,
"vocab_size": 51865
}

254
generation_config.json Normal file
View File

@@ -0,0 +1,254 @@
{
"alignment_heads": [
[
5,
3
],
[
5,
9
],
[
8,
0
],
[
8,
4
],
[
8,
7
],
[
8,
8
],
[
9,
0
],
[
9,
7
],
[
9,
9
],
[
10,
5
]
],
"begin_suppress_tokens": [
220,
50257
],
"bos_token_id": 50257,
"decoder_start_token_id": 50258,
"eos_token_id": 50257,
"is_multilingual": true,
"lang_to_id": {
"<|af|>": 50327,
"<|am|>": 50334,
"<|ar|>": 50272,
"<|as|>": 50350,
"<|az|>": 50304,
"<|ba|>": 50355,
"<|be|>": 50330,
"<|bg|>": 50292,
"<|bn|>": 50302,
"<|bo|>": 50347,
"<|br|>": 50309,
"<|bs|>": 50315,
"<|ca|>": 50270,
"<|cs|>": 50283,
"<|cy|>": 50297,
"<|da|>": 50285,
"<|de|>": 50261,
"<|el|>": 50281,
"<|en|>": 50259,
"<|es|>": 50262,
"<|et|>": 50307,
"<|eu|>": 50310,
"<|fa|>": 50300,
"<|fi|>": 50277,
"<|fo|>": 50338,
"<|fr|>": 50265,
"<|gl|>": 50319,
"<|gu|>": 50333,
"<|haw|>": 50352,
"<|ha|>": 50354,
"<|he|>": 50279,
"<|hi|>": 50276,
"<|hr|>": 50291,
"<|ht|>": 50339,
"<|hu|>": 50286,
"<|hy|>": 50312,
"<|id|>": 50275,
"<|is|>": 50311,
"<|it|>": 50274,
"<|ja|>": 50266,
"<|jw|>": 50356,
"<|ka|>": 50329,
"<|kk|>": 50316,
"<|km|>": 50323,
"<|kn|>": 50306,
"<|ko|>": 50264,
"<|la|>": 50294,
"<|lb|>": 50345,
"<|ln|>": 50353,
"<|lo|>": 50336,
"<|lt|>": 50293,
"<|lv|>": 50301,
"<|mg|>": 50349,
"<|mi|>": 50295,
"<|mk|>": 50308,
"<|ml|>": 50296,
"<|mn|>": 50314,
"<|mr|>": 50320,
"<|ms|>": 50282,
"<|mt|>": 50343,
"<|my|>": 50346,
"<|ne|>": 50313,
"<|nl|>": 50271,
"<|nn|>": 50342,
"<|no|>": 50288,
"<|oc|>": 50328,
"<|pa|>": 50321,
"<|pl|>": 50269,
"<|ps|>": 50340,
"<|pt|>": 50267,
"<|ro|>": 50284,
"<|ru|>": 50263,
"<|sa|>": 50344,
"<|sd|>": 50332,
"<|si|>": 50322,
"<|sk|>": 50298,
"<|sl|>": 50305,
"<|sn|>": 50324,
"<|so|>": 50326,
"<|sq|>": 50317,
"<|sr|>": 50303,
"<|su|>": 50357,
"<|sv|>": 50273,
"<|sw|>": 50318,
"<|ta|>": 50287,
"<|te|>": 50299,
"<|tg|>": 50331,
"<|th|>": 50289,
"<|tk|>": 50341,
"<|tl|>": 50348,
"<|tr|>": 50268,
"<|tt|>": 50351,
"<|uk|>": 50280,
"<|ur|>": 50290,
"<|uz|>": 50337,
"<|vi|>": 50278,
"<|yi|>": 50335,
"<|yo|>": 50325,
"<|zh|>": 50260
},
"language": "arabic",
"max_initial_timestamp_index": 50,
"max_length": 448,
"no_timestamps_token_id": 50363,
"pad_token_id": 50257,
"prev_sot_token_id": 50361,
"return_timestamps": false,
"suppress_tokens": [
1,
2,
7,
8,
9,
10,
14,
25,
26,
27,
28,
29,
31,
58,
59,
60,
61,
62,
63,
90,
91,
92,
93,
359,
503,
522,
542,
873,
893,
902,
918,
922,
931,
1350,
1853,
1982,
2460,
2627,
3246,
3253,
3268,
3536,
3846,
3961,
4183,
4667,
6585,
6647,
7273,
9061,
9383,
10428,
10929,
11938,
12033,
12331,
12562,
13793,
14157,
14635,
15265,
15618,
16553,
16604,
18362,
18956,
20075,
21675,
22520,
26130,
26161,
26435,
28279,
29464,
31650,
32302,
32470,
36865,
42863,
47425,
49870,
50254,
50258,
50360,
50361,
50362
],
"task": "transcribe",
"task_to_id": {
"transcribe": 50359,
"translate": 50358
},
"transformers_version": "4.49.0"
}

50001
merges.txt Normal file

File diff suppressed because it is too large Load Diff

3
model.safetensors Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3752a83e6dae80c5c09fd5bfc4c76fdf384f8f0109a5033e35a0170999f69c15
size 966995080

1742
normalizer.json Normal file

File diff suppressed because it is too large Load Diff

14
preprocessor_config.json Normal file
View File

@@ -0,0 +1,14 @@
{
"chunk_length": 30,
"feature_extractor_type": "WhisperFeatureExtractor",
"feature_size": 80,
"hop_length": 160,
"n_fft": 400,
"n_samples": 480000,
"nb_max_frames": 3000,
"padding_side": "right",
"padding_value": 0.0,
"processor_class": "WhisperProcessor",
"return_attention_mask": false,
"sampling_rate": 16000
}

139
special_tokens_map.json Normal file
View File

@@ -0,0 +1,139 @@
{
"additional_special_tokens": [
"<|endoftext|>",
"<|startoftranscript|>",
"<|en|>",
"<|zh|>",
"<|de|>",
"<|es|>",
"<|ru|>",
"<|ko|>",
"<|fr|>",
"<|ja|>",
"<|pt|>",
"<|tr|>",
"<|pl|>",
"<|ca|>",
"<|nl|>",
"<|ar|>",
"<|sv|>",
"<|it|>",
"<|id|>",
"<|hi|>",
"<|fi|>",
"<|vi|>",
"<|he|>",
"<|uk|>",
"<|el|>",
"<|ms|>",
"<|cs|>",
"<|ro|>",
"<|da|>",
"<|hu|>",
"<|ta|>",
"<|no|>",
"<|th|>",
"<|ur|>",
"<|hr|>",
"<|bg|>",
"<|lt|>",
"<|la|>",
"<|mi|>",
"<|ml|>",
"<|cy|>",
"<|sk|>",
"<|te|>",
"<|fa|>",
"<|lv|>",
"<|bn|>",
"<|sr|>",
"<|az|>",
"<|sl|>",
"<|kn|>",
"<|et|>",
"<|mk|>",
"<|br|>",
"<|eu|>",
"<|is|>",
"<|hy|>",
"<|ne|>",
"<|mn|>",
"<|bs|>",
"<|kk|>",
"<|sq|>",
"<|sw|>",
"<|gl|>",
"<|mr|>",
"<|pa|>",
"<|si|>",
"<|km|>",
"<|sn|>",
"<|yo|>",
"<|so|>",
"<|af|>",
"<|oc|>",
"<|ka|>",
"<|be|>",
"<|tg|>",
"<|sd|>",
"<|gu|>",
"<|am|>",
"<|yi|>",
"<|lo|>",
"<|uz|>",
"<|fo|>",
"<|ht|>",
"<|ps|>",
"<|tk|>",
"<|nn|>",
"<|mt|>",
"<|sa|>",
"<|lb|>",
"<|my|>",
"<|bo|>",
"<|tl|>",
"<|mg|>",
"<|as|>",
"<|tt|>",
"<|haw|>",
"<|ln|>",
"<|ha|>",
"<|ba|>",
"<|jw|>",
"<|su|>",
"<|translate|>",
"<|transcribe|>",
"<|startoflm|>",
"<|startofprev|>",
"<|nocaptions|>",
"<|notimestamps|>"
],
"bos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

12990
tokenizer_config.json Normal file

File diff suppressed because it is too large Load Diff

50260
vocab.json Normal file

File diff suppressed because it is too large Load Diff