---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- CTC
- Wav2vec2
datasets:
- common_voice
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
metrics:
- wer
model-index:
- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 11.44
    - name: Test WER (+LM)
      type: wer
      value: 9.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      args: french
    metrics:
    - name: Test WER
      type: wer
      value: 5.93
    - name: Test WER (+LM)
      type: wer
      value: 5.13
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 9.33
    - name: Test WER (+LM)
      type: wer
      value: 8.51
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.22
    - name: Test WER (+LM)
      type: wer
      value: 15.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.56
    - name: Test WER (+LM)
      type: wer
      value: 12.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      args: fr_fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.10
    - name: Test WER (+LM)
      type: wer
      value: 8.84
---

# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

<style>
img {
 display: inline;
}
</style>

![Model architecture](https://img.shields.io/badge/Model_Architecture-Wav2Vec2--CTC-lightgrey)
![Model size](https://img.shields.io/badge/Params-315M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset comprising over 2200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.
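A quick way to see whether a file already meets the 16 kHz requirement is to read the sample rate from its header. The sketch below uses only the Python standard library and assumes an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of this repository.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected: int = EXPECTED_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected
```

For example, `needs_resampling("example.wav")` returns `True` for a 44.1 kHz recording, in which case the audio should be resampled before it is passed to the processor.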

## Usage

1. To use on a local audio file with the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix (channels, frames) to mono

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with the language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```

2. To use on a local audio file without the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix (channels, frames) to mono

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```
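To make the decoding step without a language model more concrete, here is a minimal pure-Python sketch of greedy CTC collapsing: merge consecutive repeated ids, drop the blank token, then map ids to characters. The toy vocabulary, the blank id, and the `|` word delimiter are assumptions chosen for illustration; the real values come from the tokenizer shipped with the model, and `ctc_greedy_collapse` is a hypothetical helper, not the library's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "b", 3: "o", 4: "n", 5: "j", 6: "u", 7: "r"}
BLANK_ID = 0          # assumption: the padding token doubles as the CTC blank
WORD_DELIMITER = "|"  # assumption: symbol marking word boundaries


def ctc_greedy_collapse(ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into a string, CTC style."""
    # 1) merge runs of identical consecutive ids
    merged = [token_id for token_id, _ in groupby(ids)]
    # 2) drop blanks, 3) map the remaining ids to characters
    chars = [vocab[token_id] for token_id in merged if token_id != blank_id]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# One id per audio frame, as produced by an argmax over the logits
frame_ids = [2, 2, 0, 3, 3, 4, 0, 0, 5, 3, 6, 6, 7, 1, 1]
print(ctc_greedy_collapse(frame_ids))  # prints "bonjour"
```

The blank token is what allows genuinely doubled letters: two identical ids separated by a blank survive the merge step, while two adjacent identical ids are treated as one.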

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_11_0`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundation_common_voice_11_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
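The metric reported for these evaluations is word error rate (WER): the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a self-contained illustration of that definition, expressed as a percentage. It is not the code used by `eval.py`, whose internals are not shown here, and it applies no text normalization, so its output may differ from the reported figures; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between two whitespace-tokenized strings."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference words processed
    # so far and the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return 100.0 * previous_row[-1] / len(ref_words)


# One substituted word out of four reference words
print(word_error_rate("le chat dort ici", "le chien dort ici"))  # prints 25.0
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.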