---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- CTC
- Wav2vec2
datasets:
- common_voice
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
metrics:
- wer
model-index:
- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 11.44
    - name: Test WER (+LM)
      type: wer
      value: 9.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      args: french
    metrics:
    - name: Test WER
      type: wer
      value: 5.93
    - name: Test WER (+LM)
      type: wer
      value: 5.13
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 9.33
    - name: Test WER (+LM)
      type: wer
      value: 8.51
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.22
    - name: Test WER (+LM)
      type: wer
      value: 15.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.56
    - name: Test WER (+LM)
      type: wer
      value: 12.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      args: fr_fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.10
    - name: Test WER (+LM)
      type: wer
      value: 8.84
---

# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

<style>
img {
 display: inline;
}
</style>

This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset of over 2,200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.

## Usage

1. To use on a local audio file with the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)  # shape: (channels, frames)
waveform = waveform.mean(dim=0)  # average the channels to get a mono signal

# resample to the rate the model expects
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with the language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```

2. To use on a local audio file without the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)  # shape: (channels, frames)
waveform = waveform.mean(dim=0)  # average the channels to get a mono signal

# resample to the rate the model expects
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode greedily: pick the most likely token for each frame
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```

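The greedy path in the second example (`torch.argmax` followed by `processor.batch_decode`) relies on the usual CTC collapsing rule: merge consecutive repeated ids, then drop the blank token. The sketch below illustrates that rule in plain Python. The toy vocabulary, the blank id of 0, and the helper name `ctc_greedy_collapse` are assumptions made for illustration and are not taken from this model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, vocab, blank_id=0):
    """Collapse frame-level ids into text using the CTC rule."""
    # Step 1: merge runs of identical consecutive ids.
    merged = [token_id for token_id, _ in groupby(ids)]
    # Step 2: drop the blank id and map the remaining ids to characters.
    return "".join(vocab[i] for i in merged if i != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "", 1: "o", 2: "u", 3: "i", 4: " "}

# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 3, 0, 0, 4, 1, 0, 1]

decoded = ctc_greedy_collapse(frame_ids, toy_vocab)
print(decoded)
```

Note that the blank between the last two `o` frames is what keeps them as two separate characters; without it, the repeated ids would be merged into one.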
## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_11_0`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
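
The `--chunk_length_s 30.0` and `--stride_length_s 5.0` flags suggest that long recordings are split into overlapping windows before inference. The sketch below shows one common way to lay out such windows, assuming the stride is applied on both sides of each chunk. The function name `chunk_windows` is hypothetical, and this is not the code used by `eval.py`.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) windows in seconds that cover [0, total_s]."""
    # Each new window advances by the chunk length minus both strides.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")

    windows = []
    start = 0.0
    while True:
        # Clip the last window to the end of the recording.
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording with the flag values from the command above.
windows = chunk_windows(70.0)
print(windows)
```

With these values, adjacent windows overlap by 10 seconds, so audio near a window boundary is always seen with context on both sides.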
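
The WER figures reported for this model are word error rates: the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. As a minimal illustration, assuming whitespace tokenization and no text normalization (the evaluation script may normalize text differently), it can be computed as follows. The function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("dort" -> "dors") and one deletion ("le") over 6 words.
wer = word_error_rate("le chat dort sur le canapé", "le chat dors sur canapé")
print(f"{100 * wer:.2f}")
```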