ModelHub XC ddba1294a7 Initialize project; model provided by the ModelHub XC community
Model: bofenghuang/asr-wav2vec2-ctc-french
Source: Original Platform
2026-05-21 11:36:18 +08:00

| Field | Value |
| --- | --- |
| license | apache-2.0 |
| language | fr |
| library_name | transformers |
| thumbnail | null |
| tags | automatic-speech-recognition, hf-asr-leaderboard, robust-speech-event, CTC, Wav2vec2 |
| datasets | common_voice, mozilla-foundation/common_voice_11_0, facebook/multilingual_librispeech, facebook/voxpopuli, gigant/african_accented_french |
| metrics | wer |

model-index: Fine-tuned wav2vec2-FR-7K-large model for ASR in French (task for every entry: Automatic Speech Recognition, type automatic-speech-recognition)

| Dataset | Type | Args | Test WER | Test WER (+LM) |
| --- | --- | --- | --- | --- |
| Common Voice 11.0 | mozilla-foundation/common_voice_11_0 | fr | 11.44 | 9.66 |
| Multilingual LibriSpeech (MLS) | facebook/multilingual_librispeech | french | 5.93 | 5.13 |
| VoxPopuli | facebook/voxpopuli | fr | 9.33 | 8.51 |
| African Accented French | gigant/african_accented_french | fr | 16.22 | 15.39 |
| Robust Speech Event - Dev Data | speech-recognition-community-v2/dev_data | fr | 16.56 | 12.96 |
| Fleurs | google/fleurs | fr_fr | 10.10 | 8.84 |
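
For readers unfamiliar with the metric, the sketch below shows how a word error rate is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is a minimal illustration only. The function name `word_error_rate` is hypothetical, the scores above are presumably percentages (this value times 100), and the text normalization applied before scoring is not described here.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("chat" -> "chien") out of four reference words.
    wer = word_error_rate("le chat est noir", "le chien est noir")
    print(f"{100 * wer:.2f}")
```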

Fine-tuned wav2vec2-FR-7K-large model for ASR in French

This model is a fine-tuned version of LeBenchmark/wav2vec2-FR-7K-large, trained on a composite dataset of more than 2,200 hours of French speech. The training data combines the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that your speech input is also sampled at 16 kHz.
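
As a quick way to verify the 16 kHz requirement before running inference, the sketch below reads the sample rate from a WAV header using only the Python standard library. The helper names (`wav_sample_rate`, `needs_resampling`, `write_test_tone`) are hypothetical and not part of this repository, and the check assumes an uncompressed PCM WAV file.

```python
import math
import os
import struct
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a PCM WAV header without decoding the audio."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


def write_test_tone(path: str, sample_rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit sine tone so the check can be tried without real data."""
    n_frames = int(sample_rate * seconds)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate))
        for n in range(n_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        for rate in (16_000, 44_100):
            path = os.path.join(tmp_dir, f"tone_{rate}.wav")
            write_test_tone(path, rate)
            print(rate, "needs resampling:", needs_resampling(path))
```

Files that fail the check can be resampled as in the usage snippets below.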

Usage

  1. To use on a local audio file with the language model
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono; leaves single-channel audio unchanged

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with the language model (beam search over the logits)
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
  2. To use on a local audio file without the language model
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono; leaves single-channel audio unchanged

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
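
To make the decoding step without the language model more concrete, here is a minimal pure-Python sketch of greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The function name `ctc_greedy_collapse`, the toy vocabulary, and `blank_id=0` are assumptions for illustration; the real vocabulary and blank id come from the model's tokenizer, which also handles word delimiters.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse per-frame argmax ids: merge consecutive repeats, then drop blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


if __name__ == "__main__":
    # Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
    vocab = {0: "<pad>", 1: "o", 2: "u", 3: "i"}
    # Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one utterance.
    frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
    token_ids = ctc_greedy_collapse(frames)
    print("".join(vocab[i] for i in token_ids))
```

Note that a blank between two identical ids keeps them separate, which is how CTC represents doubled letters.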

Evaluation

  1. To evaluate on mozilla-foundation/common_voice_11_0
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundation_common_voice_11_0_with_lm"
  2. To evaluate on speech-recognition-community-v2/dev_data
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
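
The `--chunk_length_s` and `--stride_length_s` flags suggest that long recordings are transcribed in overlapping windows. `eval.py` is not shown here, so the sketch below is only an assumption about the general scheme, modeled on chunked inference in the transformers ASR pipeline: each window spans `chunk_length_s` seconds, and the `stride_length_s` seconds at each interior edge serve as context whose output is discarded. The function name `chunk_spans` is hypothetical.

```python
from typing import List, Tuple


def chunk_spans(
    total_s: float, chunk_length_s: float = 30.0, stride_length_s: float = 5.0
) -> List[Tuple[float, float, float, float]]:
    """Return (start, end, keep_from, keep_to) in seconds for each chunk.

    Each chunk covers up to chunk_length_s seconds. The stride on each interior
    side is context only, so only [keep_from, keep_to] contributes to the output.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")

    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # The first chunk has no left context to drop; the last has no right context.
        keep_from = start if is_first else start + stride_length_s
        keep_to = end if is_last else end - stride_length_s
        spans.append((start, end, keep_from, keep_to))
        if is_last:
            break
        start += step
    return spans


if __name__ == "__main__":
    for span in chunk_spans(70.0):
        print(span)
```

With the values used in the command above, consecutive windows start 20 seconds apart, and the kept regions tile the recording without gaps or overlap.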