ModelHub XC 1e68ad8aa0 Initialize project; model provided by the ModelHub XC community
Model: bofenghuang/whisper-small-cv11-french
Source: Original Platform
2026-05-16 10:09:51 +08:00

license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Fine-tuned whisper-small model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 11.76
    - name: WER (Beam 5)
      type: wer
      value: 10.99
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args: french
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 9.65
    - name: WER (Beam 5)
      type: wer
      value: 8.91
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 14.45
    - name: WER (Beam 5)
      type: wer
      value: 13.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 10.76
    - name: WER (Beam 5)
      type: wer
      value: 9.83
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 10.81
    - name: WER (Beam 5)
      type: wer
      value: 9.26

Fine-tuned whisper-small model for ASR in French

This model is a fine-tuned version of openai/whisper-small, trained on the French (fr) subset of the mozilla-foundation/common_voice_11_0 dataset. When using the model, make sure that your speech input is also sampled at 16 kHz. The model also predicts casing and punctuation.
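
The 16 kHz requirement can be checked before running inference. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file that Python's standard `wave` module can read; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model's API.

```python
import wave

# Sampling rate stated in the model card.
EXPECTED_SAMPLE_RATE = 16_000


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected: int = EXPECTED_SAMPLE_RATE) -> bool:
    """Return True when the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected
```

If `needs_resampling` returns `True`, the audio should be resampled to 16 kHz before it is passed to the model.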

Performance

Below are the word error rates (WERs, in %) of the pre-trained models on Common Voice 9.0, Multilingual LibriSpeech (MLS), VoxPopuli, and Fleurs. These results are reported in the original Whisper paper.

| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
| --- | --- | --- | --- | --- |
| openai/whisper-small | 22.7 | 16.2 | 15.7 | 15.0 |
| openai/whisper-medium | 16.0 | 8.9 | 12.2 | 8.7 |
| openai/whisper-large | 14.7 | 8.9 | 11.0 | 7.7 |
| openai/whisper-large-v2 | 13.9 | 7.3 | 11.4 | 8.3 |

Below are the WERs (in %) of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech (MLS), VoxPopuli, and Fleurs. Note that these evaluation datasets have been filtered and preprocessed so that they contain only French alphabet characters, with all punctuation except the apostrophe removed. Each cell in the table reports WER (greedy search) / WER (beam search with beam width 5).

| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
| --- | --- | --- | --- | --- |
| bofenghuang/whisper-small-cv11-french | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| bofenghuang/whisper-medium-cv11-french | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| bofenghuang/whisper-medium-french | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| bofenghuang/whisper-large-v2-cv11-french | 8.05 / 7.67 | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| bofenghuang/whisper-large-v2-french | 8.15 / 7.83 | 4.20 / 4.03 | 9.10 / 8.66 | 5.22 / 4.98 |
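
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below illustrates that definition for a single sentence pair using a word-level edit distance. It is not the scoring code used for the table (the model card does not name the tool), and the function name `word_error_rate` is hypothetical. It assumes both strings are already normalised and whitespace-tokenised.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER (in %) of `hypothesis` against `reference`."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1       # reference word missing from hypothesis
            insertion = current_row[j - 1] + 1   # extra word in hypothesis
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return 100.0 * previous_row[-1] / len(ref_words)
```

Corpus-level scores are usually computed by summing the edit counts and the reference lengths over all utterances before dividing, rather than by averaging per-sentence values.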

Usage

Inference with 🤗 Pipeline

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-small-cv11-french", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
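
Because the model predicts casing and punctuation, its output may need normalising before it is compared with references that lack them. The following is a minimal sketch of such a step. The exact rules are assumptions: lowercasing, unifying typographic apostrophes, and replacing punctuation other than the apostrophe with spaces. The function name `normalise_sentence` is hypothetical, and this is not the author's preprocessing script.

```python
import re
import unicodedata


def normalise_sentence(text: str) -> str:
    """Lowercase `text` and strip punctuation other than the apostrophe."""
    # Assumption: references are lowercase, so fold case here.
    text = unicodedata.normalize("NFKC", text).lower()
    # Map the typographic apostrophe to the ASCII one.
    text = text.replace("\u2019", "'")
    # Replace anything that is not a letter, digit, whitespace, or apostrophe
    # with a space (the underscore is also replaced).
    text = re.sub(r"[^\w\s']|_", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())
```

Note that this sketch splits hyphenated words ("est-ce" becomes "est ce") and keeps digits; adjust both choices to match the conventions of the reference transcripts.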

Inference with 🤗 low-level APIs

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-small-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-small-cv11-french", language="french", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary