library_name: transformers
license: mit
language: it
metrics: per
tags: audio, automatic-speech-recognition, speech, phonemize, phoneme
datasets: facebook/multilingual_librispeech
model-index:
  - name: Wav2Vec2-base Italian finetuned for phonemes by LMSSC
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Multilingual Librispeech
          type: facebook/multilingual_librispeech
          args: it
        metrics:
          - type: per
            value: 4.34
            name: Test PER on Multilingual Librispeech IT | Trained
          - type: per
            value: 4.25
            name: Val PER on Multilingual Librispeech IT | Trained

Fine-tuned Italian Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in Italian

Fine-tuned facebook/wav2vec2-base-it-voxpopuli-v2 for Italian speech-to-phoneme (without language model) using the train and validation splits of Multilingual Librispeech.

Audio sample rate for usage

When using this model, make sure that your speech input is sampled at 16 kHz.
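If a recording uses another sample rate, it has to be resampled first. The following is a minimal sketch of one way to do that, assuming SciPy is installed; the helper names `resample_ratio` and `resample_to_16k` are hypothetical and not part of this model's tooling.

```python
from math import gcd

TARGET_SAMPLE_RATE = 16_000  # sample rate expected by the model


def resample_ratio(orig_sr, target_sr=TARGET_SAMPLE_RATE):
    """Return the smallest integer (up, down) factors such that orig_sr * up / down == target_sr."""
    divisor = gcd(orig_sr, target_sr)
    return target_sr // divisor, orig_sr // divisor


def resample_to_16k(audio, orig_sr):
    """Resample a 1-D audio array to 16 kHz with SciPy's polyphase resampler."""
    # Imported lazily so that resample_ratio can be used without SciPy installed.
    from scipy.signal import resample_poly

    up, down = resample_ratio(orig_sr)
    return resample_poly(audio, up, down)
```

For a 44.1 kHz recording, `resample_ratio(44_100)` gives the factors `(160, 441)`. Libraries such as librosa or torchaudio offer equivalent resampling functions.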

Output

As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of IPA-encoded words, without punctuation. If you don't read the phonetic alphabet fluently, you can use this excellent IPA reader website to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.

Training procedure

The model was fine-tuned on Multilingual Librispeech (IT) for 30 epochs on one ADA_6000 GPU at Cnam/LMSSC, using a DDP strategy and gradient accumulation (256 audio samples per update, corresponding to roughly 25 minutes of speech per update, which gives 2k updates per epoch).

  • Learning rate schedule: Double Tri-state schedule

    • Warmup from 1e-5 for 7% of total updates
    • Constant at 1e-4 for 28% of total updates
    • Linear decrease to 1e-6 for 36% of total updates
    • Second warmup boost to 3e-5 for 3% of total updates
    • Constant at 3e-5 for 12% of total updates
    • Linear decrease to 1e-7 for remaining 14% of updates
  • The hyperparameters used for training are the same as those detailed in Annex B and Table 6 of the wav2vec2 paper.
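The learning rate schedule listed above can be sketched as a piecewise-linear function of training progress. This is an illustration only, not the authors' training code: it assumes that the first warmup ends at the 1e-4 plateau, that the second warmup starts from the 1e-6 reached at the end of the first decay, and that every ramp is linear. The name `double_tri_state_lr` is hypothetical.

```python
START_LR = 1e-5  # learning rate at the very first update

# (fraction of total updates, learning rate reached at the end of the phase)
PHASES = [
    (0.07, 1e-4),  # warmup from 1e-5 (assumed to reach the 1e-4 plateau)
    (0.28, 1e-4),  # constant at 1e-4
    (0.36, 1e-6),  # linear decrease to 1e-6
    (0.03, 3e-5),  # second warmup boost to 3e-5 (assumed to start from 1e-6)
    (0.12, 3e-5),  # constant at 3e-5
    (0.14, 1e-7),  # linear decrease to 1e-7
]


def double_tri_state_lr(step, total_steps):
    """Return the learning rate for a given update index (0 <= step <= total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    phase_start, lr_start = 0.0, START_LR
    for fraction, lr_end in PHASES:
        phase_end = phase_start + fraction
        if progress <= phase_end:
            # Linear interpolation inside the current phase.
            t = (progress - phase_start) / fraction
            return lr_start + t * (lr_end - lr_start)
        phase_start, lr_start = phase_end, lr_end
    # Only reached if floating-point rounding leaves the last boundary just below 1.0.
    return PHASES[-1][1]
```

In PyTorch, a function like this could be passed to `torch.optim.lr_scheduler.LambdaLR` with the optimizer's base learning rate set to 1.0, so that the returned value is used directly as the learning rate. Whether the original training was set up this way is not stated.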

Usage (using the online Inference API)

Record your voice with the Inference API on this webpage, then click on "Compute". That's all!

Usage (with HuggingSound library)

The model can be used directly with the HuggingSound library:

import pandas as pd
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("Cnam-LMSSC/wav2vec2-italian-phonemizer")
audio_paths = ["./test_rilettura_testo.wav", "./10179_11051_000021.flac"]

# The audio files do not need to be sampled at 16 kHz here:
# they are automatically resampled by HuggingSound

transcriptions = model.transcribe(audio_paths)

# (Optional) Display the results in a table.
# transcriptions is a list of dicts that also contain timestamps and probabilities.

df = pd.DataFrame(transcriptions)
df['Audio file'] = audio_paths
df.set_index('Audio file', inplace=True)
df[['transcription']]

Output:

| Audio file | Phonetic transcription (IPA) |
|---|---|
| ./test_rilettura_testo.wav | prezɪ lɪ kwatːrotʃɛnto fjorinɪ d̪iː ɔro e reze le debite ɡratsje al pretore sɪ parti e messɔzɪ al merkatantare divɛnne wɔmo sadʒːo e dɪ ɡran manedʒːo |
| ./10179_11051_000021.flac | la bʊɔna femina ke ɛra fʊdʒːita il tutːo vedɛva e molto sʊspeza restava e parevale ʊn ora mille annɪ dɪ fʊrarla e dɪ potɛr operare tal effɛtːo |

Inference script (if you do not want to use the HuggingSound library):

import torch
import soundfile as sf  # or librosa, if you prefer
from transformers import AutoModelForCTC, Wav2Vec2Processor

MODEL_ID = "Cnam-LMSSC/wav2vec2-italian-phonemizer"

model = AutoModelForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# sf.read returns the samples and the file's sample rate.
# Make sure the file is mono and sampled at 16 kHz, or resample it first!
audio, sample_rate = sf.read('example.wav')

# Passing the file's real sample rate lets the processor reject a mismatching rate.
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: most likely token per frame, then collapse into a string
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription : ", transcription)

Output:

'ˈsoːno ˈmolto ˈljɛːto di prezenˈtarvi la ˈnɔstra soluˈttsjone per fonemiˈddzaːre fatʃilˈmente ʎi ˈawdjo funˈtsjoːna davˈveːro ˈmolto ˈːne'
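In the inference script, the `torch.argmax` call followed by `processor.batch_decode` performs greedy CTC decoding: the most likely token is picked for each audio frame, consecutive repeats are merged, and blank (padding) tokens are dropped. The sketch below shows that collapsing step on a toy example. The vocabulary and the function name `greedy_ctc_decode` are hypothetical, and the real tokenizer additionally handles word delimiters and other special tokens.

```python
from itertools import groupby

# Toy vocabulary for illustration only; index 0 plays the role of the CTC blank token.
TOY_VOCAB = ["<pad>", "s", "o", "n", "t"]


def greedy_ctc_decode(frame_ids, id_to_symbol=TOY_VOCAB, blank_id=0):
    """Collapse per-frame token ids into a string, CTC style."""
    # 1. Merge runs of identical consecutive ids.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks, then map the remaining ids to symbols.
    return "".join(id_to_symbol[i] for i in collapsed if i != blank_id)


print(greedy_ctc_decode([1, 1, 0, 2, 2, 0, 3, 3, 3, 0, 2]))  # sono
```

A blank between two identical ids keeps them as separate symbols, which is how a repeated symbol can survive the merging step.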

Test Results:

In the table below, we report the Phoneme Error Rate (PER) of the model on Multilingual Librispeech (using the Italian configuration of the dataset):

| Model | Test Set | PER |
|---|---|---|
| Cnam-LMSSC/wav2vec2-italian-phonemizer | Multilingual Librispeech (Italian) | 4.34% |
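This card does not describe how the PER was computed. A common definition is the edit distance (substitutions, deletions, insertions) between the predicted and reference phoneme sequences, divided by the length of the reference. The sketch below follows that definition and, as a simplifying assumption, treats every non-space character as one phoneme symbol; a real evaluation may tokenize multi-character phonemes and length marks differently. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(hyp) + 1))
    for i, ref_symbol in enumerate(ref, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_symbol == hyp_symbol else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def phoneme_error_rate(reference, hypothesis):
    """PER = edit distance / number of reference symbols (spaces ignored)."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one symbol")
    return edit_distance(ref, hyp) / len(ref)
```

With this definition, a single wrong symbol in a five-symbol reference gives a PER of 0.2, i.e. 20%.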

Citation

If you use this fine-tuned model in a publication, please cite our work as follows:

@misc {lmssc-wav2vec2-base-phonemizer-italian_2026,
	author       = { Olivier, Malo },
	title        = { wav2vec2-italian-phonemizer (Revision 4d8a3a1) },
	year         = 2026,
	url          = { https://huggingface.co/Cnam-LMSSC/wav2vec2-italian-phonemizer },
	doi          = { 10.57967/hf/7982 },
	publisher    = { Hugging Face }
}