---
language: vi
datasets:
- vivos
- common_voice
- FOSD
- VLSP
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech
- Transformer
- wav2vec2
- automatic-speech-recognition
- vietnamese
license: cc-by-nc-4.0
widget:
- example_title: common_voice_vi_30519758.mp3
  src: https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h/raw/main/examples/common_voice_vi_30519758.mp3
- example_title: VIVOSDEV15_020.wav
  src: https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h/raw/main/examples/VIVOSDEV15_020.wav
model-index:
- name: Wav2vec2 Base Vietnamese 160h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 10.78
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 15.05
---


Vietnamese Speech Recognition using Wav2vec 2.0

Table of contents

  1. Model Description
  2. Implementation
  3. Benchmark Result
  4. Example Usage
  5. Evaluation
  6. Citation
  7. Contact

Model Description

We fine-tuned a Wav2vec2 base model on about 160 hours of Vietnamese speech drawn from several sources: VIVOS, Common Voice, FOSD, and VLSP 100h. We have not yet incorporated a language model into the ASR system, but the results are already promising.

Implementation

We also provide code for pre-training and fine-tuning the Wav2vec2 model. If you wish to train on your own dataset, check it out here:

Benchmark WER Result

              VIVOS          COMMON VOICE 8.0
without LM    15.05          10.78
with LM       in progress    in progress
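
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the predicted transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that definition in plain Python, not the implementation used to produce the numbers above; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences for illustration, not taken from the test sets.
reference = "xin chào các bạn"
hypothesis = "xin chao các"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")  # WER: 50.00%
```

Here one word is substituted ("chào" became "chao") and one is deleted ("bạn"), giving 2 errors over 4 reference words. Because insertions count as errors, WER can exceed 100%.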

Example Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)

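def transcribe(wav):
  # wav: 1-D float array of audio samples at 16 kHz
  input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
  with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(input_values.to(device)).logits
  pred_ids = torch.argmax(logits, dim=-1)  # greedy decoding: most likely token per frame
  pred_transcript = processor.batch_decode(pred_ids)[0]
  return pred_transcript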


wav, _ = librosa.load('path/to/your/audio/file', sr = 16000)
print(f"transcript: {transcribe(wav)}")
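
The `argmax` followed by `batch_decode` in the snippet above amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea on a toy input. It is not the tokenizer's actual code; the function `ctc_greedy_decode`, the vocabulary, and the frame ids are all hypothetical, and it assumes the usual Wav2Vec2 convention of the padding token acting as the CTC blank and `|` as the word delimiter.

```python
from itertools import groupby


def ctc_greedy_decode(pred_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse a sequence of per-frame token ids into text (greedy CTC)."""
    # 1) Merge runs of identical consecutive ids (one token spans many frames).
    collapsed = [token_id for token_id, _ in groupby(pred_ids)]
    # 2) Drop the blank token, which only separates genuine repeats.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3) Turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real model's vocabulary and ids differ.
vocab = {0: "<pad>", 1: "|", 2: "x", 3: "i", 4: "n", 5: "c", 6: "h", 7: "à", 8: "o"}

# Hypothetical per-frame argmax output.
frame_ids = [2, 2, 0, 3, 4, 4, 0, 1, 1, 5, 6, 0, 7, 7, 8, 0]
print(ctc_greedy_decode(frame_ids, vocab))  # xin chào
```

The blank is what lets the model emit a doubled letter: `[2, 0, 2]` decodes to "xx", whereas `[2, 2, 2]` collapses to a single "x".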

Evaluation

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio
import torch
import re

# note: load_metric is deprecated in recent datasets releases; evaluate.load("wer") is its replacement
wer = load_metric("wer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# Load dataset
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'<EFBFBD>]' # ignore special characters

# preprocess data
def preprocess(batch):
  audio = batch["audio"]
  batch["input_values"] = audio["array"]
  batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
  return batch

# run inference
def inference(batch):
  input_values = processor(batch["input_values"],
                            sampling_rate=16000,
                            return_tensors="pt").input_values
  with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(input_values.to(device)).logits
  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_transcript"] = processor.batch_decode(pred_ids)
  return batch
  
test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))

Test Result: 10.78%

Citation

BibTeX

@misc{Duy_Khanh_Finetune_Wav2vec_2_0_2022,
  author = {Duy Khanh, Le},
  doi = {10.5281/zenodo.6542357},
  license = {CC-BY-NC-4.0},
  month = {5},
  title = {{Finetune Wav2vec 2.0 For Vietnamese Speech Recognition}},
  url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
  year = {2022}
}

APA

Duy Khanh, L. (2022). Finetune Wav2vec 2.0 For Vietnamese Speech Recognition [Data set]. https://doi.org/10.5281/zenodo.6542357

Contact
