Initialize project; model provided by the ModelHub XC community

Model: scottykwok/wav2vec2-large-xlsr-cantonese Source: Original Platform
2026-05-27 04:48:16 +08:00
commit f180d4ae3d
13 changed files with 1709 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,53 @@
+---
+language: zh
+tags:
+- automatic-speech-recognition
+license: cc-by-sa-4.0
+datasets:
+- common_voice
+metrics:
+- cer
+---
+
+# Wav2vec2-large-xlsr-cantonese
+This model is based on [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53), fine-tuned on Common Voice zh-HK 6.1.0.
+
+The training code is similar to that of [user ctl](https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese), except that the number of training epochs was doubled to 80 and `fp16_backend` was set to `apex`. The model was trained on a single RTX 3090 using the Docker image `nvidia/cuda:11.1-cudnn8-devel`.
+
+The CER is 15.11% when evaluated against the Common Voice zh-HK test set.
+
+# Result (CER)
+15.11% 
+
+# Source Code
+See this GitHub Repo [cantonese-selfish-project](https://github.com/scottykwok/cantonese-selfish-project/) and [demo video](https://youtu.be/k_9RQ-ilGEc).
+
+# Usage
+```python
+import soundfile as sf
+import torch
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+# load pretrained model
+processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
+model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
+
+# load audio - must be 16kHz mono
+audio_input, sample_rate = sf.read('audio.wav')
+
+# pad input values and return pt tensor
+input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values
+
+# inference: retrieve logits (no gradients needed); argmax follows
+with torch.no_grad():
+    logits = model(input_values).logits
+predicted_ids = torch.argmax(logits, dim=-1)
+
+# transcribe
+transcription = processor.decode(predicted_ids[0])
+print("-" * 20)
+print("Transcription:\n", transcription.lower())
+print("-" * 20)
+
+```