This model is a HuBERT Base model fine-tuned on ReazonSpeech v2.0, a large-scale Japanese ASR corpus, using the k2 framework.
Usage
You can use this model through the transformers library:
    import librosa
    import numpy as np
    import torch
    from transformers import AutoProcessor, HubertForCTC

    # Load the model in bfloat16 with FlashAttention 2 on the GPU
    model = HubertForCTC.from_pretrained(
        "reazon-research/japanese-hubert-base-k2-rs35kh",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    ).to("cuda")
    processor = AutoProcessor.from_pretrained("reazon-research/japanese-hubert-base-k2-rs35kh")

    # audio_filepath is the path to your audio file; it is resampled to 16 kHz on load
    audio, _ = librosa.load(audio_filepath, sr=16_000)
    # Padding the audio with 0.5 s of silence on each side before inference is recommended
    audio = np.pad(audio, pad_width=int(0.5 * 16_000))
    input_values = processor(
        audio,
        return_tensors="pt",
        sampling_rate=16_000,
    ).input_values.to("cuda").to(torch.bfloat16)

    with torch.inference_mode():
        logits = model(input_values).logits.cpu()

    # Greedy CTC decoding: pick the most likely token per frame, then decode
    predicted_ids = torch.argmax(logits, dim=-1)[0]
    transcription = processor.decode(predicted_ids, skip_special_tokens=True)
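To make the padding step and the final decoding step in the usage code more concrete, here is a minimal pure-Python sketch of what each one does conceptually. The helper names `pad_samples` and `ctc_greedy_collapse` are hypothetical, and `blank_id=0` is an assumption for illustration only; the real blank/pad token ID comes from the model's tokenizer, and `processor.decode` handles this step for you.

```python
from itertools import groupby

SAMPLE_RATE = 16_000  # sampling rate used in the usage code


def pad_samples(samples, seconds=0.5, sample_rate=SAMPLE_RATE):
    """Add `seconds` of silence (zeros) to both ends of the signal.

    Mirrors np.pad(audio, pad_width=n) on a 1-D array, which pads
    n zeros before and n zeros after by default.
    """
    n = int(seconds * sample_rate)
    return [0.0] * n + list(samples) + [0.0] * n


def ctc_greedy_collapse(ids, blank_id=0):
    """Greedy CTC post-processing: merge consecutive repeats, then drop blanks.

    blank_id=0 is an assumed value, not taken from the model's vocabulary.
    """
    return [token for token, _ in groupby(ids) if token != blank_id]


one_second = [0.1] * SAMPLE_RATE
padded = pad_samples(one_second)  # 0.5 s + 1 s + 0.5 s = 32000 samples

frame_ids = [0, 5, 5, 0, 5, 7, 7, 0]       # per-frame argmax output
tokens = ctc_greedy_collapse(frame_ids)    # [5, 5, 7]
```

Note that the blank between the two runs of `5` is what keeps them as two separate tokens; without it, repeated frames collapse into one.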
Test Results
We report the Character Error Rate (CER) of our model and of other models in the wav2vec2 family.
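For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below illustrates that general definition only; the function names are hypothetical, and it does not reproduce the evaluation script or text normalization used for the reported results, which are not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution, or match if equal
                )
            )
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Two of the five reference characters are substituted.
score = cer("こんにちは", "こんばんは")  # 0.4
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.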
Citation
    @misc{japanese-hubert-base-k2-rs35kh,
      title={japanese-hubert-base-k2-rs35kh},
      author={Sasaki, Yuta},
      url={https://huggingface.co/reazon-research/japanese-hubert-base-k2-rs35kh},
      year={2025}
    }

    @article{yang2024k2ssl,
      title={k2SSL: A faster and better framework for self-supervised speech representation learning},
      author={Yang, Yifan and Zhuo, Jianheng and Jin, Zengrui and Ma, Ziyang and Yang, Xiaoyu and Yao, Zengwei and Guo, Liyong and Kang, Wei and Kuang, Fangjun and Lin, Long and others},
      journal={arXiv preprint arXiv:2411.17100},
      year={2024}
    }