reazon-research/japanese-hubert-base-k2-rs35kh

Model: reazon-research/japanese-hubert-base-k2-rs35kh
Source: Original Platform

2026-05-08 11:40:38 +08:00

`japanese-hubert-base-k2-rs35kh`

This model is a HuBERT Base model fine-tuned on the large-scale Japanese ASR corpus ReazonSpeech v2.0 using the k2 framework.

Usage

You can use this model through the `transformers` library:

import librosa
import numpy as np
import torch
from transformers import AutoProcessor, HubertForCTC

model = HubertForCTC.from_pretrained(
    "reazon-research/japanese-hubert-base-k2-rs35kh",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
processor = AutoProcessor.from_pretrained("reazon-research/japanese-hubert-base-k2-rs35kh")

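audio_filepath = "path/to/audio.wav"  # placeholder: replace with your own audio file
audio, _ = librosa.load(audio_filepath, sr=16_000)
# Pad 0.5 s of silence on both sides; padding before inference is recommended
audio = np.pad(audio, pad_width=int(0.5 * 16_000))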
input_values = processor(
    audio,
    return_tensors="pt",
    sampling_rate=16_000
).input_values.to("cuda").to(torch.bfloat16)

with torch.inference_mode():
    logits = model(input_values).logits.cpu()
predicted_ids = torch.argmax(logits, dim=-1)[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)
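
To make the last two lines of the snippet above more concrete, here is a minimal sketch of greedy CTC decoding: take the per-frame argmax ids, collapse consecutive repeats, then drop blank frames. The function name `ctc_greedy_collapse` and the blank id of 0 are assumptions made for illustration only; the real blank (pad) id is defined by the model's tokenizer, and `processor.decode` additionally maps ids back to characters.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse repeated ids, then remove blanks (greedy CTC decoding).

    blank_id=0 is an assumption for this sketch; check the tokenizer's pad token id.
    """
    collapsed = [key for key, _ in groupby(ids)]  # merge consecutive duplicates
    return [i for i in collapsed if i != blank_id]  # drop blank frames


# Hypothetical per-frame argmax ids; a blank between two 5s keeps them distinct.
frame_ids = [0, 5, 5, 0, 5, 7, 7, 0, 0]
print(ctc_greedy_collapse(frame_ids))  # [5, 5, 7]
```

Note that the blank between the repeated `5` frames is what allows a doubled character to survive the collapse step.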

Test Results

We report the Character Error Rate (CER) of our models alongside models from the wav2vec 2.0 family.

Model	#Parameters⬇	AVERAGE⬇	JSUT-BASIC5000⬇	Common Voice⬇	TEDxJP-10K⬇
reazon-research/japanese-wav2vec2-large-rs35kh	319M	16.25%	11.00%	18.23%	19.53%
reazon-research/japanese-wav2vec2-base-rs35kh	96.7M	20.40%	13.22%	23.76%	24.23%
reazon-research/japanese-hubert-base-k2-rs35kh	98.4M	11.23%	9.94%	11.59%	12.18%
reazon-research/japanese-hubert-base-k2-rs35kh-bpe	98.4M	11.07%	9.76%	11.36%	12.10%
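
For reference, CER is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below implements that standard definition; the helper names `edit_distance` and `cer` are illustrative, and any text normalization applied before scoring in the evaluation above is not described here, so this may not reproduce the reported numbers exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Two substituted characters out of five.
print(f"{cer('こんにちは', 'こんばんは'):.2%}")  # 40.00%
```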

We also report the CER for long-form speech.

Model	#Parameters⬇	JSUT-BOOK⬇
reazon-research/japanese-wav2vec2-large-rs35kh	319M	30.98%
reazon-research/japanese-wav2vec2-base-rs35kh	96.7M	82.84%
reazon-research/japanese-hubert-base-k2-rs35kh	98.4M	27.05%
+ Silero VAD		19.59%
reazon-research/japanese-hubert-base-k2-rs35kh-bpe	98.4M	84.55%
+ Silero VAD		19.34%
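
The "+ Silero VAD" rows suggest that long recordings are split at speech boundaries before transcription. How the segments were formed for this evaluation is not stated, so the following is only a sketch under assumptions: it takes speech timestamps as a list of `{"start", "end"}` dicts in samples (the shape returned by Silero VAD's `get_speech_timestamps`) and groups them into chunks of bounded length. The function name `group_speech_segments` and the 20-second limit are hypothetical choices for illustration.

```python
SAMPLE_RATE = 16_000


def group_speech_segments(timestamps, max_seconds=20.0, sample_rate=SAMPLE_RATE):
    """Merge consecutive speech segments into chunks of at most max_seconds.

    `timestamps` is a time-sorted list of {"start": int, "end": int} dicts in samples.
    max_seconds=20.0 is an arbitrary value chosen for this sketch.
    A single segment longer than the limit is kept as its own chunk.
    """
    max_samples = int(max_seconds * sample_rate)
    chunks = []
    for seg in timestamps:
        if chunks and seg["end"] - chunks[-1]["start"] <= max_samples:
            chunks[-1]["end"] = seg["end"]  # extend the current chunk
        else:
            chunks.append({"start": seg["start"], "end": seg["end"]})
    return chunks


# Hypothetical VAD output for a 25-second recording.
vad_output = [
    {"start": 0, "end": 80_000},         # 0 s to 5 s
    {"start": 96_000, "end": 240_000},   # 6 s to 15 s
    {"start": 256_000, "end": 400_000},  # 16 s to 25 s
]
for chunk in group_speech_segments(vad_output):
    print(chunk)  # each chunk would be sliced as audio[start:end] and transcribed
```

Each resulting chunk can then be passed through the inference code from the Usage section, with the per-chunk transcriptions concatenated in order.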

Citation

@misc{japanese-hubert-base-k2-rs35kh,
  title={japanese-hubert-base-k2-rs35kh},
  author={Sasaki, Yuta},
  url = {https://huggingface.co/reazon-research/japanese-hubert-base-k2-rs35kh},
  year = {2025}
}

@article{yang2024k2ssl,
  title={k2SSL: A faster and better framework for self-supervised speech representation learning},
  author={Yang, Yifan and Zhuo, Jianheng and Jin, Zengrui and Ma, Ziyang and Yang, Xiaoyu and Yao, Zengwei and Guo, Liyong and Kang, Wei and Kuang, Fangjun and Lin, Long and others},
  journal={arXiv preprint arXiv:2411.17100},
  year={2024}
}

License

Apache License 2.0
