Initialize project; model provided by the ModelHub XC community

Model: kotoba-tech/kotoba-whisper-v1.1 Source: Original Platform
2026-05-15 00:55:37 +08:00
commit da0101b756
19 changed files with 182908 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
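The attribute lines above route matching files through Git LFS. As a rough illustration of which repository files they would catch, here is a minimal sketch that tests file names against a handful of those patterns. It assumes Python's `fnmatch` is a close enough stand-in for gitattributes matching on simple base-name globs; it does not model path patterns such as `saved_model/**/*`, and `is_lfs_tracked` is a hypothetical helper, not part of Git or this repository.

```python
from fnmatch import fnmatchcase

# A few of the patterns from the .gitattributes above (not the full list).
LFS_PATTERNS = ["*.safetensors", "*.bin", "*.onnx", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?"""
    base = filename.rsplit("/", 1)[-1]
    return any(fnmatchcase(base, pattern) for pattern in LFS_PATTERNS)


for name in ["model.safetensors", "config.json", "kotoba_whisper.py"]:
    print(name, is_lfs_tracked(name))
```

Under these assumptions, the weight file is stored through LFS, while the JSON config and the Python pipeline file stay as regular Git blobs.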
--- a/README.md
+++ b/README.md
@@ -0,0 +1,195 @@
+---
+language: ja
+library_name: transformers
+license: apache-2.0
+tags:
+- audio
+- automatic-speech-recognition
+- hf-asr-leaderboard
+widget:
+- example_title: CommonVoice 8.0 (Test Split)
+  src: >-
+    https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
+- example_title: JSUT Basic 5000
+  src: >-
+    https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
+- example_title: ReazonSpeech (Test Split)
+  src: >-
+    https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
+pipeline_tag: automatic-speech-recognition
+datasets:
+- japanese-asr/whisper_transcriptions.reazonspeech.large
+- japanese-asr/whisper_transcriptions.reazonspeech.large.wer_10.0
+- japanese-asr/whisper_transcriptions.reazonspeech.large.wer_10.0.vectorized
+---
+
+# Kotoba-Whisper-v1.1
+_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with 
+additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include 
+adding punctuation with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main). 
+These libraries are merged into Kotoba-Whisper-v1.1 via the pipeline and are applied seamlessly to the predicted transcription from [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
+The pipeline was developed through a collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
+
+
+The following table presents the raw CER (unlike the usual CER, punctuation is not removed before computing the metric; see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py)).
+
+
+| model                                                                                                                                             |   [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) |   [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) |   [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
+|:--------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------:|
+| [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)                                                         |                                                                                                        17.6 |                                                                                    15.4 |                                                                                                        17.4 |
+| [kotoba-tech/kotoba-whisper-v2.1](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.1)                                            |                                                                                                        17.7 |                                                                                    15.4 |                                                                                                        17   |
+| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)                                                         |                                                                                                        17.8 |                                                                                    15.2 |                                                                                                        17.8 |
+| [kotoba-tech/kotoba-whisper-v1.1](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1)                                                         |                                                                                                        17.9 |                                                                                    15   |                                                                                                        17.8 |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                                                                         |                                                                                                        15.3 |                                                                                    13.4 |                                                                                                        20.5 |
+| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)                                                                         |                                                                                                        15.9 |                                                                                    10.6 |                                                                                                        34.6 |
+| [openai/whisper-large](https://huggingface.co/openai/whisper-large)                                                                               |                                                                                                        16.6 |                                                                                    11.3 |                                                                                                        40.7 |
+| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                                                                             |                                                                                                        17.9 |                                                                                    13.1 |                                                                                                        39.3 |
+| [openai/whisper-base](https://huggingface.co/openai/whisper-base)                                                                                 |                                                                                                        34.5 |                                                                                    26.4 |                                                                                                        76   |
+| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                                                                               |                                                                                                        21.5 |                                                                                    18.9 |                                                                                                        48.1 |
+| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                                                                                 |                                                                                                        58.8 |                                                                                    38.3 |                                                                                                       153.3 |
+
+As for the normalized CER, the changes introduced in v1.1 are removed by the normalization, so kotoba-tech/kotoba-whisper-v1.1 has the same CER values as [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
+
+### Latency
+Kotoba-whisper-v1.1 improves the punctuation and the timestamps of the output from Kotoba-whisper-v1.0. However, since we apply the punctuator and stable-ts to each chunk,
+we need to obtain the timestamps, which increases the latency relative to the original kotoba-whisper-v1.0. The following table compares the inference speed when 
+transcribing **50min** of Japanese speech audio, reporting the average over five independent runs.
+
+| model                                                    | return_timestamps   |   time (mean) |
+|:---------------------------------------------------------|:--------------------|--------------:|
+| kotoba-tech/kotoba-whisper-v1.0                          | False               |          10.8 |
+| kotoba-tech/kotoba-whisper-v1.0                          | True                |          15.7 |
+| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True                |          17.9 |
+| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             | True                |          17.7 |
+| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              | True                |          16.1 |
+| openai/whisper-large-v3                                  | False               |          29.1 |
+| openai/whisper-large-v3                                  | True                |          37.9 |
+
+
+See the full table [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/raw/main/latency.csv).
+
+## Transformers Usage
+Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first 
+install the latest version of Transformers.
+
+```bash
+pip install --upgrade pip
+pip install --upgrade transformers accelerate torchaudio
+pip install stable-ts==2.16.0
+pip install punctuators==0.0.5
+```
+
+### Transcription
+The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+class to transcribe audio files as follows:
+
+```python
+import torch
+from transformers import pipeline
+from datasets import load_dataset
+
+# config
+model_id = "kotoba-tech/kotoba-whisper-v1.1"
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+generate_kwargs = {"language": "ja", "task": "transcribe"}
+
+# load model
+pipe = pipeline(
+    model=model_id,
+    torch_dtype=torch_dtype,
+    device=device,
+    model_kwargs=model_kwargs,
+    batch_size=16,
+    trust_remote_code=True,
+    punctuator=True
+)
+
+# load sample audio
+dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
+sample = dataset[0]["audio"]
+
+# run inference
+result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
+print(result)
+```
+
+- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
+```diff
+- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
++ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
+```
+
+- To deactivate punctuator:
+```diff
+-     punctuator=True,
++     punctuator=False,
+```
+
+### Transcription with Prompt
+Kotoba-whisper can generate a transcription conditioned on a prompt, as shown below:
+
+```python
+import re
+import torch
+from transformers import pipeline
+from datasets import load_dataset
+
+# config
+model_id = "kotoba-tech/kotoba-whisper-v1.1"
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+generate_kwargs = {"language": "japanese", "task": "transcribe"}
+
+# load model
+pipe = pipeline(
+    model=model_id,
+    torch_dtype=torch_dtype,
+    device=device,
+    model_kwargs=model_kwargs,
+    batch_size=16,
+    trust_remote_code=True
+)
+
+# load sample audio
+dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
+
+# --- Without prompt ---
+text = pipe(dataset[10]["audio"], chunk_length_s=15, generate_kwargs=generate_kwargs)['text']
+print(text)
+# 81歳、力強い走りに変わってきます。
+
+# --- With prompt ---: Let's change `81` to `91`.
+prompt = "91歳"
+generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
+text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
+# currently the ASR pipeline prepends the prompt to the transcription, so remove it
+text = re.sub(rf"\A\s*{re.escape(prompt)}\s*", "", text)
+print(text)
+# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
+```
+
+### Flash Attention 2
+We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) 
+if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
+
+```bash
+pip install flash-attn --no-build-isolation
+```
+
+Then pass `attn_implementation="flash_attention_2"` through `model_kwargs`, which the pipeline forwards to `from_pretrained`:
+
+```diff
+- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
++ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
+```
+
+
+## Acknowledgements
+* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
+* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
+* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
+* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).
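The README above distinguishes raw CER (punctuation kept) from normalized CER (punctuation removed before scoring). The sketch below illustrates why adding or dropping punctuation moves only the raw figure. It is not the repository's `run_short_form_eval.py`: the Levenshtein implementation, the punctuation set in `PUNCT`, and the hypothesis string are assumptions made for illustration only.

```python
import re


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Assumed punctuation set; the real normalizer may remove more than this.
PUNCT = re.compile(r"[!?、。！？]")


def strip_punct(text: str) -> str:
    return PUNCT.sub("", text)


reference = "81歳、力強い走りに変わってきます。"
hypothesis = "81歳力強い走りに変わってきます"  # hypothetical output without punctuation

raw = cer(reference, hypothesis)
normalized = cer(strip_punct(reference), strip_punct(hypothesis))
print(f"raw CER: {raw:.1f}  normalized CER: {normalized:.1f}")
```

In this toy case the two strings differ only in punctuation, so the raw CER is non-zero while the normalized CER is zero, which mirrors the README's remark that v1.0 and v1.1 score the same after normalization.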
--- a/added_tokens.json
+++ b/added_tokens.json
--- a/config.json
+++ b/config.json
@@ -0,0 +1,61 @@
+{
+  "_name_or_path": "kotoba-tech/kotoba-whisper-v1.0",
+  "activation_dropout": 0.0,
+  "activation_function": "gelu",
+  "apply_spec_augment": false,
+  "architectures": [
+    "WhisperForConditionalGeneration"
+  ],
+  "attention_dropout": 0.0,
+  "begin_suppress_tokens": [
+    220,
+    50257
+  ],
+  "bos_token_id": 50257,
+  "classifier_proj_size": 256,
+  "custom_pipelines": {
+    "kotoba-whisper": {
+      "impl": "kotoba_whisper.KotobaWhisperPipeline",
+      "pt": [
+        "WhisperForConditionalGeneration"
+      ],
+      "tf": [
+        "TFWhisperForConditionalGeneration"
+      ]
+    }
+  },
+  "d_model": 1280,
+  "decoder_attention_heads": 20,
+  "decoder_ffn_dim": 5120,
+  "decoder_layerdrop": 0.0,
+  "decoder_layers": 2,
+  "decoder_start_token_id": 50258,
+  "dropout": 0.0,
+  "encoder_attention_heads": 20,
+  "encoder_ffn_dim": 5120,
+  "encoder_layerdrop": 0.0,
+  "encoder_layers": 32,
+  "eos_token_id": 50257,
+  "init_std": 0.02,
+  "is_encoder_decoder": true,
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.05,
+  "max_length": 448,
+  "max_source_positions": 1500,
+  "max_target_positions": 448,
+  "median_filter_width": 7,
+  "model_type": "whisper",
+  "num_hidden_layers": 32,
+  "num_mel_bins": 128,
+  "pad_token_id": 50256,
+  "scale_embedding": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.41.0.dev0",
+  "use_cache": true,
+  "use_weighted_layer_sum": false,
+  "vocab_size": 51866
+}
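The config above registers a custom pipeline under `custom_pipelines` and describes a model with a 32-layer encoder and a 2-layer decoder. A minimal sketch of reading those fields follows; it parses a hand-copied excerpt of the JSON rather than the real file, and the derived per-head dimension is plain arithmetic on the listed values, not something the config states itself.

```python
import json

# Excerpt of the config.json above; only the fields used below are reproduced.
CONFIG_EXCERPT = """
{
  "custom_pipelines": {
    "kotoba-whisper": {
      "impl": "kotoba_whisper.KotobaWhisperPipeline",
      "pt": ["WhisperForConditionalGeneration"]
    }
  },
  "d_model": 1280,
  "decoder_attention_heads": 20,
  "decoder_layers": 2,
  "encoder_attention_heads": 20,
  "encoder_layers": 32
}
"""

config = json.loads(CONFIG_EXCERPT)

# Per-head width implied by the hidden size and the number of attention heads.
head_dim = config["d_model"] // config["encoder_attention_heads"]

# "impl" is "<module>.<class>"; the module corresponds to kotoba_whisper.py in this commit.
module_name, class_name = config["custom_pipelines"]["kotoba-whisper"]["impl"].rsplit(".", 1)

print(f"encoder layers: {config['encoder_layers']}, decoder layers: {config['decoder_layers']}")
print(f"head dim: {head_dim}, pipeline: {class_name} from {module_name}.py")
```

Because `impl` points at code shipped in the repository rather than in Transformers, the README examples pass `trust_remote_code=True` when building the pipeline.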
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,265 @@
+{
+  "alignment_heads": [
+    [
+      7,
+      0
+    ],
+    [
+      10,
+      17
+    ],
+    [
+      12,
+      18
+    ],
+    [
+      13,
+      12
+    ],
+    [
+      16,
+      1
+    ],
+    [
+      17,
+      14
+    ],
+    [
+      19,
+      11
+    ],
+    [
+      21,
+      4
+    ],
+    [
+      24,
+      1
+    ],
+    [
+      25,
+      6
+    ]
+  ],
+  "begin_suppress_tokens": [
+    220,
+    50257
+  ],
+  "bos_token_id": 50257,
+  "decoder_start_token_id": 50258,
+  "eos_token_id": 50257,
+  "forced_decoder_ids": [
+    [
+      1,
+      null
+    ],
+    [
+      2,
+      50360
+    ]
+  ],
+  "is_multilingual": true,
+  "lang_to_id": {
+    "<|af|>": 50327,
+    "<|am|>": 50334,
+    "<|ar|>": 50272,
+    "<|as|>": 50350,
+    "<|az|>": 50304,
+    "<|ba|>": 50355,
+    "<|be|>": 50330,
+    "<|bg|>": 50292,
+    "<|bn|>": 50302,
+    "<|bo|>": 50347,
+    "<|br|>": 50309,
+    "<|bs|>": 50315,
+    "<|ca|>": 50270,
+    "<|cs|>": 50283,
+    "<|cy|>": 50297,
+    "<|da|>": 50285,
+    "<|de|>": 50261,
+    "<|el|>": 50281,
+    "<|en|>": 50259,
+    "<|es|>": 50262,
+    "<|et|>": 50307,
+    "<|eu|>": 50310,
+    "<|fa|>": 50300,
+    "<|fi|>": 50277,
+    "<|fo|>": 50338,
+    "<|fr|>": 50265,
+    "<|gl|>": 50319,
+    "<|gu|>": 50333,
+    "<|haw|>": 50352,
+    "<|ha|>": 50354,
+    "<|he|>": 50279,
+    "<|hi|>": 50276,
+    "<|hr|>": 50291,
+    "<|ht|>": 50339,
+    "<|hu|>": 50286,
+    "<|hy|>": 50312,
+    "<|id|>": 50275,
+    "<|is|>": 50311,
+    "<|it|>": 50274,
+    "<|ja|>": 50266,
+    "<|jw|>": 50356,
+    "<|ka|>": 50329,
+    "<|kk|>": 50316,
+    "<|km|>": 50323,
+    "<|kn|>": 50306,
+    "<|ko|>": 50264,
+    "<|la|>": 50294,
+    "<|lb|>": 50345,
+    "<|ln|>": 50353,
+    "<|lo|>": 50336,
+    "<|lt|>": 50293,
+    "<|lv|>": 50301,
+    "<|mg|>": 50349,
+    "<|mi|>": 50295,
+    "<|mk|>": 50308,
+    "<|ml|>": 50296,
+    "<|mn|>": 50314,
+    "<|mr|>": 50320,
+    "<|ms|>": 50282,
+    "<|mt|>": 50343,
+    "<|my|>": 50346,
+    "<|ne|>": 50313,
+    "<|nl|>": 50271,
+    "<|nn|>": 50342,
+    "<|no|>": 50288,
+    "<|oc|>": 50328,
+    "<|pa|>": 50321,
+    "<|pl|>": 50269,
+    "<|ps|>": 50340,
+    "<|pt|>": 50267,
+    "<|ro|>": 50284,
+    "<|ru|>": 50263,
+    "<|sa|>": 50344,
+    "<|sd|>": 50332,
+    "<|si|>": 50322,
+    "<|sk|>": 50298,
+    "<|sl|>": 50305,
+    "<|sn|>": 50324,
+    "<|so|>": 50326,
+    "<|sq|>": 50317,
+    "<|sr|>": 50303,
+    "<|su|>": 50357,
+    "<|sv|>": 50273,
+    "<|sw|>": 50318,
+    "<|ta|>": 50287,
+    "<|te|>": 50299,
+    "<|tg|>": 50331,
+    "<|th|>": 50289,
+    "<|tk|>": 50341,
+    "<|tl|>": 50348,
+    "<|tr|>": 50268,
+    "<|tt|>": 50351,
+    "<|uk|>": 50280,
+    "<|ur|>": 50290,
+    "<|uz|>": 50337,
+    "<|vi|>": 50278,
+    "<|yi|>": 50335,
+    "<|yo|>": 50325,
+    "<|yue|>": 50358,
+    "<|zh|>": 50260
+  },
+  "max_initial_timestamp_index": 50,
+  "max_length": 448,
+  "no_timestamps_token_id": 50364,
+  "pad_token_id": 50257,
+  "prev_sot_token_id": 50362,
+  "return_timestamps": false,
+  "suppress_tokens": [
+    1,
+    2,
+    7,
+    8,
+    9,
+    10,
+    14,
+    25,
+    26,
+    27,
+    28,
+    29,
+    31,
+    58,
+    59,
+    60,
+    61,
+    62,
+    63,
+    90,
+    91,
+    92,
+    93,
+    359,
+    503,
+    522,
+    542,
+    873,
+    893,
+    902,
+    918,
+    922,
+    931,
+    1350,
+    1853,
+    1982,
+    2460,
+    2627,
+    3246,
+    3253,
+    3268,
+    3536,
+    3846,
+    3961,
+    4183,
+    4667,
+    6585,
+    6647,
+    7273,
+    9061,
+    9383,
+    10428,
+    10929,
+    11938,
+    12033,
+    12331,
+    12562,
+    13793,
+    14157,
+    14635,
+    15265,
+    15618,
+    16553,
+    16604,
+    18362,
+    18956,
+    20075,
+    21675,
+    22520,
+    26130,
+    26161,
+    26435,
+    28279,
+    29464,
+    31650,
+    32302,
+    32470,
+    36865,
+    42863,
+    47425,
+    49870,
+    50254,
+    50258,
+    50359,
+    50360,
+    50361,
+    50362,
+    50363
+  ],
+  "task_to_id": {
+    "transcribe": 50360,
+    "translate": 50359
+  },
+  "transformers_version": "4.41.0.dev0"
+}
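The `lang_to_id`, `task_to_id`, and `no_timestamps_token_id` entries above are what turn `generate_kwargs = {"language": "ja", "task": "transcribe"}` into decoder prompt tokens. The sketch below assembles such a prefix by hand, assuming the usual Whisper layout of start-of-transcript, language, task, and an optional no-timestamps token. `decoder_prefix` is a hypothetical helper for illustration; Transformers builds this prefix internally and this code does not call it.

```python
from typing import Dict, List

# Values copied from the generation_config.json above (language map truncated).
GENERATION_CONFIG_EXCERPT: Dict = {
    "decoder_start_token_id": 50258,
    "lang_to_id": {"<|ja|>": 50266, "<|en|>": 50259},
    "task_to_id": {"transcribe": 50360, "translate": 50359},
    "no_timestamps_token_id": 50364,
}


def decoder_prefix(language: str, task: str, return_timestamps: bool) -> List[int]:
    """Build the token ids that precede the transcription in the decoder input."""
    cfg = GENERATION_CONFIG_EXCERPT
    ids = [
        cfg["decoder_start_token_id"],
        cfg["lang_to_id"][f"<|{language}|>"],
        cfg["task_to_id"][task],
    ]
    if not return_timestamps:
        # Without timestamps, the no-timestamps token closes the prefix.
        ids.append(cfg["no_timestamps_token_id"])
    return ids


print(decoder_prefix("ja", "transcribe", return_timestamps=False))
print(decoder_prefix("ja", "transcribe", return_timestamps=True))
```

The `forced_decoder_ids` entry in the config follows the same positions: position 1 (the language) is left open as `null`, and position 2 is fixed to the `transcribe` token 50360.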
--- a/kotoba_whisper.py
+++ b/kotoba_whisper.py
@@ -0,0 +1,306 @@
+from typing import Union, Optional, Dict, List, Any
+import requests
+
+import torch
+import numpy as np
+
+from transformers.pipelines.audio_utils import ffmpeg_read
+from transformers.pipelines.automatic_speech_recognition import AutomaticSpeechRecognitionPipeline, chunk_iter
+from transformers.utils import is_torchaudio_available
+from transformers.modeling_utils import PreTrainedModel
+from transformers.tokenization_utils import PreTrainedTokenizer
+from transformers.feature_extraction_sequence_utils import SequenceFeatureExtractor
+from stable_whisper import WhisperResult
+from punctuators.models import PunctCapSegModelONNX
+
+
+class Punctuator:
+
+    ja_punctuations = ["!", "?", "、", "。"]
+
+    def __init__(self, model: str = "pcs_47lang"):
+        self.punctuation_model = PunctCapSegModelONNX.from_pretrained(model)
+
+    def punctuate(self, pipeline_chunk: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+
+        def validate_punctuation(raw: str, punctuated: str):
+            if 'unk' in punctuated.lower() or any(p in raw for p in self.ja_punctuations):
+                return raw
+            if punctuated.count("。") > 1:
+                ind = punctuated.rfind("。")
+                punctuated = punctuated.replace("。", "")
+                punctuated = punctuated[:ind] + "。" + punctuated[ind:]
+            return punctuated
+
+        text_edit = self.punctuation_model.infer([c['text'] for c in pipeline_chunk])
+        return [
+            {
+                'timestamp': c['timestamp'],
+                'text': validate_punctuation(c['text'], "".join(e))
+            } for c, e in zip(pipeline_chunk, text_edit)
+        ]
+
+
+def _fix_timestamp(sample_rate: int, result: List[Dict[str, Any]], audio: np.ndarray) -> Optional[WhisperResult]:
+
+    def replace_none_ts(parts):
+        total_dur = round(audio.shape[-1] / sample_rate, 3)
+        _medium_dur = _ts_nonzero_mask = None
+
+        def ts_nonzero_mask() -> np.ndarray:
+            nonlocal _ts_nonzero_mask
+            if _ts_nonzero_mask is None:
+                _ts_nonzero_mask = np.array([(p['end'] or p['start']) is not None for p in parts])
+            return _ts_nonzero_mask
+
+        def medium_dur() -> float:
+            nonlocal _medium_dur
+            if _medium_dur is None:
+                nonzero_durs = [p['end'] - p['start'] for p in parts if None not in (p['end'], p['start'])]
+                nonzero_durs = np.array(nonzero_durs)
+                _medium_dur = np.median(nonzero_durs) * 2 if len(nonzero_durs) else 2.0
+            return _medium_dur
+
+        def _curr_max_end(start: float, next_idx: int) -> float:
+            max_end = total_dur
+            if next_idx != len(parts):
+                mask = np.flatnonzero(ts_nonzero_mask()[next_idx:])
+                if len(mask):
+                    _part = parts[mask[0]+next_idx]
+                    max_end = _part['start'] or _part['end']
+
+            new_end = round(start + medium_dur(), 3)
+            if new_end > max_end:
+                return max_end
+            return new_end
+
+        for i, part in enumerate(parts, 1):
+            if part['start'] is None:
+                is_first = i == 1
+                if is_first:
+                    new_start = round((part['end'] or 0) - medium_dur(), 3)
+                    part['start'] = max(new_start, 0.0)
+                else:
+                    part['start'] = parts[i - 2]['end']
+            if part['end'] is None:
+                no_next_start = i == len(parts) or parts[i]['start'] is None
+                part['end'] = _curr_max_end(part['start'], i) if no_next_start else parts[i]['start']
+
+    words = [dict(start=word['timestamp'][0], end=word['timestamp'][1], word=word['text']) for word in result]
+    replace_none_ts(words)
+    return WhisperResult([words], force_order=True, check_sorted=True)
+
+
+def fix_timestamp(pipeline_output: List[Dict[str, Any]], audio: np.ndarray, sample_rate: int) -> List[Dict[str, Any]]:
+    result = _fix_timestamp(sample_rate=sample_rate, audio=audio, result=pipeline_output)
+    result.adjust_by_silence(
+        audio,
+        q_levels=20,
+        k_size=5,
+        sample_rate=sample_rate,
+        min_word_dur=None,
+        word_level=True,
+        verbose=True,
+        nonspeech_error=0.1,
+        use_word_position=True
+    )
+    if result.has_words:
+        result.regroup(True)
+    return [{"timestamp": [s.start, s.end], "text": s.text} for s in result.segments]
+
+
+class KotobaWhisperPipeline(AutomaticSpeechRecognitionPipeline):
+
+    def __init__(self,
+                 model: "PreTrainedModel",
+                 feature_extractor: Union["SequenceFeatureExtractor", str] = None,
+                 tokenizer: Optional[PreTrainedTokenizer] = None,
+                 device: Union[int, "torch.device"] = None,
+                 torch_dtype: Optional[Union[str, "torch.dtype"]] = None,
+                 punctuator: bool = True,
+                 stable_ts: bool = False,
+                 **kwargs):
+        self.type = "seq2seq_whisper"
+        self.stable_ts = stable_ts
+        if punctuator:
+            self.punctuator = Punctuator()
+        else:
+            self.punctuator = None
+        super().__init__(
+            model=model,
+            feature_extractor=feature_extractor,
+            tokenizer=tokenizer,
+            device=device,
+            torch_dtype=torch_dtype,
+            **kwargs
+        )
+
+    def preprocess(self, inputs, chunk_length_s=0, stride_length_s=None):
+        if isinstance(inputs, str):
+            if inputs.startswith("http://") or inputs.startswith("https://"):
+                # We need to actually check for a real protocol, otherwise it's impossible to use a local file
+                # like http_huggingface_co.png
+                inputs = requests.get(inputs).content
+            else:
+                with open(inputs, "rb") as f:
+                    inputs = f.read()
+
+        if isinstance(inputs, bytes):
+            inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
+
+        stride = None
+        extra = {}
+        if isinstance(inputs, dict):
+            stride = inputs.pop("stride", None)
+            # Accepting `"array"` which is the key defined in `datasets` for
+            # better integration
+            if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
+                raise ValueError(
+                    "When passing a dictionary to AutomaticSpeechRecognitionPipeline, the dict needs to contain a "
+                    '"raw" key containing the numpy array representing the audio and a "sampling_rate" key, '
+                    "containing the sampling_rate associated with that array"
+                )
+
+            _inputs = inputs.pop("raw", None)
+            if _inputs is None:
+                # Remove path which will not be used from `datasets`.
+                inputs.pop("path", None)
+                _inputs = inputs.pop("array", None)
+            in_sampling_rate = inputs.pop("sampling_rate")
+            extra = inputs
+            inputs = _inputs
+            if in_sampling_rate != self.feature_extractor.sampling_rate:
+                if is_torchaudio_available():
+                    from torchaudio import functional as F
+                else:
+                    raise ImportError(
+                        "torchaudio is required to resample audio samples in AutomaticSpeechRecognitionPipeline. "
+                        "The torchaudio package can be installed through: `pip install torchaudio`."
+                    )
+
+                inputs = F.resample(
+                    torch.from_numpy(inputs), in_sampling_rate, self.feature_extractor.sampling_rate
+                ).numpy()
+                ratio = self.feature_extractor.sampling_rate / in_sampling_rate
+            else:
+                ratio = 1
+            if stride is not None:
+                if stride[0] + stride[1] > inputs.shape[0]:
+                    raise ValueError("Stride is too large for input")
+
+                # Stride needs to get the chunk length here, it's going to get
+                # swallowed by the `feature_extractor` later, and then batching
+                # can add extra data in the inputs, so we need to keep track
+                # of the original length in the stride so we can cut properly.
+                stride = (inputs.shape[0], int(round(stride[0] * ratio)), int(round(stride[1] * ratio)))
+        if not isinstance(inputs, np.ndarray):
+            raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
+        if len(inputs.shape) != 1:
+            raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
+
+        if chunk_length_s:
+            if stride_length_s is None:
+                stride_length_s = chunk_length_s / 6
+
+            if isinstance(stride_length_s, (int, float)):
+                stride_length_s = [stride_length_s, stride_length_s]
+
+            # XXX: Careful, this variable will not exist in `seq2seq` setting.
+            # Currently chunking is not possible at this level for `seq2seq` so
+            # it's ok.
+            align_to = getattr(self.model.config, "inputs_to_logits_ratio", 1)
+            chunk_len = int(round(chunk_length_s * self.feature_extractor.sampling_rate / align_to) * align_to)
+            stride_left = int(round(stride_length_s[0] * self.feature_extractor.sampling_rate / align_to) * align_to)
+            stride_right = int(round(stride_length_s[1] * self.feature_extractor.sampling_rate / align_to) * align_to)
+
+            if chunk_len < stride_left + stride_right:
+                raise ValueError("Chunk length must be superior to stride length")
+
+            for item in chunk_iter(
+                    inputs, self.feature_extractor, chunk_len, stride_left, stride_right, self.torch_dtype
+            ):
+                item["audio_array"] = inputs
+                yield item
+        else:
+            if inputs.shape[0] > self.feature_extractor.n_samples:
+                processed = self.feature_extractor(
+                    inputs,
+                    sampling_rate=self.feature_extractor.sampling_rate,
+                    truncation=False,
+                    padding="longest",
+                    return_tensors="pt",
+                )
+            else:
+                processed = self.feature_extractor(
+                    inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
+                )
+
+            if self.torch_dtype is not None:
+                processed = processed.to(dtype=self.torch_dtype)
+            if stride is not None:
+                processed["stride"] = stride
+            yield {"is_last": True, "audio_array": inputs, **processed, **extra}
+
+    def _forward(self, model_inputs, return_timestamps=False, **generate_kwargs):
+        attention_mask = model_inputs.pop("attention_mask", None)
+        stride = model_inputs.pop("stride", None)
+        is_last = model_inputs.pop("is_last")
+        audio_array = model_inputs.pop("audio_array")
+        encoder = self.model.get_encoder()
+        # Consume values so we can let extra information flow freely through
+        # the pipeline (important for `partial` in microphone)
+        if type(return_timestamps) is not bool:
+            raise ValueError("return_timestamps should be bool")
+        if "input_features" in model_inputs:
+            inputs = model_inputs.pop("input_features")
+        elif "input_values" in model_inputs:
+            inputs = model_inputs.pop("input_values")
+        else:
+            raise ValueError(
+                "Seq2Seq speech recognition model requires either a "
+                f"`input_features` or `input_values` key, but only has {model_inputs.keys()}"
+            )
+
+        # custom processing for Whisper timestamps and word-level timestamps
+        generate_kwargs["return_timestamps"] = True
+        if inputs.shape[-1] > self.feature_extractor.nb_max_frames:
+            generate_kwargs["input_features"] = inputs
+        else:
+            generate_kwargs["encoder_outputs"] = encoder(inputs, attention_mask=attention_mask)
+
+        tokens = self.model.generate(attention_mask=attention_mask, **generate_kwargs)
+        # whisper longform generation stores timestamps in "segments"
+        out = {"tokens": tokens}
+        if self.type == "seq2seq_whisper":
+            if stride is not None:
+                out["stride"] = stride
+
+        # Leftover
+        extra = model_inputs
+        return {"is_last": is_last, "audio_array": audio_array, **out, **extra}
+
+    def postprocess(self,
+                    model_outputs,
+                    decoder_kwargs: Optional[Dict] = None,
+                    return_timestamps=None,
+                    return_language=None):
+        assert len(model_outputs) > 0
+        for model_output in model_outputs:
+            audio_array = model_output.pop("audio_array")[0]
+        outputs = super().postprocess(
+            model_outputs=model_outputs,
+            decoder_kwargs=decoder_kwargs,
+            return_timestamps=True,
+            return_language=return_language
+        )
+        if self.stable_ts:
+            outputs["chunks"] = fix_timestamp(
+                pipeline_output=outputs["chunks"], audio=audio_array, sample_rate=self.feature_extractor.sampling_rate
+            )
+        if self.punctuator:
+            outputs["chunks"] = self.punctuator.punctuate(outputs["chunks"])
+        outputs["text"] = "".join([c["text"] for c in outputs["chunks"]])
+        if not return_timestamps:
+            outputs.pop("chunks")
+        return outputs
+
--- a/latency.csv
+++ b/latency.csv
@@ -0,0 +1,20 @@
+model,chunk_length_s,stable_ts,punctuator,attention,device,file,return_timestamps,batch,time (mean),time (std),time (all)
+openai/whisper-large-v3,15,,,,cuda:0,long_interview_1.mp3,True,32,37.87098135948181,0.9975159668837961,"[39.03658604621887, 38.64555096626282, 37.84633827209473, 37.21843934059143, 36.60799217224121]"
+openai/whisper-large-v3,15,,,flash_attention_2,cuda:0,long_interview_1.mp3,False,32,28.436019563674925,0.47368126646613146,"[28.606680631637573, 28.296039581298828, 29.176307678222656, 28.008386850357056, 28.09268307685852]"
+openai/whisper-large-v3,15,,,sdpa,cuda:0,long_interview_1.mp3,False,32,28.914933681488037,0.21978470408766382,"[29.01437497138977, 28.632374048233032, 29.222826719284058, 28.87388014793396, 28.831212520599365]"
+openai/whisper-large-v3,15,,,,cuda:0,long_interview_1.mp3,False,32,29.102856159210205,0.9922645461609332,"[28.25994610786438, 28.26285481452942, 29.124175310134888, 30.689085006713867, 29.17821955680847]"
+kotoba-tech/kotoba-whisper-v1.1,15,True,True,flash_attention_2,cuda:0,long_interview_1.mp3,True,256,17.678295278549193,0.22182761219337574,"[18.054609298706055, 17.626131772994995, 17.602790117263794, 17.464856386184692, 17.643088817596436]"
+kotoba-tech/kotoba-whisper-v1.1,15,True,True,sdpa,cuda:0,long_interview_1.mp3,True,256,18.204185914993285,0.17810196179511653,"[18.498502254486084, 18.186529874801636, 18.095500469207764, 18.03645157814026, 18.20394539833069]"
+kotoba-tech/kotoba-whisper-v1.1,15,True,True,,cuda:0,long_interview_1.mp3,True,256,17.916395521163942,0.3377347175013587,"[18.50807523727417, 17.8379385471344, 17.65992569923401, 17.81332492828369, 17.762713193893433]"
+kotoba-tech/kotoba-whisper-v1.1,15,True,False,flash_attention_2,cuda:0,long_interview_1.mp3,True,256,16.014582490921022,0.35303348844305815,"[16.54080867767334, 16.039757013320923, 16.108657360076904, 15.648179054260254, 15.735510349273682]"
+kotoba-tech/kotoba-whisper-v1.1,15,True,False,sdpa,cuda:0,long_interview_1.mp3,True,256,16.489454126358034,0.2974532799091081,"[17.013059377670288, 16.346179008483887, 16.448065280914307, 16.325897932052612, 16.314069032669067]"
+kotoba-tech/kotoba-whisper-v1.1,15,True,False,,cuda:0,long_interview_1.mp3,True,256,16.10833501815796,0.28295738976500673,"[16.60477614402771, 15.89296007156372, 15.998892784118652, 16.041422605514526, 16.003623485565186]"
+kotoba-tech/kotoba-whisper-v1.1,15,False,True,flash_attention_2,cuda:0,long_interview_1.mp3,True,256,17.634469079971314,0.37421288687576754,"[18.294931173324585, 17.515852212905884, 17.49791431427002, 17.49921679496765, 17.364430904388428]"
+kotoba-tech/kotoba-whisper-v1.1,15,False,True,sdpa,cuda:0,long_interview_1.mp3,True,256,18.02405333518982,0.20277988430790378,"[18.34075951576233, 17.888415813446045, 17.907806873321533, 18.11445140838623, 17.86883306503296]"
+kotoba-tech/kotoba-whisper-v1.1,15,False,True,,cuda:0,long_interview_1.mp3,True,256,17.70981011390686,0.37067887702647784,"[18.362802028656006, 17.600308895111084, 17.594009399414062, 17.552326440811157, 17.439603805541992]"
+kotoba-tech/kotoba-whisper-v1.0,15,,,flash_attention_2,cuda:0,long_interview_1.mp3,True,256,15.530676031112671,0.3633174816819969,"[16.119226217269897, 15.388941764831543, 15.455092668533325, 15.553504943847656, 15.136614561080933]"
+kotoba-tech/kotoba-whisper-v1.0,15,,,sdpa,cuda:0,long_interview_1.mp3,True,256,15.825398015975953,0.2933998625987351,"[16.23697257041931, 15.700171947479248, 15.733088254928589, 15.471559524536133, 15.98519778251648]"
+kotoba-tech/kotoba-whisper-v1.0,15,,,,cuda:0,long_interview_1.mp3,True,256,15.736223077774047,0.23519827876867205,"[16.14710545539856, 15.576295137405396, 15.654626369476318, 15.59735655784607, 15.705731868743896]"
+kotoba-tech/kotoba-whisper-v1.0,15,,,flash_attention_2,cuda:0,long_interview_1.mp3,False,256,10.713608121871948,0.278930456482726,"[11.21053671836853, 10.62143874168396, 10.551162481307983, 10.596262693405151, 10.588639974594116]"
+kotoba-tech/kotoba-whisper-v1.0,15,,,sdpa,cuda:0,long_interview_1.mp3,False,256,10.769530439376831,0.20777072275326197,"[11.120648622512817, 10.576390266418457, 10.681758880615234, 10.703449249267578, 10.765405178070068]"
+kotoba-tech/kotoba-whisper-v1.0,15,,,,cuda:0,long_interview_1.mp3,False,256,10.77918643951416,0.20745621012396212,"[11.07450532913208, 10.564246892929077, 10.596351146697998, 10.824815034866333, 10.836013793945312]"
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1de8b4eb1b4c069060fc6da4f345c0d3a4153473c9f0a349554649064b4371a0
+size 3025686376
--- a/normalizer.json
+++ b/normalizer.json
--- a/pipeline/kotoba_whisper.py
+++ b/pipeline/kotoba_whisper.py
@@ -0,0 +1,306 @@
+from typing import Union, Optional, Dict, List, Any
+import requests
+
+import torch
+import numpy as np
+
+from transformers.pipelines.audio_utils import ffmpeg_read
+from transformers.pipelines.automatic_speech_recognition import AutomaticSpeechRecognitionPipeline, chunk_iter
+from transformers.utils import is_torchaudio_available
+from transformers.modeling_utils import PreTrainedModel
+from transformers.tokenization_utils import PreTrainedTokenizer
+from transformers.feature_extraction_sequence_utils import SequenceFeatureExtractor
+from stable_whisper import WhisperResult
+from punctuators.models import PunctCapSegModelONNX
+
+
+class Punctuator:
+
+    ja_punctuations = ["!", "?", "、", "。"]
+
+    def __init__(self, model: str = "pcs_47lang"):
+        self.punctuation_model = PunctCapSegModelONNX.from_pretrained(model)
+
+    def punctuate(self, pipeline_chunk: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+
+        def validate_punctuation(raw: str, punctuated: str):
+            if 'unk' in punctuated.lower() or any(p in raw for p in self.ja_punctuations):
+                return raw
+            if punctuated.count("。") > 1:
+                ind = punctuated.rfind("。")
+                punctuated = punctuated.replace("。", "")
+                punctuated = punctuated[:ind] + "。" + punctuated[ind:]
+            return punctuated
+
+        text_edit = self.punctuation_model.infer([c['text'] for c in pipeline_chunk])
+        return [
+            {
+                'timestamp': c['timestamp'],
+                'text': validate_punctuation(c['text'], "".join(e))
+            } for c, e in zip(pipeline_chunk, text_edit)
+        ]
+
+
+def _fix_timestamp(sample_rate: int, result: List[Dict[str, Any]], audio: np.ndarray) -> WhisperResult:
+
+    def replace_none_ts(parts):
+        total_dur = round(audio.shape[-1] / sample_rate, 3)
+        _medium_dur = _ts_nonzero_mask = None
+
+        def ts_nonzero_mask() -> np.ndarray:
+            nonlocal _ts_nonzero_mask
+            if _ts_nonzero_mask is None:
+                _ts_nonzero_mask = np.array([(p['end'] or p['start']) is not None for p in parts])
+            return _ts_nonzero_mask
+
+        def medium_dur() -> float:
+            nonlocal _medium_dur
+            if _medium_dur is None:
+                nonzero_durs = [p['end'] - p['start'] for p in parts if None not in (p['end'], p['start'])]
+                nonzero_durs = np.array(nonzero_durs)
+                _medium_dur = np.median(nonzero_durs) * 2 if len(nonzero_durs) else 2.0
+            return _medium_dur
+
+        def _curr_max_end(start: float, next_idx: int) -> float:
+            max_end = total_dur
+            if next_idx != len(parts):
+                mask = np.flatnonzero(ts_nonzero_mask()[next_idx:])
+                if len(mask):
+                    _part = parts[mask[0]+next_idx]
+                    max_end = _part['start'] or _part['end']
+
+            new_end = round(start + medium_dur(), 3)
+            if new_end > max_end:
+                return max_end
+            return new_end
+
+        for i, part in enumerate(parts, 1):
+            if part['start'] is None:
+                is_first = i == 1
+                if is_first:
+                    new_start = round((part['end'] or 0) - medium_dur(), 3)
+                    part['start'] = max(new_start, 0.0)
+                else:
+                    part['start'] = parts[i - 2]['end']
+            if part['end'] is None:
+                no_next_start = i == len(parts) or parts[i]['start'] is None
+                part['end'] = _curr_max_end(part['start'], i) if no_next_start else parts[i]['start']
+
+    words = [dict(start=word['timestamp'][0], end=word['timestamp'][1], word=word['text']) for word in result]
+    replace_none_ts(words)
+    return WhisperResult([words], force_order=True, check_sorted=True)
+
+
+def fix_timestamp(pipeline_output: List[Dict[str, Any]], audio: np.ndarray, sample_rate: int) -> List[Dict[str, Any]]:
+    result = _fix_timestamp(sample_rate=sample_rate, audio=audio, result=pipeline_output)
+    result.adjust_by_silence(
+        audio,
+        q_levels=20,
+        k_size=5,
+        sample_rate=sample_rate,
+        min_word_dur=None,
+        word_level=True,
+        verbose=True,
+        nonspeech_error=0.1,
+        use_word_position=True
+    )
+    if result.has_words:
+        result.regroup(True)
+    return [{"timestamp": [s.start, s.end], "text": s.text} for s in result.segments]
+
+
+class KotobaWhisperPipeline(AutomaticSpeechRecognitionPipeline):
+
+    def __init__(self,
+                 model: "PreTrainedModel",
+                 feature_extractor: Union["SequenceFeatureExtractor", str] = None,
+                 tokenizer: Optional[PreTrainedTokenizer] = None,
+                 device: Union[int, "torch.device"] = None,
+                 torch_dtype: Optional[Union[str, "torch.dtype"]] = None,
+                 punctuator: bool = True,
+                 stable_ts: bool = False,
+                 **kwargs):
+        self.type = "seq2seq_whisper"
+        self.stable_ts = stable_ts
+        if punctuator:
+            self.punctuator = Punctuator()
+        else:
+            self.punctuator = None
+        super().__init__(
+            model=model,
+            feature_extractor=feature_extractor,
+            tokenizer=tokenizer,
+            device=device,
+            torch_dtype=torch_dtype,
+            **kwargs
+        )
+
+    def preprocess(self, inputs, chunk_length_s=0, stride_length_s=None):
+        if isinstance(inputs, str):
+            if inputs.startswith("http://") or inputs.startswith("https://"):
+                # We need to actually check for a real protocol, otherwise it's impossible to use a local file
+                # like http_huggingface_co.png
+                inputs = requests.get(inputs).content
+            else:
+                with open(inputs, "rb") as f:
+                    inputs = f.read()
+
+        if isinstance(inputs, bytes):
+            inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
+
+        stride = None
+        extra = {}
+        if isinstance(inputs, dict):
+            stride = inputs.pop("stride", None)
+            # Accepting `"array"` which is the key defined in `datasets` for
+            # better integration
+            if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
+                raise ValueError(
+                    "When passing a dictionary to AutomaticSpeechRecognitionPipeline, the dict needs to contain a "
+                    '"raw" key containing the numpy array representing the audio and a "sampling_rate" key, '
+                    "containing the sampling_rate associated with that array"
+                )
+
+            _inputs = inputs.pop("raw", None)
+            if _inputs is None:
+                # Remove path which will not be used from `datasets`.
+                inputs.pop("path", None)
+                _inputs = inputs.pop("array", None)
+            in_sampling_rate = inputs.pop("sampling_rate")
+            extra = inputs
+            inputs = _inputs
+            if in_sampling_rate != self.feature_extractor.sampling_rate:
+                if is_torchaudio_available():
+                    from torchaudio import functional as F
+                else:
+                    raise ImportError(
+                        "torchaudio is required to resample audio samples in AutomaticSpeechRecognitionPipeline. "
+                        "The torchaudio package can be installed through: `pip install torchaudio`."
+                    )
+
+                inputs = F.resample(
+                    torch.from_numpy(inputs), in_sampling_rate, self.feature_extractor.sampling_rate
+                ).numpy()
+                ratio = self.feature_extractor.sampling_rate / in_sampling_rate
+            else:
+                ratio = 1
+            if stride is not None:
+                if stride[0] + stride[1] > inputs.shape[0]:
+                    raise ValueError("Stride is too large for input")
+
+                # Stride needs to get the chunk length here, it's going to get
+                # swallowed by the `feature_extractor` later, and then batching
+                # can add extra data in the inputs, so we need to keep track
+                # of the original length in the stride so we can cut properly.
+                stride = (inputs.shape[0], int(round(stride[0] * ratio)), int(round(stride[1] * ratio)))
+        if not isinstance(inputs, np.ndarray):
+            raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
+        if len(inputs.shape) != 1:
+            raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
+
+        if chunk_length_s:
+            if stride_length_s is None:
+                stride_length_s = chunk_length_s / 6
+
+            if isinstance(stride_length_s, (int, float)):
+                stride_length_s = [stride_length_s, stride_length_s]
+
+            # XXX: Careful, this variable will not exist in the `seq2seq` setting.
+            # Currently chunking is not possible at this level for `seq2seq` so
+            # it's ok.
+            align_to = getattr(self.model.config, "inputs_to_logits_ratio", 1)
+            chunk_len = int(round(chunk_length_s * self.feature_extractor.sampling_rate / align_to) * align_to)
+            stride_left = int(round(stride_length_s[0] * self.feature_extractor.sampling_rate / align_to) * align_to)
+            stride_right = int(round(stride_length_s[1] * self.feature_extractor.sampling_rate / align_to) * align_to)
+
+            if chunk_len < stride_left + stride_right:
+                raise ValueError("Chunk length must be at least the sum of the left and right stride lengths")
+
+            for item in chunk_iter(
+                    inputs, self.feature_extractor, chunk_len, stride_left, stride_right, self.torch_dtype
+            ):
+                item["audio_array"] = inputs
+                yield item
+        else:
+            if inputs.shape[0] > self.feature_extractor.n_samples:
+                processed = self.feature_extractor(
+                    inputs,
+                    sampling_rate=self.feature_extractor.sampling_rate,
+                    truncation=False,
+                    padding="longest",
+                    return_tensors="pt",
+                )
+            else:
+                processed = self.feature_extractor(
+                    inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
+                )
+
+            if self.torch_dtype is not None:
+                processed = processed.to(dtype=self.torch_dtype)
+            if stride is not None:
+                processed["stride"] = stride
+            yield {"is_last": True, "audio_array": inputs, **processed, **extra}
+
+    def _forward(self, model_inputs, return_timestamps=False, **generate_kwargs):
+        attention_mask = model_inputs.pop("attention_mask", None)
+        stride = model_inputs.pop("stride", None)
+        is_last = model_inputs.pop("is_last")
+        audio_array = model_inputs.pop("audio_array")
+        encoder = self.model.get_encoder()
+        # Consume values so we can let extra information flow freely through
+        # the pipeline (important for `partial` in microphone)
+        if type(return_timestamps) is not bool:
+            raise ValueError("return_timestamps should be bool")
+        if "input_features" in model_inputs:
+            inputs = model_inputs.pop("input_features")
+        elif "input_values" in model_inputs:
+            inputs = model_inputs.pop("input_values")
+        else:
+            raise ValueError(
+                "Seq2Seq speech recognition model requires either a "
+                f"`input_features` or `input_values` key, but only has {model_inputs.keys()}"
+            )
+
+        # custom processing for Whisper timestamps and word-level timestamps
+        generate_kwargs["return_timestamps"] = True
+        if inputs.shape[-1] > self.feature_extractor.nb_max_frames:
+            generate_kwargs["input_features"] = inputs
+        else:
+            generate_kwargs["encoder_outputs"] = encoder(inputs, attention_mask=attention_mask)
+
+        tokens = self.model.generate(attention_mask=attention_mask, **generate_kwargs)
+        # whisper longform generation stores timestamps in "segments"
+        out = {"tokens": tokens}
+        if self.type == "seq2seq_whisper":
+            if stride is not None:
+                out["stride"] = stride
+
+        # Leftover
+        extra = model_inputs
+        return {"is_last": is_last, "audio_array": audio_array, **out, **extra}
+
+    def postprocess(self,
+                    model_outputs,
+                    decoder_kwargs: Optional[Dict] = None,
+                    return_timestamps=None,
+                    return_language=None):
+        assert len(model_outputs) > 0
+        for model_output in model_outputs:
+            audio_array = model_output.pop("audio_array")[0]
+        outputs = super().postprocess(
+            model_outputs=model_outputs,
+            decoder_kwargs=decoder_kwargs,
+            return_timestamps=True,
+            return_language=return_language
+        )
+        if self.stable_ts:
+            outputs["chunks"] = fix_timestamp(
+                pipeline_output=outputs["chunks"], audio=audio_array, sample_rate=self.feature_extractor.sampling_rate
+            )
+        if self.punctuator:
+            outputs["chunks"] = self.punctuator.punctuate(outputs["chunks"])
+        outputs["text"] = "".join([c["text"] for c in outputs["chunks"]])
+        if not return_timestamps:
+            outputs.pop("chunks")
+        return outputs
+
--- a/pipeline/push_pipeline.py
+++ b/pipeline/push_pipeline.py
@@ -0,0 +1,29 @@
+from kotoba_whisper import KotobaWhisperPipeline
+from transformers.pipelines import PIPELINE_REGISTRY, pipeline
+from transformers import WhisperForConditionalGeneration, TFWhisperForConditionalGeneration
+
+
+model_alias = "kotoba-tech/kotoba-whisper-v1.1"
+PIPELINE_REGISTRY.register_pipeline(
+    "kotoba-whisper",
+    pipeline_class=KotobaWhisperPipeline,
+    pt_model=WhisperForConditionalGeneration,
+    tf_model=TFWhisperForConditionalGeneration
+)
+pipe = pipeline(
+    task="kotoba-whisper",
+    model="kotoba-tech/kotoba-whisper-v1.0",
+    chunk_length_s=15,
+    batch_size=16,
+    punctuator=True,
+    stable_ts=True,
+)
+pipe.push_to_hub(model_alias)
+pipe = pipeline(model=model_alias,
+                punctuator=True,
+                stable_ts=True,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+
+
--- a/pipeline/test_pipeline.py
+++ b/pipeline/test_pipeline.py
@@ -0,0 +1,154 @@
+from pprint import pprint
+from datasets import load_dataset
+from transformers.pipelines import pipeline
+
+model_alias = "kotoba-tech/kotoba-whisper-v1.1"
+
+print("""### P + S ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=True,
+                stable_ts=True,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        return_timestamps=True,
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### P ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=True,
+                stable_ts=False,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        return_timestamps=True,
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### S ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=False,
+                stable_ts=True,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        return_timestamps=True,
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### RAW ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=False,
+                stable_ts=False,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        return_timestamps=True,
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### P + S ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=True,
+                stable_ts=True,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### P ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=True,
+                stable_ts=False,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### S ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=False,
+                stable_ts=True,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
+print("""### RAW ###""")
+pipe = pipeline(model=model_alias,
+                punctuator=False,
+                stable_ts=False,
+                chunk_length_s=15,
+                batch_size=16,
+                trust_remote_code=True)
+dataset = load_dataset("kotoba-tech/kotoba-whisper-eval", split="train")
+for i in dataset:
+    if i["audio"]["path"] == "long_interview_1.mp3":
+        i["audio"]["array"] = i["audio"]["array"][:7938000]
+    prediction = pipe(
+        i["audio"],
+        generate_kwargs={"language": "japanese", "task": "transcribe"}
+    )
+    pprint(prediction)
+    break
+
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,14 @@
+{
+  "chunk_length": 30,
+  "feature_extractor_type": "WhisperFeatureExtractor",
+  "feature_size": 128,
+  "hop_length": 160,
+  "n_fft": 400,
+  "n_samples": 480000,
+  "nb_max_frames": 3000,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "WhisperProcessor",
+  "return_attention_mask": false,
+  "sampling_rate": 16000
+}
--- a/run_short_form_eval.py
+++ b/run_short_form_eval.py
@@ -0,0 +1,125 @@
+"""Compute CER/WER for Japanese ASR models."""
+import json
+import os
+import argparse
+from pprint import pprint
+
+import torch
+import pandas as pd
+from transformers import pipeline
+from transformers.models.whisper.english_normalizer import BasicTextNormalizer
+from datasets import load_dataset
+from evaluate import load
+
+parser = argparse.ArgumentParser(description='Compute CER/WER for Japanese ASR model.')
+parser.add_argument('-m', '--model', default="kotoba-tech/kotoba-whisper-v1.1", type=str)
+parser.add_argument('-d', '--dataset', default="japanese-asr/ja_asr.jsut_basic5000", type=str)
+parser.add_argument('-a', '--attn', default="sdpa", type=str)
+parser.add_argument('-b', '--batch', default=16, type=int)
+parser.add_argument('-c', '--chunk-length', default=15, type=int)
+parser.add_argument('-o', '--output-dir', default="eval_pipeline", type=str)
+parser.add_argument('-p', '--punctuator', action="store_true")
+parser.add_argument('-s', '--stable-ts', action="store_true")
+parser.add_argument('--pretty-table', action="store_true")
+arg = parser.parse_args()
+
+os.makedirs(arg.output_dir, exist_ok=True)
+output_metric_file = f"{arg.output_dir}/metric.jsonl"
+
+# display mode
+if arg.pretty_table:
+    with open(output_metric_file) as f:
+        metrics = [json.loads(s) for s in f.read().split("\n") if len(s) > 0]
+    df_metric = pd.DataFrame(metrics).round(1).sort_values(["dataset", "model"])
+    df_metric["cer/wer (norm)"] = [f"{c}/{w}" for c, w in zip(df_metric["cer_norm"], df_metric["wer_norm"])]
+    df_metric["cer/wer (raw)"] = [f"{c}/{w}" for c, w in zip(df_metric["cer_raw"], df_metric["wer_raw"])]
+
+    def pretty(m, p, s):
+        if p and s:
+            return f"{m} (punctuator + stable-ts)"
+        if s:
+            return f"{m} (stable-ts)"
+        if p:
+            return f"{m} (punctuator)"
+        return m
+
+    df_metric["model"] = [pretty(m, p, s) for m, p, s in zip(df_metric["model"], df_metric["punctuator"], df_metric["stable_ts"])]
+    df_metric = df_metric[["model", "dataset", "punctuator", "stable_ts", "cer/wer (raw)", "cer/wer (norm)"]]
+    print(df_metric)
+    df_metric = df_metric.drop_duplicates()
+    print("\nNORM")
+    print(df_metric.pivot(values="cer/wer (norm)", columns="dataset", index="model").to_markdown())
+    print("\nRAW")
+    print(df_metric.pivot(values="cer/wer (raw)", columns="dataset", index="model").to_markdown())
+    raise SystemExit(0)
+
+# model config
+torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+model_kwargs = {"attn_implementation": arg.attn} if torch.cuda.is_available() and arg.attn else {}
+generate_kwargs = {"language": "japanese", "task": "transcribe"}
+pipeline_config = dict(
+    model=arg.model,
+    torch_dtype=torch_dtype,
+    device=device,
+    model_kwargs=model_kwargs,
+    chunk_length_s=arg.chunk_length,
+    batch_size=arg.batch
+)
+
+# instantiate pipeline
+metric = {"model": arg.model, "dataset": arg.dataset, "chunk_length_s": arg.chunk_length}
+if arg.model in ["kotoba-tech/kotoba-whisper-v1.1"]:
+    pipe = pipeline(trust_remote_code=True, punctuator=arg.punctuator, stable_ts=arg.stable_ts, **pipeline_config)
+    stable_ts, punctuator = arg.stable_ts, arg.punctuator
+else:
+    pipe = pipeline("automatic-speech-recognition", **pipeline_config)
+    stable_ts, punctuator = None, None
+metric.update({"punctuator": punctuator, "stable_ts": stable_ts})
+
+# load the dataset and get predictions
+dataset = load_dataset(arg.dataset, split="test")
+output = pipe(dataset['audio'], generate_kwargs=generate_kwargs)
+normalizer = BasicTextNormalizer()
+prediction_norm = [normalizer(i['text']).replace(" ", "") for i in output]
+references_norm = [normalizer(i).replace(" ", "") for i in dataset['transcription']]
+prediction_raw = [i['text'].replace(" ", "") for i in output]
+references_raw = [i.replace(" ", "") for i in dataset['transcription']]
+
+# compute metrics (whitespace was stripped above, so WER effectively scores each utterance as a single token)
+cer_metric = load("cer")
+cer_norm = 100 * cer_metric.compute(predictions=prediction_norm, references=references_norm)
+cer_raw = 100 * cer_metric.compute(predictions=prediction_raw, references=references_raw)
+wer_metric = load("wer")
+wer_norm = 100 * wer_metric.compute(predictions=prediction_norm, references=references_norm)
+wer_raw = 100 * wer_metric.compute(predictions=prediction_raw, references=references_raw)
+metric.update({"cer_raw": cer_raw, "wer_raw": wer_raw, "cer_norm": cer_norm, "wer_norm": wer_norm})
+
+# save the results
+metrics = []
+if os.path.exists(output_metric_file):
+    with open(output_metric_file) as f:
+        metrics += [json.loads(s) for s in f.read().split("\n") if len(s) > 0]
+output_prediction_file = f"{arg.output_dir}/prediction.csv"
+dfs = None
+if os.path.exists(output_prediction_file):
+    dfs = pd.read_csv(output_prediction_file)  # file is written with index=False, so no column should be read as the index
+metrics.append(metric)
+pprint(metrics)
+with open(output_metric_file, "w") as f:
+    f.write("\n".join([json.dumps(s) for s in metrics]))
+
+# save prediction
+audio_id = [i["path"] for i in dataset['audio']]
+df = pd.DataFrame(
+    [audio_id, references_norm, prediction_norm, references_raw, prediction_raw],
+    index=["id", "reference_norm", "prediction_norm", "reference_raw", "prediction_raw"]
+).T
+df["model"] = arg.model
+df["dataset"] = arg.dataset
+df["stable_ts"] = stable_ts
+df["punctuator"] = punctuator
+df["chunk_length_s"] = arg.chunk_length
+dfs = df if dfs is None else pd.concat([dfs, df])
+dfs.to_csv(output_prediction_file, index=False)
+
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,139 @@
+{
+  "additional_special_tokens": [
+    "<|startoftranscript|>",
+    "<|en|>",
+    "<|zh|>",
+    "<|de|>",
+    "<|es|>",
+    "<|ru|>",
+    "<|ko|>",
+    "<|fr|>",
+    "<|ja|>",
+    "<|pt|>",
+    "<|tr|>",
+    "<|pl|>",
+    "<|ca|>",
+    "<|nl|>",
+    "<|ar|>",
+    "<|sv|>",
+    "<|it|>",
+    "<|id|>",
+    "<|hi|>",
+    "<|fi|>",
+    "<|vi|>",
+    "<|he|>",
+    "<|uk|>",
+    "<|el|>",
+    "<|ms|>",
+    "<|cs|>",
+    "<|ro|>",
+    "<|da|>",
+    "<|hu|>",
+    "<|ta|>",
+    "<|no|>",
+    "<|th|>",
+    "<|ur|>",
+    "<|hr|>",
+    "<|bg|>",
+    "<|lt|>",
+    "<|la|>",
+    "<|mi|>",
+    "<|ml|>",
+    "<|cy|>",
+    "<|sk|>",
+    "<|te|>",
+    "<|fa|>",
+    "<|lv|>",
+    "<|bn|>",
+    "<|sr|>",
+    "<|az|>",
+    "<|sl|>",
+    "<|kn|>",
+    "<|et|>",
+    "<|mk|>",
+    "<|br|>",
+    "<|eu|>",
+    "<|is|>",
+    "<|hy|>",
+    "<|ne|>",
+    "<|mn|>",
+    "<|bs|>",
+    "<|kk|>",
+    "<|sq|>",
+    "<|sw|>",
+    "<|gl|>",
+    "<|mr|>",
+    "<|pa|>",
+    "<|si|>",
+    "<|km|>",
+    "<|sn|>",
+    "<|yo|>",
+    "<|so|>",
+    "<|af|>",
+    "<|oc|>",
+    "<|ka|>",
+    "<|be|>",
+    "<|tg|>",
+    "<|sd|>",
+    "<|gu|>",
+    "<|am|>",
+    "<|yi|>",
+    "<|lo|>",
+    "<|uz|>",
+    "<|fo|>",
+    "<|ht|>",
+    "<|ps|>",
+    "<|tk|>",
+    "<|nn|>",
+    "<|mt|>",
+    "<|sa|>",
+    "<|lb|>",
+    "<|my|>",
+    "<|bo|>",
+    "<|tl|>",
+    "<|mg|>",
+    "<|as|>",
+    "<|tt|>",
+    "<|haw|>",
+    "<|ln|>",
+    "<|ha|>",
+    "<|ba|>",
+    "<|jw|>",
+    "<|su|>",
+    "<|yue|>",
+    "<|translate|>",
+    "<|transcribe|>",
+    "<|startoflm|>",
+    "<|startofprev|>",
+    "<|nospeech|>",
+    "<|notimestamps|>"
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
--- a/vocab.json
+++ b/vocab.json