Support streaming zipformer CTC (#496)

* Support streaming zipformer CTC
* test online zipformer2 CTC
* Update doc of sherpa-onnx.cc
* Add Python APIs for streaming zipformer2 ctc
* Add Python API examples for streaming zipformer2 ctc
* Swift API for streaming zipformer2 CTC
* NodeJS API for streaming zipformer2 CTC
* Kotlin API for streaming zipformer2 CTC
* Golang API for streaming zipformer2 CTC
* C# API for streaming zipformer2 CTC
* Release v1.9.6
2023-12-22 13:46:33 +08:00
parent 7634f5f034
commit e475e750ac
70 changed files with 1517 additions and 211 deletions
--- a/nodejs-examples/README.md
+++ b/nodejs-examples/README.md
@@ -85,7 +85,7 @@ npm install wav naudiodon2
 how to decode a file with a NeMo CTC model. In the code we use
 [stt_en_conformer_ctc_small](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-ctc/nemo/english.html#stt-en-conformer-ctc-small).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-ctc-en-conformer-small.tar.bz2
@@ -99,7 +99,7 @@ node ./test-offline-nemo-ctc.js
 how to decode a file with a non-streaming Paraformer model. In the code we use
 [sherpa-onnx-paraformer-zh-2023-03-28](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-paraformer/paraformer-models.html#csukuangfj-sherpa-onnx-paraformer-zh-2023-03-28-chinese).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-03-28.tar.bz2
@@ -113,7 +113,7 @@ node ./test-offline-paraformer.js
 how to decode a file with a non-streaming transducer model. In the code we use
 [sherpa-onnx-zipformer-en-2023-06-26](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-zipformer-en-2023-06-26-english).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
@@ -126,7 +126,7 @@ node ./test-offline-transducer.js
 how to decode a file with a Whisper model. In the code we use
 [sherpa-onnx-whisper-tiny.en](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/tiny.en.html).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
@@ -140,7 +140,7 @@ demonstrates how to do real-time speech recognition from microphone
 with a streaming Paraformer model. In the code we use
 [sherpa-onnx-streaming-paraformer-bilingual-zh-en](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-paraformer/paraformer-models.html#csukuangfj-sherpa-onnx-streaming-paraformer-bilingual-zh-en-chinese-english).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
@@ -153,7 +153,7 @@ node ./test-online-paraformer-microphone.js
 how to decode a file using a streaming Paraformer model. In the code we use
 [sherpa-onnx-streaming-paraformer-bilingual-zh-en](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-paraformer/paraformer-models.html#csukuangfj-sherpa-onnx-streaming-paraformer-bilingual-zh-en-chinese-english).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
@@ -167,7 +167,7 @@ demonstrates how to do real-time speech recognition with microphone using a stre
 we use [sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-bilingual-chinese-english).


-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
@@ -180,7 +180,7 @@ node ./test-online-transducer-microphone.js
 how to decode a file using a streaming transducer model. In the code
 we use [sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20-bilingual-chinese-english).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
@@ -188,13 +188,26 @@ tar xvf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
 node ./test-online-transducer.js
 ```

+## ./test-online-zipformer2-ctc.js
+[./test-online-zipformer2-ctc.js](./test-online-zipformer2-ctc.js) demonstrates
+how to decode a file using a streaming zipformer2 CTC model. In the code
+we use [sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-ctc/zipformer-ctc-models.html#sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13-chinese).
+
+You can use the following command to run it:
+
+```bash
+wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13.tar.bz2
+tar xvf sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13.tar.bz2
+node ./test-online-zipformer2-ctc.js
+```
+
 ## ./test-vad-microphone-offline-paraformer.js

 [./test-vad-microphone-offline-paraformer.js](./test-vad-microphone-offline-paraformer.js)
 demonstrates how to use [silero-vad](https://github.com/snakers4/silero-vad)
 with non-streaming Paraformer for speech recognition from microphone.

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
@@ -209,7 +222,7 @@ node ./test-vad-microphone-offline-paraformer.js
 demonstrates how to use [silero-vad](https://github.com/snakers4/silero-vad)
 with a non-streaming transducer model for speech recognition from microphone.

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
@@ -224,7 +237,7 @@ node ./test-vad-microphone-offline-transducer.js
 demonstrates how to use [silero-vad](https://github.com/snakers4/silero-vad)
 with whisper for speech recognition from microphone.

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
@@ -238,7 +251,7 @@ node ./test-vad-microphone-offline-whisper.js
 [./test-vad-microphone.js](./test-vad-microphone.js)
 demonstrates how to use [silero-vad](https://github.com/snakers4/silero-vad).

-You can use the following command run it:
+You can use the following command to run it:

 ```bash
 wget -q https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
--- a/nodejs-examples/test-online-zipformer2-ctc.js
+++ b/nodejs-examples/test-online-zipformer2-ctc.js
@@ -0,0 +1,97 @@
+// Copyright (c)  2023  Xiaomi Corporation (authors: Fangjun Kuang)
+//
+const fs = require('fs');
+const {Readable} = require('stream');
+const wav = require('wav');
+
+const sherpa_onnx = require('sherpa-onnx');
+
+function createRecognizer() {
+  const featConfig = new sherpa_onnx.FeatureConfig();
+  featConfig.sampleRate = 16000;
+  featConfig.featureDim = 80;
+
+  // test online recognizer
+  const zipformer2Ctc = new sherpa_onnx.OnlineZipformer2CtcModelConfig();
+  zipformer2Ctc.model =
+      './sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/ctc-epoch-20-avg-1-chunk-16-left-128.onnx';
+  const tokens =
+      './sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/tokens.txt';
+
+  const modelConfig = new sherpa_onnx.OnlineModelConfig();
+  modelConfig.zipformer2Ctc = zipformer2Ctc;
+  modelConfig.tokens = tokens;
+
+  const recognizerConfig = new sherpa_onnx.OnlineRecognizerConfig();
+  recognizerConfig.featConfig = featConfig;
+  recognizerConfig.modelConfig = modelConfig;
+  recognizerConfig.decodingMethod = 'greedy_search';
+
+  const recognizer = new sherpa_onnx.OnlineRecognizer(recognizerConfig);
+  return recognizer;
+}
+const recognizer = createRecognizer();
+const stream = recognizer.createStream();
+
+const waveFilename =
+    './sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13/test_wavs/DEV_T0000000000.wav';
+
+const reader = new wav.Reader();
+const readable = new Readable().wrap(reader);
+
+function decode(samples) {
+  stream.acceptWaveform(recognizer.config.featConfig.sampleRate, samples);
+
+  while (recognizer.isReady(stream)) {
+    recognizer.decode(stream);
+  }
+  const r = recognizer.getResult(stream);
+  console.log(r.text);
+}
+
+reader.on('format', ({audioFormat, bitDepth, channels, sampleRate}) => {
+  if (sampleRate != recognizer.config.featConfig.sampleRate) {
+    throw new Error(`Only support sampleRate ${
+        recognizer.config.featConfig.sampleRate}. Given ${sampleRate}`);
+  }
+
+  if (audioFormat != 1) {
+    throw new Error(`Only support PCM format. Given ${audioFormat}`);
+  }
+
+  if (channels != 1) {
+    throw new Error(`Only a single channel. Given ${channels}`);
+  }
+
+  if (bitDepth != 16) {
+    throw new Error(`Only support 16-bit samples. Given ${bitDepth}`);
+  }
+});
+
+fs.createReadStream(waveFilename, {'highWaterMark': 4096})
+    .pipe(reader)
+    .on('finish', function(err) {
+      // tail padding
+      const floatSamples =
+          new Float32Array(recognizer.config.featConfig.sampleRate * 0.5);
+      decode(floatSamples);
+      stream.free();
+      recognizer.free();
+    });
+
+readable.on('readable', function() {
+  let chunk;
+  while ((chunk = readable.read()) != null) {
+    const int16Samples = new Int16Array(
+        chunk.buffer, chunk.byteOffset,
+        chunk.length / Int16Array.BYTES_PER_ELEMENT);
+
+    const floatSamples = new Float32Array(int16Samples.length);
+
+    for (let i = 0; i < floatSamples.length; i++) {
+      floatSamples[i] = int16Samples[i] / 32768.0;
+    }
+
+    decode(floatSamples);
+  }
+});
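
The `readable` handler above turns each chunk of 16-bit PCM into normalized `Float32Array` samples, and the `finish` handler appends 0.5 seconds of zeros as tail padding. A minimal standalone sketch of those two steps follows, assuming little-endian mono 16-bit PCM input. The helper names `pcm16ToFloat32` and `tailPadding` are hypothetical and are not part of sherpa-onnx; the sketch reads samples with `Buffer.readInt16LE` instead of creating an `Int16Array` view, which is a substitution rather than what the example file does.

```javascript
// Hypothetical helper (not part of sherpa-onnx): convert a Buffer holding
// little-endian 16-bit PCM into Float32 samples in the range [-1, 1).
function pcm16ToFloat32(chunk) {
  // Ignore a trailing odd byte, if any; each sample is 2 bytes.
  const sampleCount = Math.floor(chunk.length / 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // readInt16LE works at any byte offset, so no 2-byte alignment is needed.
    out[i] = chunk.readInt16LE(i * 2) / 32768.0;
  }
  return out;
}

// Hypothetical helper: silence of the given duration, used as tail padding
// so that the last frames of the real audio can still be decoded.
function tailPadding(sampleRate, seconds) {
  // Float32Array is zero-initialized, so this is all-zero audio.
  return new Float32Array(Math.round(sampleRate * seconds));
}
```

One reason to prefer the byte-wise read in a sketch like this: an `Int16Array` view requires its byte offset to be a multiple of 2, whereas `readInt16LE` has no such constraint. Under the assumptions above, the result of `pcm16ToFloat32(chunk)` could be passed to the `decode(samples)` function in the example file.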