---
library_name: transformers
tags: []
---

# Malaysian-TTS-1.7B-v1

Continued pretraining of [mesolitica/Malaysian-TTS-1.7B-v0.1](https://huggingface.co/mesolitica/Malaysian-TTS-1.7B-v0.1) on a much more consistent dataset:

1. Uses [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) as the speech detokenizer, with output at a 24 kHz sample rate.
2. Supports context switching between Malay and English.
3. Better pronunciation of individual letters.
4. Better tolerance for repetitive text.

## Speakers

1. [husein](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2)
2. [idayu](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2)
3. [singaporean](https://huggingface.co/datasets/mesolitica/IMDA-TTS)
4. [DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
5. [singlish-speaker2050](https://huggingface.co/datasets/thucdangvan020999/singlish-speaker2050)
6. [singlish-speaker2202](https://huggingface.co/datasets/thucdangvan020999/singlish-speaker2202)
7. [haqkiem](https://www.linkedin.com/in/haqkiem-daim/), private dataset.

## How do we train

1. Multipacking with proper document masking at a 4096-token context length.
2. FP32-BF16 mixed precision training.
3. Full parameter finetuning.
4. WandB at https://wandb.ai/huseinzol05/Malaysian-TTS-1.7B-v1
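
To illustrate what document masking means when several samples are packed into one sequence, here is a minimal sketch in plain Python. It is not the training code (see the source code link below for that); the helper names `document_ids` and `packed_causal_mask` are made up for this example. The idea is that a token may only attend to earlier tokens from the same packed document.

```python
from typing import List


def document_ids(doc_lengths: List[int]) -> List[int]:
    """Label every token position with the index of the document it belongs to."""
    ids: List[int] = []
    for doc_index, length in enumerate(doc_lengths):
        ids.extend([doc_index] * length)
    return ids


def packed_causal_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Attention is causal (k <= q) and restricted to tokens of the same document,
    so packed documents cannot see each other.
    """
    ids = document_ids(doc_lengths)
    n = len(ids)
    return [[k <= q and ids[q] == ids[k] for k in range(n)] for q in range(n)]


# Two documents of 3 and 2 tokens packed into a single 5-token sequence.
mask = packed_causal_mask([3, 2])
for row in mask:
    print(''.join('x' if allowed else '.' for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

In practice the same block-diagonal structure is usually passed to the attention kernel as cumulative sequence lengths rather than as a dense mask, but the dense form is easier to inspect.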

## How to use

1. First install DistilCodec,

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```

2. Load the models,

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path='model_config.json'
codec_ckpt_path = 'g_00204000'

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1', torch_dtype = 'auto').cuda()
```

3. Generate,

```python
import soundfile as sf
import re
from tqdm import tqdm

speakers = [
    'husein',
    'idayu',
    'singaporean',
    'DisfluencySpeech',
    'singlish-speaker2050',
    'singlish-speaker2202',
    'haqkiem',
]

string = 'IC saya adalah, sembilan enam, kosong tiga, satu empat, one, one, one, one, A, B, C, D, D, yes, Husein is very cute, cute, cute.'

for s in tqdm(speakers):

    left = s +': ' + string
    prompt = f'<|im_start|>{left}<|speech_start|>'

    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors = 'pt', add_special_tokens = False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )
    generation_output = model.generate(**generate_kwargs)
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False
    )
    sf.write(f'{s}.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```

Output,

1. [husein-v1.mp3](husein-v1.mp3)
2. [idayu-v1.mp3](idayu-v1.mp3)
3. [singaporean-v1.mp3](singaporean-v1.mp3)
4. [DisfluencySpeech-v1.mp3](DisfluencySpeech-v1.mp3)
5. [singlish-speaker2050-v1.mp3](singlish-speaker2050-v1.mp3)
6. [singlish-speaker2202-v1.mp3](singlish-speaker2202-v1.mp3)
7. [haqkiem-v1.mp3](haqkiem-v1.mp3)

**Only `singlish-speaker2202` and `haqkiem` had to be generated twice to get output that follows the input text exactly**.
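
The prompt format and the token parsing used in the generation loop above can be pulled out into two small helpers, which makes them easy to check without loading the model. This is only a sketch: `build_prompt` and `extract_codes` are hypothetical names, and the sample decoded string is invented for illustration. The exact spelling of the speech tokens is an assumption beyond the `speech_<number>` pattern that the regex relies on.

```python
import re
from typing import List


def build_prompt(speaker: str, text: str) -> str:
    """Format a prompt the same way as the generation loop: '<speaker>: <text>'."""
    return f'<|im_start|>{speaker}: {text}<|speech_start|>'


def extract_codes(decoded: str) -> List[int]:
    """Keep only what follows <|speech_start|> and collect the speech code ids."""
    speech_part = decoded.split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    return [int(number) for number in re.findall(r'speech_(\d+)', speech_part)]


prompt = build_prompt('husein', 'Husein is very cute.')

# Hypothetical decoded output: the prompt followed by two made-up speech tokens.
decoded = prompt + '<|speech_12|><|speech_7|><|endoftext|>'
codes = extract_codes(decoded)
print(codes)  # [12, 7]
```

The resulting list of integers is what the loop passes to `codec.decode_from_codes`.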

## Limitation

1. This model was trained on normalized text, so text such as `123` must first be normalized to `one two three`, `one hundred twenty three`, `satu dua tiga`, or `seratus dua puluh tiga`. Feel free to use Malaya for normalization; it supports both Malay and English. Read more at https://github.com/mesolitica/malaya/issues/247#issuecomment-3030313021
2. The repetitive pronunciation dataset does not consistently use commas for pauses. For example, `A, A, A, A, B, B` in our recordings is spoken as `A A A A B B`. We do not plan to improve this because of the cost, but continued finetuning on a proper dataset should be able to solve it.
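
As a rough illustration of the digit-by-digit form of normalization described in the first limitation, here is a minimal standalone sketch. It is not Malaya's API and does not handle cardinal readings such as `one hundred twenty three`; the function name `spell_digits` and the language keys `'en'` and `'ms'` are made up for this example.

```python
import re

# Digit words, indexed by digit value.
DIGIT_WORDS = {
    'en': ['zero', 'one', 'two', 'three', 'four',
           'five', 'six', 'seven', 'eight', 'nine'],
    'ms': ['kosong', 'satu', 'dua', 'tiga', 'empat',
           'lima', 'enam', 'tujuh', 'lapan', 'sembilan'],
}


def spell_digits(text: str, lang: str = 'ms') -> str:
    """Replace every run of ASCII digits with its digits spelled out one by one."""
    words = DIGIT_WORDS[lang]

    def replace(match: re.Match) -> str:
        return ' '.join(words[int(ch)] for ch in match.group())

    return re.sub(r'[0-9]+', replace, text)


print(spell_digits('123', lang='en'))  # one two three
print(spell_digits('123', lang='ms'))  # satu dua tiga
```

For anything beyond plain digit strings (currency, dates, cardinals), a full normalizer such as the one linked above is the better choice.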

## Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for 1x H100!