---
language:
- en
license: mit
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-generation
- fine-tuned
- lora
- gguf
- speech-to-text
- text-cleanup
- unsloth
- qwen2
pipeline_tag: text-generation
datasets:
- Abdullahu5mani/flowscribe-dataset
---

# FlowScribe — Qwen2.5-0.5B Speech Transcript Formatter

A fine-tuned version of [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) that converts raw, messy speech-to-text output into clean, formatted text across multiple writing styles.

**GitHub:** [github.com/Abdullahu5mani/flowscribe](https://github.com/Abdullahu5mani/flowscribe)

---

## The Problem

Voice dictation tools like Whisper produce transcripts that are full of filler words (`um`, `uh`, `like`) and self-corrections (`make it 5... no wait, 6`), with no punctuation or formatting. This model post-processes those transcripts into polished text, with awareness of the desired output style.

---

## Styles

| Style | Behavior |
|---|---|
| `Auto` | Intelligent default — removes fillers, fixes grammar, handles self-corrections, applies structure |
| `Professional` | Formal business tone, structured layout, perfect grammar |
| `Casual` | Keeps the speaker's voice, light cleanup, contractions preserved |
| `Verbatim` | Preserves exact wording, only strips `um`/`uh` and applies spoken formatting commands |
| `Software_Dev` | Formats code terms, variable names (`camelCase`, `snake_case`), technical jargon |
| `Enthusiastic` | High energy, exclamation marks, positive phrasing |

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Abdullahu5mani/flowscribe-qwen2.5-0.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def format_transcript(raw_text, style="Auto"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that transcribes and formats text based on a specific style instruction."
        },
        {
            "role": "user",
            "content": f"Transcribe and format this with style: {style}\nInput: {raw_text}"
        }
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    output_ids = outputs[0][len(inputs.input_ids[0]):]
    return tokenizer.decode(output_ids, skip_special_tokens=True)

# Examples
print(format_transcript(
    "um so the meeting is at 5... no wait make it 6 and uh we need to discuss the q3 budget",
    style="Professional"
))
# → "The meeting is at 6 PM to discuss the Q3 budget."

print(format_transcript(
    "the api endpoint is slash api slash users new line it takes a POST request with JSON",
    style="Software_Dev"
))
# → "The API endpoint is `/api/users`\nIt takes a POST request with JSON."
```
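The same helper can also run one transcript through every style in the table above for a side-by-side comparison. A quick sketch, assuming the model and the `format_transcript` helper from the snippet above are already loaded (outputs will vary with the input):

```python
# Compare all six styles on the same raw transcript.
# Assumes `format_transcript` from the snippet above is in scope.
raw = "um so the meeting is at 5... no wait make it 6 and uh we need to discuss the q3 budget"

for style in ["Auto", "Professional", "Casual", "Verbatim", "Software_Dev", "Enthusiastic"]:
    print(f"--- {style} ---")
    print(format_transcript(raw, style=style))
```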
---

## GGUF (Quantized) Usage

A Q4_K_M quantized GGUF version is included in this repository for fast CPU/GPU inference via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model_q4_k_m.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,  # Set to 0 for CPU-only
    verbose=False
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that transcribes and formats text based on a specific style instruction."
        },
        {
            "role": "user",
            "content": "Transcribe and format this with style: Casual\nInput: hey um so i was thinking we could like grab lunch tomorrow you know around noon ish"
        }
    ],
    max_tokens=256,
    temperature=0.1,
)

print(response["choices"][0]["message"]["content"])
# → "Hey, I was thinking we could grab lunch tomorrow around noon."
```

---

## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Fine-tuning method | LoRA (via [Unsloth](https://github.com/unslothai/unsloth)) |
| Parameters | ~500M |
| Training epochs | 3 |
| Learning rate | 2e-5 |
| Effective batch size | 16 (batch 2 × grad accumulation 8) |
| Sequence length | 2048 |
| Optimizer | AdamW 8-bit |
| Training hardware | NVIDIA RTX 4070 (8 GB VRAM) |
| Chat template | ChatML |
| Quantization | Q4_K_M (via llama.cpp) |

---

## Training Data

Trained on ~19,800 synthetically generated examples from [flowscribe-dataset](https://huggingface.co/datasets/Abdullahu5mani/flowscribe-dataset). Each example is an Alpaca-style JSON object:

```json
{
  "instruction": "Transcribe and format this with style: Professional",
  "input": "um so like the uh proposal is due friday and we need to finalize the, i mean confirm the budget",
  "output": "The proposal is due Friday and we need to confirm the budget."
}
```

Data was generated using Google Gemini (primary) and 16 free OpenRouter models (fallback) across 10 domain scenarios: business email, software dev, personal messages, productivity lists, medical notes, and more.

---

## Limitations

- Optimized for English only
- Training data is synthetic — real-world dictation edge cases may vary
- The 0.5B parameter size prioritizes speed and local deployment over raw capability
- Dataset reached ~19.8K examples (target was 50K); further training on more data would improve robustness

---

## Files

| File | Description |
|---|---|
| `model.safetensors` | Full-precision fine-tuned weights |
| `model_q4_k_m.gguf` | Q4_K_M quantized GGUF for llama.cpp |
| `config.json` | Model configuration |
| `tokenizer.json` | Tokenizer |
| `chat_template.jinja` | ChatML chat template |

---

## License

MIT — see [LICENSE](https://github.com/Abdullahu5mani/flowscribe/blob/main/LICENSE)
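---

## Fine-Tuning Sketch

For anyone looking to reproduce or extend the fine-tune, the sketch below shows what the LoRA recipe from Model Details might look like with Unsloth. The epochs, learning rate, batch sizes, sequence length, and optimizer come from the table above; the LoRA rank, alpha, and target modules are illustrative assumptions (they are not documented here), and the exact `SFTTrainer` signature varies across `trl` versions.

```python
# Minimal fine-tuning sketch. Epochs, LR, batch sizes, sequence length,
# and optimizer come from the Model Details table; the LoRA rank, alpha,
# and target modules are ASSUMPTIONS, not documented values.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    max_seq_length=2048,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank (assumption)
    lora_alpha=16,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

def to_chatml(example):
    # Map an Alpaca-style record (see Training Data) onto the same
    # ChatML prompt layout used at inference time.
    messages = [
        {"role": "system", "content": "You are a helpful assistant that transcribes and formats text based on a specific style instruction."},
        {"role": "user", "content": f"{example['instruction']}\nInput: {example['input']}"},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("Abdullahu5mani/flowscribe-dataset", split="train").map(to_chatml)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch size 16
        num_train_epochs=3,
        learning_rate=2e-5,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()
```

The `to_chatml` step mirrors the prompt format in the Usage section, so training and inference see the same ChatML layout.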