---
library_name: transformers
license: gpl-3.0
language:
- as
- bn
- brx
- doi
- gom
- gu
- en
- hi
- kn
- ks
- mai
- ml
- mni
- mr
- ne
- or
- pa
- sa
- sat
- sd
- ta
- te
- ur
base_model:
- google/gemma-3-4b-it
base_model_relation: finetune
pipeline_tag: translation
---

# sarvam-translate GGUF Models

## Model Generation Details

This model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`1f63e75f`](https://github.com/ggerganov/llama.cpp/commit/1f63e75f3b5dc7f44dbe63c8a41d23958fe95bc0).

## Quantization beyond the IMatrix

I am testing a new quantization method that uses rules to bump important layers above what the standard imatrix would use.

I have found that the standard IMatrix does not perform very well at low-bit quantization or for MoE models, so I am using the llama.cpp `--tensor-type` option to bump up selected layers. See [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py).

This does create larger model files, but it increases precision for a given model size.

### **Please provide feedback on how you find this method performs**

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**

- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **dynamic range similar to FP32** but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**

- ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
- ✔ You want **higher precision** while saving memory.
- ✔ You plan to **requantize** the model into another format.
📌 **Avoid BF16 if:**

- ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
- ❌ You need compatibility with older devices that lack BF16 optimization.

---

### **F16 (Float 16) – More widely supported than BF16**

- A 16-bit floating-point format with **high precision** but a smaller range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Has more mantissa bits than BF16 (10 vs. 7), so it is slightly more precise, but its narrower range can overflow on very large values. It is generally sufficient for inference.

📌 **Use F16 if:**

- ✔ Your hardware supports **FP16** but **not BF16**.
- ✔ You need a **balance between speed, memory usage, and accuracy**.
- ✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**

- ❌ Your device lacks **native FP16 support** (it may run slower than expected).
- ❌ You have memory limitations.

---

### **Hybrid Precision Models (e.g., `bf16_q8_0`, `f16_q4_K`) – Best of Both Worlds**

These formats selectively **quantize non-essential layers** while keeping **key layers in full precision** (e.g., attention and output layers).

- Named like `bf16_q8_0` (meaning **full-precision BF16 core layers + quantized Q8_0 other layers**).
- Strike a **balance between memory efficiency and accuracy**, improving over fully quantized models without requiring the full memory of BF16/F16.

📌 **Use Hybrid Models if:**

- ✔ You need **better accuracy than quant-only models** but can't afford full BF16/F16 everywhere.
- ✔ Your device supports **mixed-precision inference**.
- ✔ You want to **optimize trade-offs** for production-grade models on constrained hardware.

📌 **Avoid Hybrid Models if:**

- ❌ Your target device doesn't support **mixed or full-precision acceleration**.
- ❌ You are operating under **ultra-strict memory limits** (in which case use fully quantized formats).

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.

📌 **Use Quantized Models if:**

- ✔ You are running inference on a **CPU** and need an optimized model.
- ✔ Your device has **low VRAM** and cannot load full-precision models.
- ✔ You want to reduce the **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**

- ❌ You need **maximum accuracy** (full-precision models are better for this).
- ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**

These models are optimized for **very high memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **very high memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
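As a rough aid to the memory trade-offs described in the sections above, the following sketch estimates the weight size for a few formats and picks the highest-precision one that fits a memory budget. The bits-per-weight figures reflect the llama.cpp block layouts for each tensor type; the names `BITS_PER_WEIGHT`, `estimate_size_gb`, and `largest_format_that_fits` are hypothetical helpers, and `N_PARAMS = 4e9` is an assumed round parameter count, not a measured one. Real GGUF files mix tensor types (especially with layer bumping), so actual file sizes will differ.

```python
# Approximate bits per weight for a few GGUF tensor types (weights plus block scales).
# Real files mix tensor types, so treat the results as rough estimates only.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K": 4.5,
    "Q4_0": 4.5,
}


def estimate_size_gb(n_params: float, fmt: str) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def largest_format_that_fits(n_params: float, budget_gb: float):
    """Return the highest-precision format whose weights fit in budget_gb, or None."""
    by_precision = sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True)
    for fmt in by_precision:
        if estimate_size_gb(n_params, fmt) <= budget_gb:
            return fmt
    return None


if __name__ == "__main__":
    N_PARAMS = 4e9  # assumption: illustrative round figure for a ~4B-parameter model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>5}: ~{estimate_size_gb(N_PARAMS, fmt):.2f} GB")
    print("Fits in 5 GB:", largest_format_that_fits(N_PARAMS, 5.0))
```

The estimate covers weights only. The KV cache and activations need additional memory at runtime, so leave headroom when comparing against available RAM or VRAM.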
### **Ultra Low-Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)**

- Ultra-low-bit quantization (1-2 bit) with **extreme memory efficiency**.
  - **Use case**: Best for cases where you have to fit the model into very constrained memory.
  - **Trade-off**: Very low accuracy. May not function as expected. Please test fully before using.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Very High | High | BF16-supported GPU/CPU | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported GPU/CPU | Inference when BF16 isn't available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Memory-constrained inference |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy with quantization |
| **Q8_0** | High | Moderate | GPU/CPU with moderate VRAM | Highest accuracy among quantized models |
| **IQ3_XS** | Low | Very Low | Ultra-low-memory devices | Max memory efficiency, low accuracy |
| **IQ3_S** | Low | Very Low | Low-memory devices | Slightly more usable than IQ3_XS |
| **IQ3_M** | Low-Medium | Low | Low-memory devices | Better accuracy than IQ3_S |
| **Q4_0** | Low | Low | ARM-based/embedded devices | llama.cpp automatically optimizes for ARM inference |
| **Ultra Low-Bit (IQ1/2_*)** | Very Low | Extremely Low | Tiny edge/embedded devices | Fit models in extremely tight memory; low accuracy |
| **Hybrid (e.g., `bf16_q8_0`)** | Medium–High | Medium | Mixed-precision capable hardware | Balanced performance and memory, near-FP accuracy in critical layers |

---

# Sarvam-Translate

Sarvam-Translate is an advanced translation model from Sarvam AI, specifically designed for comprehensive, document-level translation across the 22 official Indian languages, built on Gemma3-4B-IT. It addresses modern translation needs by moving beyond isolated sentences to handle long-context inputs, diverse content types, and various formats. Sarvam-Translate aims to provide high-quality, contextually aware translations for Indian languages, which have traditionally lagged behind high-resource languages in LLM performance.

Learn more about Sarvam-Translate in our detailed [blog post](https://www.sarvam.ai/blogs/sarvam-translate).

## Key Features

- **Comprehensive Indian Language Support**: Focus on the 22 official Indian languages, ensuring nuanced and accurate translations.
- **Advanced Document-Level Translation**: Translates entire documents, web pages, speeches, textbooks, and scientific articles, not just isolated sentences.
- **Versatile Format Handling**: Processes a wide array of input formats, including markdown, digitized content (handling OCR errors), documents with embedded math and chemistry equations, and code files (translating only comments).
- **Context-Aware & Inclusive**: Engineered to respect different contexts, formats, styles (formal/informal), and ensure inclusivity (e.g., appropriate gender attribution).

## Supported languages list

`Assamese`, `Bengali`, `Bodo`, `Dogri`, `Gujarati`, `English`, `Hindi`, `Kannada`, `Kashmiri`, `Konkani`, `Maithili`, `Malayalam`, `Manipuri`, `Marathi`, `Nepali`, `Odia`, `Punjabi`, `Sanskrit`, `Santali`, `Sindhi`, `Tamil`, `Telugu`, `Urdu`

## Quickstart

The following code snippet demonstrates how to use Sarvam-Translate using Transformers.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sarvamai/sarvam-translate"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda:0')

# Translation task
tgt_lang = "Hindi"
input_txt = "Be the change you wish to see in the world."

# Chat-style message prompt
messages = [
    {"role": "system", "content": f"Translate the text below to {tgt_lang}."},
    {"role": "user", "content": input_txt}
]

# Apply chat template to structure the conversation
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move input to model device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.01,
    num_return_sequences=1
)

# Keep only the newly generated tokens (drop the prompt) and decode them
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)

print("Input:", input_txt)
print("Translation:", output_text)
```

## vLLM Deployment

### Server:

```bash
vllm serve sarvamai/sarvam-translate --port 8000 --dtype bfloat16
```

### Client:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first (and only) model served by the vLLM instance
models = client.models.list()
model = models.data[0].id

tgt_lang = 'Hindi'
input_txt = 'Be the change you wish to see in the world.'
messages = [
    {"role": "system", "content": f"Translate the text below to {tgt_lang}."},
    {"role": "user", "content": input_txt}
]

response = client.chat.completions.create(model=model, messages=messages, temperature=0.01)
output_text = response.choices[0].message.content

print("Input:", input_txt)
print("Translation:", output_text)
```

## With Sarvam APIs

Refer to our [Python client documentation](https://pypi.org/project/sarvamai/).

Sample code:

```python
from sarvamai import SarvamAI

client = SarvamAI()

response = client.text.translate(
    input="Be the change you wish to see in the world.",
    source_language_code="en-IN",
    target_language_code="hi-IN",
    speaker_gender="Male",
    model="sarvam-translate:v1",
)
```

# 🚀 If you find these models useful

Help me test my **AI-Powered Quantum Network Monitor Assistant** with **quantum-ready security checks**:

👉 [Quantum Network Monitor](https://readyforquantum.com/?assistant=open&utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme)

The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): [Source Code Quantum Network Monitor](https://github.com/Mungert69).
You will also find the code I use to quantize the models, if you want to do it yourself, at [GGUFModelBuilder](https://github.com/Mungert69/GGUFModelBuilder).

💬 **How to test**:

Choose an **AI assistant type**:

- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

### **What I'm Testing**

I'm pushing the limits of **small open-source models for AI network monitoring**, specifically:

- **Function calling** against live network services
- **How small can a model go** while still handling:
  - Automated **Nmap security scans**
  - **Quantum-readiness checks**
  - **Network Monitoring tasks**

🟡 **TestLLM** – Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space):

- ✅ **Zero-configuration setup**
- ⏳ 30s load time (slow inference but **no API costs**). No token limits, as the cost is low.
- 🔧 **Help wanted!** If you're into **edge-device AI**, let's collaborate!

### **Other Assistants**

🟢 **TurboLLM** – Uses **gpt-4.1-mini**:

- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- **Create custom cmd processors to run .NET code on Quantum Network Monitor Agents**
- **Real-time network diagnostics and monitoring**
- **Security Audits**
- **Penetration testing** (Nmap/Metasploit)

🔵 **HugLLM** – Latest open-source models:

- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

### 💡 **Example commands you could test**:

1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` Note that you need to install a Quantum Network Monitor Agent to run the .NET code from. This is a very flexible and powerful feature. Use with caution!
### Final Word

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is [open source](https://github.com/Mungert69). Feel free to use whatever you find helpful.

If you appreciate the work, please consider [buying me a coffee](https://www.buymeacoffee.com/mahadeva) ☕. Your support helps cover service costs and allows me to raise token limits for everyone.

I'm also open to job opportunities or sponsorship.

Thank you! 😊