---
language:
- hi
- bn
- ta
- te
- ml
- gu
- kn
- mr
- or
- pa
- as
- mai
- sa
- sd
license: apache-2.0
tags:
- causal-lm
- assistant
- reasoning
- multilingual
- indic
model_name: dheeyantra/dhee-nxtgen-qwen3-indic
library_name: transformers
---

# Dhee-NxtGen-Qwen3-Indic (4B)

## Model Description

**Dhee-NxtGen-Qwen3-Indic** is a **single, unified 4B-parameter multilingual large language model** developed by **DheeYantra** in collaboration with **NxtGen Cloud Technologies Pvt. Ltd.**

Built on the **Qwen3-4B architecture**, the model is designed to support **assistant-style conversations**, **reasoning**, and **function-calling-compatible workflows** across **14 Indian (Indic) languages** within one shared model.

The model is optimized for **native-script generation**, consistent behavior across languages, and cross-lingual generalization.

---

## Supported Languages

This single model supports the following Indic languages:

- Hindi (hi)
- Bengali (bn)
- Tamil (ta)
- Telugu (te)
- Malayalam (ml)
- Gujarati (gu)
- Kannada (kn)
- Marathi (mr)
- Odia (or)
- Punjabi (pa)
- Assamese (as)
- Maithili (mai)
- Sanskrit (sa)
- Sindhi (sd)

> Best results are achieved when prompts are written entirely in the target language.

---

## Key Features

- Single multilingual 4B model (no per-language checkpoints)
- Fluent, native-script text generation across 14 Indic languages
- Optimized for assistant-style and reasoning-based dialogue
- Supports summarization, Q&A, and long-form generation
- Compatible with function-calling-style prompting
- Fully compatible with Hugging Face Transformers
- Ready for high-throughput inference with vLLM

---

## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"
use_cuda = torch.cuda.is_available()

# 2. Load Model and Tokenizer
# Note: device_map="auto" requires the `accelerate` package.
print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if use_cuda else torch.float32,
)

# 3. Define the Prompt
# Using the ChatML format expected by Qwen-based architectures.
# User message (Hindi): "Can you book an appointment for me? If yes, please ask me
# for the required details such as date, time and purpose."
prompt = """<|im_start|>system
You are a helpful multilingual assistant.<|im_end|>
<|im_start|>user
क्या आप मेरे लिए एक अपॉइंटमेंट बुक कर सकते हैं? अगर हाँ, तो कृपया मुझसे ज़रूरी जानकारी जैसे तारीख, समय और उद्देश्य पूछिए।<|im_end|>
<|im_start|>assistant
"""

# 4. Process and Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("Generating response...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# 5. Decode and Print
# Decode only the newly generated tokens (the assistant's reply), not the prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

print("-" * 30)
print(f"Assistant: {response}")
print("-" * 30)
```

## Function/Tool Calling Example Usage

```python
import json
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- 1. MODEL SETUP ---
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)

# --- 2. TOOLS & SYSTEM PROMPT ---
tools = [{
    "name": "book_appointment",
    "description": "Book an appointment for the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
            "time": {"type": "string", "description": "Time in HH:MM (24h) format"},
            "purpose": {"type": "string", "description": "The medical reason or department"}
        },
        "required": ["date", "time", "purpose"]
    }
}]

# We provide the current date so the model can resolve "tomorrow" or "next Monday"
SYSTEM_PROMPT = f"""You are a helpful AI assistant.
Today's Date: 2026-01-08 (Thursday).

Available Tools:
{json.dumps(tools, indent=2)}

Rules:
1. If details (date, time, purpose) are missing, ask the user in Hindi.
2. If all details are present, output ONLY a JSON.
3. After a tool result is provided, confirm the booking to the user in Hindi."""

# --- 3. BACKEND FUNCTION ---
def execute_booking(date, time, purpose):
    # Simulated backend logic
    if "10:00" in time:
        # Simulate a busy slot ("This time is already booked. Please choose another time.")
        return {"status": "error", "message": "यह समय पहले से बुक है। कृपया कोई और समय चुनें।"}
    return {"status": "success", "id": "APP-9921", "doctor": "Dr. Verma"}

# --- 4. THE INTERACTION ENGINE ---
def run_conversation(user_input, history):
    # Add user input to history
    history.append({"role": "user", "content": user_input})

    # Construct ChatML prompt
    prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    for msg in history:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"

    # Generate (a low sampling temperature keeps tool calls predictable)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Check if the model wants to call a tool.
    # Assumption: calls are wrapped in <tool_call>...</tool_call> tags (the Qwen
    # convention); a bare JSON reply (rule 2 of the system prompt) is also accepted.
    tool_match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    tool_json = tool_match.group(1).strip() if tool_match else None
    if tool_json is None and response.startswith("{"):
        tool_json = response

    if tool_json:
        print(f"\n[MODEL REQUESTED TOOL]: {response}")
        call_data = json.loads(tool_json)

        # Execute the function
        result = execute_booking(**call_data["arguments"])
        print(f"[TOOL RESULT]: {result}")

        # Feed result back to model for final response (keep Hindi text unescaped)
        history.append({"role": "assistant", "content": response})
        history.append({"role": "system", "content": f"Tool Result: {json.dumps(result, ensure_ascii=False)}"})

        # Final confirmation generation
        return run_conversation("Please confirm the result to the user.", history)

    return response

# --- 5. TEST SCENARIO ---
chat_history = []
print("--- Chatbot Started (Today is 2026-01-08) ---")

# Step 1: User provides partial info ("Book a dentist appointment for me.")
query_1 = "मेरे लिए डेंटिस्ट का अपॉइंटमेंट बुक करें।"
print(f"\nUser: {query_1}")
res_1 = run_conversation(query_1, chat_history)
chat_history.append({"role": "assistant", "content": res_1})
print(f"Assistant: {res_1}")

# Step 2: User provides the rest ("Tomorrow at 2 in the afternoon.")
query_2 = "कल दोपहर 2 बजे।"
print(f"\nUser: {query_2}")
res_2 = run_conversation(query_2, chat_history)
print(f"Assistant: {res_2}")
```

---

## Prompting Guidelines

- Use pure native-language prompts for best fluency
- Avoid heavy code-mixing (e.g., Hinglish-heavy inputs)
- Include a system prompt to stabilize multilingual behavior
- Explicitly ask for step-by-step reasoning when it is needed

---

## Intended Uses & Limitations

### Intended Uses

- Multilingual Indic chatbots and AI assistants
- Education, governance, and public-sector AI applications
- Content generation and summarization in Indian languages
- Cross-lingual conversational and reasoning systems

### Limitations

- May occasionally hallucinate or produce inaccurate facts
- Output quality may vary across languages
- Not intended for medical, legal, or safety-critical use cases
- Code-mixed inputs may reduce output quality

---

## vLLM / High-Performance Serving

### Requirements

- NVIDIA GPU with compute capability ≥ 8.0 (A100 / H100 recommended)
- PyTorch 2.1+ with CUDA
- V100 (sm70) GPUs are not supported for vLLM GPU inference

### Installation

```bash
pip install torch transformers vllm sentencepiece
```

### Run vLLM Server

```bash
vllm serve dheeyantra/dhee-nxtgen-qwen3-indic --host 0.0.0.0 --port 8000
```

---

## License

Released under the **Apache 2.0 License**.

---

Developed by **DheeYantra** in collaboration with **NxtGen Cloud Technologies Pvt. Ltd.**
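## Querying the vLLM Server

The vLLM section above starts an OpenAI-compatible server but does not show how to call it. The following is a minimal sketch, assuming the server was launched with the `vllm serve` command above on `localhost:8000` and serves the model under its Hugging Face ID (vLLM's default when `--served-model-name` is not set). It POSTs to the `/v1/chat/completions` endpoint using only the Python standard library. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the sampling values simply mirror the Transformers example.

```python
import json
import urllib.error
import urllib.request

# Assumption: server started with `vllm serve ... --port 8000` on this machine.
BASE_URL = "http://localhost:8000/v1/chat/completions"
# Assumption: vLLM serves the model under its Hugging Face ID by default.
MODEL_NAME = "dheeyantra/dhee-nxtgen-qwen3-indic"


def build_chat_request(
    user_message: str,
    system_prompt: str = "You are a helpful multilingual assistant.",
    max_tokens: int = 256,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> dict:
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_chat_request(payload: dict, url: str = BASE_URL, timeout: float = 60.0) -> str:
    """POST the payload and return the assistant's reply text (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # OpenAI-compatible responses put the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Hindi: "Can you book an appointment for me?"
    payload = build_chat_request("क्या आप मेरे लिए एक अपॉइंटमेंट बुक कर सकते हैं?")
    try:
        print(send_chat_request(payload))
    except (urllib.error.URLError, TimeoutError) as exc:
        # Reached when no server is listening at BASE_URL
        print(f"Could not reach the vLLM server at {BASE_URL}: {exc}")
```

Because the chat-completions endpoint applies the model's chat template on the server, the client sends plain `messages` rather than a hand-built ChatML string.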