---
license: other
license_name: llama-3.2-community
license_link: https://www.llama.com/llama-downloads
base_model: meta-llama/Llama-3.2-3B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- llama
- llama-3
- meta
- causal-lm
- text-generation
---
# LumiChats v1.1

**A Fine-tuned Conversational AI Model Based on Llama 3.2 3B**

[![License](https://img.shields.io/badge/License-Llama%203.2-blue.svg)](https://llama.meta.com/llama3_2/license/) [![Model Size](https://img.shields.io/badge/Parameters-3B-green.svg)]() [![Base Model](https://img.shields.io/badge/Base-Llama%203.2%203B%20Instruct-orange.svg)](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)


---

## 📖 Overview

LumiChats v1.1 is a conversational AI model built on **Meta's Llama 3.2 3B Instruct**. It was fine-tuned with **LoRA (Low-Rank Adaptation)** using the **Unsloth** framework to improve conversational quality while keeping training and inference inexpensive.

- **Base Model:** [unsloth/Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)
- **Model Type:** Conversational AI / Instruction-tuned Language Model
- **Parameters:** 3.21 billion in the base model (3,237,063,680 total with the LoRA adapters attached)
- **Trainable Parameters:** 24,313,856 (~0.75% via LoRA)
- **Architecture:** Optimized Transformer with Auto-regressive Language Modeling

---

## ✨ Key Features

- **💬 Enhanced Conversational Abilities**: Fine-tuned on FineTome-100k for natural, engaging dialogue
- **🚀 Efficient & Fast**:
  - Up to 2x faster training and inference with Unsloth optimizations
  - 4-bit quantization for reduced memory footprint
  - Only 0.75% of parameters trained via LoRA
- **🌍 Multilingual Support**: 8 officially supported languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)
- **📱 Edge-Ready**: Small enough for deployment on edge devices and mobile platforms
- **🎯 Instruction Following**: Trained with a response-only objective, so the loss is computed on assistant turns only
- **🔒 Privacy-Focused**: Can run entirely on-device without cloud dependencies
- **⚡ Memory Efficient**: Training added only 2.35 GB of peak GPU memory, using gradient checkpointing

---

## 🏗️ Architecture Details

LumiChats v1.1 inherits the architecture of Llama 3.2 3B:

- **Model Type**: Auto-regressive transformer language model (LlamaForCausalLM)
- **Training Approach**:
  - Base: Supervised Fine-Tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)
  - Fine-tuning: LoRA adapters with response-only training
- **Context Length**: Up to 128,000 tokens, inherited from the base model (fine-tuning used `max_seq_length=2048`)
- **Vocabulary Size**: Multilingual Llama 3 tokenizer (128,256 tokens)
- **Optimization**: Base model produced by Meta with structured pruning and knowledge distillation; 4-bit quantization used during fine-tuning

### LoRA Configuration Details

- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 16
- **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **LoRA Dropout**: 0
- **Trainable Parameters**: 24,313,856 (0.75% of total 3.2B parameters)

---

## 🎯 Intended Use Cases

LumiChats v1.1 is intended for:

- **Conversational AI**: Natural dialogue and chat applications
- **Personal Assistants**: Task management and information retrieval
- **Content Generation**: Writing assistance and creative text generation
- **Summarization**: Document and conversation summarization
- **Question Answering**: Knowledge retrieval and Q&A systems
- **Code Assistance**: Basic coding help and explanations
- **On-Device Applications**: Mobile AI assistants and offline chatbots

---

## 🚀 Quick Start

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "adityakum667388/lumichats-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

# Apply the chat template and move the token IDs to the model's device
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate response
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

### Using Unsloth for Inference (Fastest)

```python
from unsloth import FastLanguageModel

# Load model with Unsloth (2x faster inference)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="adityakum667388/lumichats-v1.1",
    max_seq_length=2048,
    dtype=None,         # Auto-detect
    load_in_4bit=True,  # Memory efficient
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Chat template
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=128,
    do_sample=True,  # Make sampling explicit so temperature and min_p take effect
    temperature=1.5,
    min_p=0.1
)

# Prints the full sequence (prompt and response), including special tokens
print(tokenizer.batch_decode(outputs))
```

### Chat Template Format

LumiChats v1.1 uses the Llama 3.1 chat template format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

**Special Tokens:**

- `<|begin_of_text|>` - Beginning of sequence
- `<|start_header_id|>` - Start of role header
- `<|end_header_id|>` - End of role header
- `<|eot_id|>` - End of turn
- `<|finetune_right_pad_id|>` - Padding token

### Using GGUF Format (llama.cpp)

```python
from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="lumichats-v1.1-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1  # Use GPU acceleration
)

# Format prompt with chat template
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is machine learning?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

# Generate response, stopping at the Llama 3 end-of-turn tokens
output = llm(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    stop=["<|eot_id|>", "<|end_of_text|>"]
)
print(output['choices'][0]['text'])
```

### Using Ollama

```bash
# Pull the model (if available on Ollama)
ollama pull lumichats-v1.1

# Run inference
ollama run lumichats-v1.1 "Explain quantum computing in simple terms"
```

---

## 📦 Available Model Formats

| Format | Size | Precision | Use Case |
|--------|------|-----------|----------|
| **SafeTensors (FP16)** | ~6.5 GB | 16-bit floating point | Training, fine-tuning, highest quality |
| **GGUF (Q4_K_M)** | ~2.0 GB | 4-bit quantized | **Recommended** - Best balance of size/quality |
| **GGUF (Q5_K_M)** | ~2.3 GB | 5-bit quantized | Higher quality, slightly larger |
| **GGUF (Q8_0)** | ~3.5 GB | 8-bit quantized | Near-full quality |
| **GGUF (F16)** | ~6.4 GB | 16-bit GGUF | Maximum compatibility |
| **LoRA Adapters** | ~100 MB | Adapter weights only | For merging with base model |

**Recommendation**: For most users, **Q4_K_M** offers the best tradeoff between model size and output quality.

---

## 💻 Hardware Requirements

### Minimum Requirements

- **RAM**: 4 GB (for Q4_K_M quantized version)
- **GPU**: Optional, but recommended (4 GB+ VRAM)
- **Storage**: 2-7 GB depending on format

### Recommended Setup

- **RAM**: 8 GB or more
- **GPU**: NVIDIA GPU with 6 GB+ VRAM (RTX 3060, T4, or better)
- **CPU**: Modern multi-core processor (for CPU inference)

### Performance Estimates

- **GPU (T4)**: 20-40 tokens/second
- **GPU (T4 with Unsloth)**: 40-80 tokens/second (2x faster)
- **GPU (RTX 4090)**: 60-100+ tokens/second
- **CPU (High-end)**: 5-15 tokens/second

---

## 🎨 Training Details

### Training Configuration

LumiChats v1.1 was fine-tuned with the following setup:

**Framework & Optimization:**

- **Base Model**: unsloth/Llama-3.2-3B-Instruct
- **Training Framework**: Unsloth 2026.1.4 (optimized fine-tuning)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Quantization**: 4-bit during training (`load_in_4bit=True`)
- **Gradient Checkpointing**: Unsloth-optimized for memory efficiency

**Dataset & Preprocessing:**

- **Dataset**: mlabonne/FineTome-100k
- **Format**: ShareGPT → HuggingFace chat format
- **Chat Template**: Llama 3.1 template
- **Training Objective**: Response-only training (masks user inputs)

**Hardware & Performance:**

- **GPU**: Tesla T4 (Max memory: 14.741 GB)
- **Peak Memory Usage**: 2.35 GB additional for training
- **Training Time**: 8.54 minutes (512 seconds) for 60 steps
- **Speed**: Roughly 2x faster than standard PyTorch training, per Unsloth's optimizations

### Training Hyperparameters

```python
training_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "effective_batch_size": 8,  # 2 per device x 4 accumulation steps
    "warmup_steps": 5,
    "max_steps": 60,
    "learning_rate": 2e-4,
    "optimizer": "adamw_8bit",
    "weight_decay": 0.001,
    "lr_scheduler_type": "linear",
    "max_seq_length": 2048,
    "dtype": "float16",
    "seed": 3407
}
```

### Why This Approach is Superior

1. **Efficiency**: Only 0.75% of parameters are trained, which cuts gradient and optimizer-state memory by over 99% relative to full fine-tuning
2. **Speed**: Unsloth optimizations provide up to 2x faster training and inference
3. **Memory**: 4-bit quantization + gradient checkpointing enables training on consumer GPUs
4. **Quality**: Response-only training focuses learning on generating high-quality outputs
5. **Versatility**: Multiple export formats (HuggingFace, GGUF) for diverse deployment scenarios

The model builds upon Llama 3.2's foundation, which was pretrained on up to **9 trillion tokens** from publicly available sources and further refined through supervised fine-tuning and RLHF alignment.

---

## 📊 Performance & Benchmarks

LumiChats v1.1 inherits the performance characteristics of Llama 3.2 3B, with fine-tuning aimed at conversational quality. The points below are qualitative; this card does not report benchmark scores.

- **MMLU** (Massive Multitask Language Understanding): Competitive performance
- **AGIEval** (General AI evaluation): Strong reasoning capabilities
- **ARC-Challenge** (AI2 Reasoning Challenge): Improved over base model
- **Instruction Following**: Tuned for response quality on FineTome-100k
- **Multilingual** dialogue tasks: Consistent across the 8 supported languages
- **Conversational Quality**: Enhanced coherence and context awareness

Meta reports that the base Llama 3.2 3B model outperforms similar-sized models like Gemma 2 2.6B and Phi 3.5-mini on instruction following and summarization. LumiChats v1.1 builds on that base for conversational tasks, while keeping the efficiency advantages of LoRA and quantization.
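
The trainable-parameter count and training budget quoted in this card can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the layer dimensions (`HIDDEN`, `INTERMEDIATE`, `NUM_LAYERS`, `KV_DIM`) are assumptions taken from the published Llama 3.2 3B configuration rather than from this card, and the helper `lora_param_count` is a hypothetical name introduced here.

```python
"""Sanity-check the LoRA and training figures quoted in this card."""

# Assumed Llama 3.2 3B dimensions (from the published model config, not this card).
HIDDEN = 3072          # hidden_size
INTERMEDIATE = 8192    # intermediate_size of the MLP
NUM_LAYERS = 28        # num_hidden_layers
KV_DIM = 8 * 128       # num_key_value_heads * head_dim

# (in_features, out_features) of each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count(rank=16)
total = 3_237_063_680  # total parameter count quoted in this card

print(f"trainable LoRA parameters: {trainable:,}")        # 24,313,856
print(f"share of total: {100 * trainable / total:.2f}%")  # 0.75%

# Training budget implied by the hyperparameters listed in this card.
effective_batch = 2 * 4               # per-device batch size x gradient accumulation
examples_seen = effective_batch * 60  # max_steps = 60
print(f"examples seen during fine-tuning: {examples_seen}")  # 480
```

Under these assumptions the rank-16 adapters account for exactly the 24,313,856 trainable parameters listed above. The 60-step run sees 480 examples, which is roughly 0.5% of the dataset if FineTome-100k holds about 100,000 conversations.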

---

## 🌐 Supported Languages

Official support for 8 languages:

- 🇬🇧 English
- 🇩🇪 German
- 🇫🇷 French
- 🇮🇹 Italian
- 🇵🇹 Portuguese
- 🇮🇳 Hindi
- 🇪🇸 Spanish
- 🇹🇭 Thai

*Note: The base model was trained on additional languages, and the model can be fine-tuned for other languages as needed.*

---

## ⚖️ Limitations & Considerations

- **Context Understanding**: May struggle with very long contexts despite the 128k-token capacity, since fine-tuning used sequences of at most 2,048 tokens
- **Factual Accuracy**: Can occasionally generate plausible but incorrect information
- **Bias**: May reflect biases present in training data
- **Specialized Knowledge**: Not optimized for highly technical or domain-specific tasks
- **Real-time Information**: No access to current events (knowledge cutoff applies)
- **Safety**: Should be deployed with appropriate content filtering and monitoring
- **LoRA Constraints**: Trained parameters are limited to the attention and MLP projection layers; embeddings and the output head are unchanged

---

## 🔒 Responsible AI & Safety

LumiChats v1.1 is built on Llama 3.2's safety foundations:

- Trained with safety alignment through RLHF (base model)
- Designed to decline harmful requests
- Tested for bias and fairness across languages
- Implements content filtering guidelines
- Uses response-only training, so the model is not trained to reproduce user-provided text

**Developers should**:

- Implement additional safety layers for production use
- Test thoroughly for their specific use case
- Monitor outputs for quality and appropriateness
- Follow the Llama 3.2 Acceptable Use Policy
- Be aware that fine-tuning may affect safety properties

---

## 📜 License

This model is released under the **Llama 3.2 Community License**.

- ✅ Commercial use permitted
- ✅ Modification and derivative works allowed
- ✅ Distribution allowed with attribution
- ⚠️ Subject to Llama 3.2 Acceptable Use Policy

Please review the full license at: [Llama 3.2 License](https://llama.meta.com/llama3_2/license/)

---

## 🙏 Acknowledgments

- **Meta AI** for developing and releasing Llama 3.2
- **Unsloth AI** for the efficient fine-tuning framework and optimizations
- **Maxime Labonne** for the FineTome-100k dataset
- **Hugging Face** for model hosting and transformers library
- The open-source AI community for tools and support

---

## 📞 Contact & Support

- **Model Page**: [huggingface.co/adityakum667388/lumichats-v1.1](https://huggingface.co/adityakum667388/lumichats-v1.1)
- **LoRA Adapters**: [huggingface.co/adityakum667388/lumichats-lora](https://huggingface.co/adityakum667388/lumichats-lora)
- **Issues**: Report bugs or request features via the Community tab
- **Creator**: [@adityakum667388](https://huggingface.co/adityakum667388)

---

## 🔄 Version History

**v1.1** (Current)

- Initial release
- Fine-tuned on Llama 3.2 3B Instruct with LoRA
- Trained on FineTome-100k dataset
- Optimized for conversational tasks
- Multiple export formats available (SafeTensors, GGUF, LoRA adapters)
- 2x faster inference with Unsloth
- Peak training memory: 2.35 GB on Tesla T4

---

## 📚 Citation

If you use LumiChats v1.1 in your research or applications, please cite:

```bibtex
@misc{lumichats2025,
  author = {Aditya Kumar},
  title = {LumiChats v1.1: A Fine-tuned Conversational AI Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/adityakum667388/lumichats-v1.1}},
  note = {Fine-tuned using Unsloth and LoRA on FineTome-100k}
}
```

And the base model:

```bibtex
@article{llama32,
  title={Llama 3.2: Advancing Efficient and Accessible AI},
  author={Meta AI},
  year={2024},
  url={https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```

And Unsloth:

```bibtex
@software{unsloth2024,
  author = {Unsloth AI},
  title = {Unsloth: Fast and Memory-Efficient Finetuning},
  year = {2024},
  url = {https://github.com/unslothai/unsloth}
}
```

---

**Built with ❤️ using Llama 3.2 3B | Powered by Unsloth | Trained on FineTome-100k**

⭐ If you find this model useful, please consider giving it a star!