---
language:
- en
license: llama3.1
license_link: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
library_name: transformers
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- llama-3
- lora
- fine-tuned
- solar-energy
- text-generation
- mlx
- apple-silicon
pipeline_tag: text-generation
---

# Solar FAQ: Llama-3.1-8B LoRA Fine-tune

A **Llama-3.1-8B-Instruct** model fine-tuned with LoRA on a solar energy FAQ dataset using [MLX-LM](https://github.com/ml-explore/mlx-examples/tree/main/llms) on Apple Silicon.

| | |
|---|---|
| Base model | `meta-llama/Meta-Llama-3.1-8B-Instruct` |
| Format | float16 safetensors (safe, no pickle) |
| Size | ~15 GB (float16) |
| Fine-tune method | LoRA rank 8, 8 layers |
| Domain | Solar energy FAQ |
| Languages | English |

> **Smaller version available:** [GGUF Q4_K_M (4.6 GB)](https://huggingface.co/ankur1423/fine-tune-test-gguf) runs on CPU, Mac, Windows, and Linux without a GPU.

---

## Model Overview

This model is an experimental LoRA fine-tune of Meta's Llama-3.1-8B-Instruct, trained on a small domain-specific solar energy FAQ dataset (68 Q&A pairs: 62 for training, 6 for validation). It answers questions about solar products, manufacturing processes, and company operations. Outside the training domain it falls back to standard Llama-3.1 behaviour.

### What it can do

- Answer solar energy FAQ questions covered by the training data
- Explain solar manufacturing concepts (BOM, PPC, audits, etc.)
- Provide concise, professional responses to domain-specific queries
- Hold multi-turn conversations with context retention

### What it cannot do

- Act as a general-purpose assistant (use base Llama-3.1 for that)
- Understand images, audio, or video
- Answer real-time or internet-connected queries

---

## Getting Started

### Installation

```bash
# GPU (NVIDIA) or CPU (bitsandbytes is only needed for the 4-bit examples):
pip install transformers torch accelerate bitsandbytes

# Apple Silicon (recommended, faster with MLX):
pip install mlx-lm
```

### Quick Inference (transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ankur1423/fine-tune-test"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # auto: GPU if available, else CPU
)

def ask(question: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a knowledgeable assistant for a solar energy company. "
                       "Answer questions accurately about solar products, manufacturing, and company operations.",
        },
        {"role": "user", "content": question},
    ]
    # Render the Llama-3 chat template as text, ending with the assistant header.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # The template already starts with <|begin_of_text|>, so do not add a second BOS token.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.1,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(ask("What is a BOM?"))
print(ask("What is PPC in solar manufacturing?"))
print(ask("Why are internal audits important?"))
```

### 4-bit Quantized Inference (saves ~10 GB VRAM)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "ankur1423/fine-tune-test"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Same ask() function as above; uses ~5 GB VRAM instead of 15 GB
```

### Apple Silicon: MLX (fastest on Mac)

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("ankur1423/fine-tune-test")

SYSTEM = "You are a knowledgeable assistant for a solar energy company."

def ask(question: str) -> str:
    # Llama-3 chat format written out by hand; each header is followed by a blank line.
    prompt = (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n" + SYSTEM + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + question + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    return generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=512,
        sampler=make_sampler(temp=0.1, top_p=0.9),
    )

print(ask("What is a BOM?"))
```

### Multi-turn Chat (transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "ankur1423/fine-tune-test"
SYSTEM = "You are a knowledgeable assistant for a solar energy company."

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    ),
    device_map="auto",
)

# The full history is re-sent on every turn so the model keeps context.
history = [{"role": "system", "content": SYSTEM}]

while True:
    user = input("You: ").strip()
    if not user or user.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user})
    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    # The template already starts with <|begin_of_text|>, so do not add a second BOS token.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.1,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    print(f"Assistant: {response}\n")
    history.append({"role": "assistant", "content": response})
```

---

## Platform Support

| Platform | Method | RAM / VRAM | Speed |
|----------|--------|-----------|-------|
| Mac M1/M2/M3/M4 | MLX (4-bit) | 5 GB | Fast |
| NVIDIA GPU (Linux/Windows) | transformers 4-bit | 5–6 GB VRAM | Fast |
| Google Colab T4 | transformers 4-bit | ~6 GB VRAM | Fast |
| Kaggle P100 | transformers 4-bit | ~6 GB VRAM | Fast |
| CPU (any OS) | transformers float16 | 16 GB RAM | Slow |
| **Any platform (recommended)** | **[GGUF 4.6 GB](https://huggingface.co/ankur1423/fine-tune-test-gguf)** | **6 GB RAM** | **Fast/OK** |

> **Tip:** For CPU or low-VRAM machines, use the [GGUF version](https://huggingface.co/ankur1423/fine-tune-test-gguf): similar quality, 4.6 GB, no GPU needed.

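
The memory figures in the table follow roughly from parameter count times bytes per weight. The sketch below is a back-of-the-envelope estimate, assuming about 8.03 billion parameters for Llama-3.1-8B and counting weights only; activations, the KV cache, and quantization metadata add to these numbers at run time. `weight_memory_gib` is an illustrative helper, not part of any library. Under these assumptions the weights come to about 15 GiB in float16 and a little under 4 GiB at a nominal 4 bits, which is in line with the ~15 GB and 5–6 GB rows once run-time overhead is included.

```python
# Rough weight-only memory estimate; ignores activations, KV cache,
# and quantization overhead.
N_PARAMS = 8_030_000_000  # assumed approximate parameter count for Llama-3.1-8B


def weight_memory_gib(n_params: int, bits_per_weight: float) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


fp16_gib = weight_memory_gib(N_PARAMS, 16)  # float16 safetensors
int4_gib = weight_memory_gib(N_PARAMS, 4)   # nominal 4-bit quantization

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:   {int4_gib:.1f} GiB")
```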
---

## Recommended Generation Parameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| `temperature` | 0.1 | Low → factual, consistent |
| `top_p` | 0.9 | Nucleus sampling |
| `max_new_tokens` | 256–512 | FAQ answers are concise |
| `do_sample` | True | Required for `temperature` and `top_p` to take effect |

Raise `temperature` to 0.5–0.7 for more varied / creative responses.

---

## Prompt Format

This model uses the **Llama-3 chat template** with `<|eot_id|>` as the stop token. `tokenizer.apply_chat_template()` handles formatting automatically.

Raw format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a knowledgeable assistant for a solar energy company.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What is a BOM?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

```

The line breaks after `<|eot_id|>` are for readability only; in the actual prompt each `<|eot_id|>` is followed directly by the next `<|start_header_id|>`.

---

## Training Details

### Fine-tuning Process

The model was fine-tuned using **LoRA (Low-Rank Adaptation)**: only a small set of adapter weights is trained, while the base model weights stay frozen. This keeps compute and memory requirements low enough that training ran on a 16 GB MacBook.

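
To give a sense of how small the adapter is, the sketch below counts LoRA parameters for rank 8 across 8 layers. It assumes the adapters sit on the `q_proj` and `v_proj` attention projections and uses Llama-3.1-8B's published shapes (hidden size 4096, 8 key/value heads of dimension 128). The target modules are an assumption, since this card does not list them, and `lora_params` is an illustrative helper. Under these assumptions the adapter has fewer than one million trainable parameters.

```python
# Count trainable LoRA parameters: each adapted (d_in x d_out) weight matrix
# gains two low-rank factors, A (d_in x r) and B (r x d_out).
RANK = 8          # LoRA rank from this card
NUM_LAYERS = 8    # number of adapted layers from this card
HIDDEN = 4096     # Llama-3.1-8B hidden size
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter of the given rank."""
    return rank * (d_in + d_out)


# Assumption: adapters on q_proj (4096 -> 4096) and v_proj (4096 -> 1024) only.
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * NUM_LAYERS

print(f"{total:,} trainable parameters")
print(f"{total / 8.03e9:.4%} of an ~8.03B-parameter model")
```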

| | |
|---|---|
| Base model | `meta-llama/Meta-Llama-3.1-8B-Instruct` |
| Fine-tuning method | LoRA |
| LoRA rank | 8 |
| LoRA layers | 8 (attention layers) |
| Dataset size | 62 train + 6 validation (68 total Q&A pairs) |
| Iterations | 300 |
| Learning rate | 1e-4 (cosine decay → 1e-5) |
| Warmup steps | 30 |
| Batch size | 2 |
| Max sequence length | 1024 tokens |
| Framework | [MLX-LM](https://github.com/ml-explore/mlx-examples) |
| Training hardware | MacBook M4, 16 GB unified memory |
| Training time | ~20 minutes |

### Dataset

The training dataset consists of 68 solar energy FAQ Q&A pairs (62 train, 6 validation) covering topics such as:

- Bill of Materials (BOM) and procurement
- Production Planning & Control (PPC)
- Solar panel manufacturing processes
- Quality control and internal audits
- Company operations and workflows

Data format: Llama-3 chat template, one Q&A pair per record:

```json
{"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n[system]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[question]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[answer]<|eot_id|>"}
```

---

## Ethics and Safety

- The model is domain-specific and not a general-purpose assistant
- Answers are based on training data; verify critical information independently
- Not intended for medical, legal, or financial advice
- Solar energy domain only; out-of-domain queries fall back to base Llama-3.1 behaviour
- Inherits the safety characteristics of the base `meta-llama/Meta-Llama-3.1-8B-Instruct` model

---

## Usage and Limitations

### Intended Use

- Solar energy company FAQ chatbot
- Internal knowledge base assistant
- Learning / research on domain-specific LoRA fine-tuning with MLX

### Out-of-Scope Use

- General-purpose assistant (use base Llama-3.1 instead)
- Medical, legal, or financial advice
- Real-time data retrieval (model has no internet access)
- Languages other than English

### Known Limitations

- Small dataset (68 pairs): may not generalize to all solar
  topics
- English only
- Float16 weights need ~15 GB of disk and ~16 GB of RAM or VRAM to load unquantized; 4-bit loading reduces this to ~5–6 GB VRAM
- MLX inference runs on Apple Silicon only (use transformers on other platforms)

---

## License

This model is derived from Meta Llama 3.1, which is licensed under the [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE). Use is subject to Meta's acceptable use policy.