ModelHub XC ddadd9e847 Initialize project; model provided by the ModelHub XC community
Model: ankur1423/fine-tune-test
Source: Original Platform
2026-06-21 08:56:16 +08:00

---
language:
- en
license: llama3
license_link: https://llama.meta.com/llama3/license/
library_name: transformers
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- llama-3
- lora
- fine-tuned
- solar-energy
- text-generation
- mlx
- apple-silicon
pipeline_tag: text-generation
---
# Solar FAQ — Llama-3.1-8B LoRA Fine-tune
A **Llama-3.1-8B-Instruct** model fine-tuned with LoRA on a solar energy FAQ dataset
using [MLX-LM](https://github.com/ml-explore/mlx-examples/tree/main/llms) on Apple Silicon.
| | |
|---|---|
| Base model | `meta-llama/Meta-Llama-3.1-8B-Instruct` |
| Format | float16 safetensors (safe — no pickle) |
| Size | ~15 GB (float16) |
| Fine-tune method | LoRA rank 8, 8 layers |
| Domain | Solar energy FAQ |
| Languages | English |
> **Smaller version available:** [GGUF Q4_K_M (4.6 GB)](https://huggingface.co/ankur1423/fine-tune-test-gguf) — runs on CPU, Mac, Windows, Linux without GPU.
---
## Model Overview
This model is a LoRA fine-tune experiment on top of Meta's Llama-3.1-8B-Instruct,
trained on a small domain-specific solar energy FAQ dataset (68 Q&A pairs: 62 for training, 6 for validation).
It answers questions about solar products, manufacturing processes, and company operations.
Outside the training domain it falls back to standard Llama-3.1 behaviour.
### What it can do
- Answer questions covered by the solar energy FAQ it was trained on
- Explain solar manufacturing concepts (BOM, PPC, audits, etc.)
- Provide concise, professional responses to domain-specific queries
- Multi-turn conversation with context retention
### What it cannot do
- Act as a general-purpose assistant (use base Llama-3.1 for that)
- Image / audio / video understanding
- Real-time or internet-connected queries
---
## Getting Started
### Installation
```bash
# NVIDIA GPU or CPU (bitsandbytes is only needed for the 4-bit examples, which target NVIDIA GPUs):
pip install transformers torch accelerate bitsandbytes
# Apple Silicon (recommended — faster with MLX):
pip install mlx-lm
```
### Quick Inference (transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ankur1423/fine-tune-test"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # auto: GPU if available, else CPU
)

def ask(question: str) -> str:
    messages = [
        {"role": "system",
         "content": "You are a knowledgeable assistant for a solar energy company. "
                    "Answer questions accurately about solar products, manufacturing, and company operations."},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # The rendered template already starts with <|begin_of_text|>; don't add a second BOS.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.1,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(ask("What is a BOM?"))
print(ask("What is PPC in solar manufacturing?"))
print(ask("Why are internal audits important?"))
```
### 4-bit Quantized Inference (saves ~10 GB of memory)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "ankur1423/fine-tune-test"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# Reuse the ask() function from above; this uses ~5 GB VRAM instead of ~15 GB
```
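
The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch under stated assumptions, not a measurement: it assumes Llama-3.1-8B has about 8.03 billion parameters, that the 128,256 × 4,096 token embedding and the output head (`lm_head`) stay in float16 because bitsandbytes only quantizes the linear layers and transformers leaves `lm_head` unquantized by default, and that NF4 with double quantization averages at most ~4.5 bits per weight. Activations, the KV cache and framework overhead come on top.

```python
# Back-of-the-envelope weight memory for Llama-3.1-8B (weights only;
# activations, KV cache and framework overhead are NOT included).
# All constants are assumptions based on the public base-model config.
TOTAL_PARAMS = 8_030_261_248          # approximate parameter count
VOCAB, HIDDEN = 128_256, 4_096
UNQUANTIZED = 2 * VOCAB * HIDDEN      # token embedding + lm_head kept in float16
GIB = 2**30


def fp16_gib(params: int) -> float:
    """Size of `params` weights stored as float16 (2 bytes each), in GiB."""
    return params * 2 / GIB


def nf4_gib(total: int, unquantized: int, bits_per_weight: float = 4.5) -> float:
    """Quantized linear weights at ~bits_per_weight; the rest stays float16."""
    quantized = total - unquantized
    return (quantized * bits_per_weight / 8 + unquantized * 2) / GIB


full = fp16_gib(TOTAL_PARAMS)
four_bit = nf4_gib(TOTAL_PARAMS, UNQUANTIZED)
print(f"float16 weights: {full:.1f} GiB")
print(f"4-bit weights:   {four_bit:.1f} GiB")
print(f"difference:      {full - four_bit:.1f} GiB")
```

Under these assumptions the weights come to roughly 15 GiB in float16 and between 5 and 6 GiB in 4-bit, which is consistent with the ~15 GB and ~5 GB figures quoted above.
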
### Apple Silicon — MLX (fastest on Mac)
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("ankur1423/fine-tune-test")
SYSTEM = "You are a knowledgeable assistant for a solar energy company."

def ask(question: str) -> str:
    # Hand-built Llama-3 chat prompt, same layout as the training records.
    prompt = (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        + SYSTEM + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        + question + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    return generate(
        model, tokenizer,
        prompt=prompt,
        max_tokens=512,
        sampler=make_sampler(temp=0.1, top_p=0.9),
    )

print(ask("What is a BOM?"))
```
### Multi-turn Chat (transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "ankur1423/fine-tune-test"
SYSTEM = "You are a knowledgeable assistant for a solar energy company."

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    ),
    device_map="auto",
)

history = [{"role": "system", "content": SYSTEM}]
while True:
    user = input("You: ").strip()
    if not user or user.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user})
    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    # The rendered template already includes <|begin_of_text|>; avoid a second BOS.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=512,
            temperature=0.1, top_p=0.9, do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens.
    response = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    print(f"Assistant: {response}\n")
    history.append({"role": "assistant", "content": response})
```
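
The loop above appends every turn to `history`, so the prompt keeps growing for as long as the chat runs. One way to bound it is a small helper such as the sketch below. `trim_history` and `max_turns` are hypothetical names, not part of transformers, and the sketch assumes it is called at the end of each loop iteration (for example `history = trim_history(history, max_turns=4)` right after the assistant reply is appended), so that the non-system messages alternate user/assistant.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(history: List[Message], max_turns: int) -> List[Message]:
    """Keep the system message plus the last `max_turns` user/assistant pairs.

    Hypothetical helper: assumes the non-system messages alternate
    user, assistant, user, assistant, ... (call it after appending a reply).
    """
    system = [m for m in history if m["role"] == "system"][:1]
    dialogue = [m for m in history if m["role"] != "system"]
    # Guard max_turns == 0: dialogue[-0:] would return the whole list.
    keep = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + keep
```
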
---
## Platform Support
| Platform | Method | RAM / VRAM | Speed |
|----------|--------|-----------|-------|
| Mac M1/M2/M3/M4 | MLX (4-bit) | 5 GB | Fast |
| NVIDIA GPU (Linux/Windows) | transformers 4-bit | 5–6 GB VRAM | Fast |
| Google Colab T4 | transformers 4-bit | ~6 GB VRAM | Fast |
| Kaggle P100 | transformers 4-bit | ~6 GB VRAM | Fast |
| CPU — any OS | transformers float16 | 16 GB RAM | Slow |
| **Any platform (recommended)** | **[GGUF 4.6 GB](https://huggingface.co/ankur1423/fine-tune-test-gguf)** | **6 GB RAM** | **Fast/OK** |
> **Tip:** For CPU or low-VRAM machines, use the [GGUF version](https://huggingface.co/ankur1423/fine-tune-test-gguf): 4.6 GB and no GPU needed, at the cost of some precision lost to 4-bit quantization.
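
The recommendations in the table can be condensed into a small decision helper. This is a hypothetical sketch: the function name and the returned labels are made up for illustration, and CUDA availability has to be passed in by the caller (for example from `torch.cuda.is_available()` when PyTorch is installed), so the snippet itself only needs the standard library.

```python
import platform


def recommended_method(system: str, machine: str, has_cuda: bool) -> str:
    """Map a machine to a loading method from the Platform Support table.

    Hypothetical helper; `has_cuda` must be supplied by the caller.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx"                # Apple Silicon: MLX
    if has_cuda:
        return "transformers-4bit"  # NVIDIA GPU: bitsandbytes 4-bit
    return "gguf"                   # CPU or low VRAM: GGUF Q4_K_M


print(recommended_method(platform.system(), platform.machine(), has_cuda=False))
```
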
---
## Recommended Generation Parameters
| Parameter | Value | Notes |
|-----------|-------|-------|
| `temperature` | 0.1 | Low → factual, consistent |
| `top_p` | 0.9 | Nucleus sampling |
| `max_new_tokens` | 256–512 | FAQ answers are concise |
| `do_sample` | True | Enables sampling; without it `temperature` and `top_p` are ignored (greedy decoding) |
> **Tip:** Raise `temperature` to 0.5–0.7 for more varied or creative responses.
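
To see why a low `temperature` makes answers consistent, here is a minimal, illustrative sketch of temperature scaling followed by top-p (nucleus) filtering over a toy three-token vocabulary. It models the idea rather than reproducing the exact transformers implementation, and the token scores are invented for illustration. With these toy scores and `top_p=0.9`, temperature 0.1 leaves a single candidate, 0.7 leaves two and 1.0 leaves all three.

```python
import math
from typing import Dict, List


def nucleus_candidates(logits: Dict[str, float], temperature: float, top_p: float) -> List[str]:
    """Tokens that survive temperature scaling + top-p filtering (temperature > 0)."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract the max for a numerically stable softmax
    exp = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((e / total, tok) for tok, e in exp.items()), reverse=True)
    kept, cumulative = [], 0.0
    for p, tok in probs:  # most likely first
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    return kept


toy_logits = {"panel": 2.0, "module": 1.5, "cell": 0.5}  # invented scores
for t in (0.1, 0.7, 1.0):
    print(t, nucleus_candidates(toy_logits, temperature=t, top_p=0.9))
```
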
---
## Prompt Format
This model uses the **Llama-3 chat template** with `<|eot_id|>` as the stop token.
`tokenizer.apply_chat_template()` handles formatting automatically.
Raw format:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a knowledgeable assistant for a solar energy company.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is a BOM?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```
> **Stop token:** `<|eot_id|>`
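
`tokenizer.apply_chat_template()` is the preferred way to format prompts. When working with a raw-text backend, the same layout can be assembled by hand. The sketch below mirrors the training-record layout shown under Dataset; `build_prompt` and `cut_at_stop` are hypothetical helper names, and `cut_at_stop` is only needed when the backend returns text past the stop token.

```python
from typing import Sequence, Tuple

BOS = "<|begin_of_text|>"
EOT = "<|eot_id|>"  # end-of-turn marker, also the stop token


def header(role: str) -> str:
    """Role header followed by the blank line the template expects."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def build_prompt(system: str, question: str,
                 turns: Sequence[Tuple[str, str]] = ()) -> str:
    """Raw Llama-3 prompt: system, optional earlier turns, then the new question."""
    parts = [BOS, header("system"), system, EOT]
    for user_msg, assistant_msg in turns:  # earlier exchanges, oldest first
        parts += [header("user"), user_msg, EOT,
                  header("assistant"), assistant_msg, EOT]
    # End with an open assistant header so the model writes the answer next.
    parts += [header("user"), question, EOT, header("assistant")]
    return "".join(parts)


def cut_at_stop(generated: str) -> str:
    """Drop everything from the first <|eot_id|> onward."""
    return generated.split(EOT, 1)[0].strip()
```
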
---
## Training Details
### Fine-tuning Process
The model was fine-tuned using **LoRA (Low-Rank Adaptation)**: only a small set of adapter
weights is trained while the base model weights stay frozen. This keeps compute and memory
requirements low enough to train on a 16 GB MacBook in about 20 minutes.
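
To put "a small set of adapter weights" into numbers: LoRA leaves a frozen weight W (d_out × d_in) untouched and learns its update as the product of two small matrices B (d_out × r) and A (r × d_in), so each adapted projection adds r·(d_in + d_out) trainable parameters. The sketch counts them for the rank and layer count in the training table. It assumes Llama-3.1-8B shapes (hidden size 4,096; 8 key/value heads of dimension 128) and that only `q_proj` and `v_proj` are adapted in each of the 8 layers. The card only says "attention layers", and which projections MLX-LM targets depends on its version and config.

```python
# Assumed Llama-3.1-8B shapes (grouped-query attention).
HIDDEN = 4096
KV_DIM = 8 * 128                 # output width of k_proj / v_proj
RANK = 8                         # LoRA rank, from the training table
LAYERS = 8                       # adapted layers, from the training table
BASE_PARAMS = 8_030_261_248      # approximate total for Llama-3.1-8B


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumption: adapters on q_proj (4096 -> 4096) and v_proj (4096 -> 1024) only.
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = per_layer * LAYERS
print(f"trainable: {trainable:,} ({100 * trainable / BASE_PARAMS:.4f}% of base)")
```

Under those assumptions the adapters hold well under 1 million parameters, roughly 0.01% of the base model.
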
| | |
|---|---|
| Base model | `meta-llama/Meta-Llama-3.1-8B-Instruct` |
| Fine-tuning method | LoRA |
| LoRA rank | 8 |
| LoRA layers | 8 (attention layers) |
| Dataset size | 62 train + 6 validation (68 total Q&A pairs) |
| Iterations | 300 |
| Learning rate | 1e-4 (cosine decay → 1e-5) |
| Warmup steps | 30 |
| Batch size | 2 |
| Max sequence length | 1024 tokens |
| Framework | [MLX-LM](https://github.com/ml-explore/mlx-examples) |
| Training hardware | MacBook M4 16 GB unified memory |
| Training time | ~20 minutes |
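
The learning-rate schedule in the table can be written out as a function. The exact warmup shape MLX-LM uses is not stated in this card, so the linear ramp below, and spreading the cosine decay over the 270 iterations after warmup, are assumptions. The sketch also computes how many passes over the 62 training examples 300 iterations at batch size 2 represent.

```python
import math

PEAK_LR, FINAL_LR = 1e-4, 1e-5   # from the table
WARMUP, ITERS = 30, 300           # warmup steps, total iterations
BATCH, TRAIN_EXAMPLES = 2, 62


def lr_at(step: int) -> float:
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < WARMUP:
        # Assumption: ramp linearly up to PEAK_LR over the warmup steps.
        return PEAK_LR * (step + 1) / WARMUP
    progress = min((step - WARMUP) / (ITERS - WARMUP), 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))


# Each iteration sees BATCH examples, so 300 iterations is ~9.7 epochs.
epochs = ITERS * BATCH / TRAIN_EXAMPLES

for step in (0, 29, 165, 300):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
print(f"epochs over the training set: {epochs:.1f}")
```

Roughly ten passes over a 62-example set is worth keeping in mind when reading the validation loss, since a dataset this small is easy to memorize.
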
### Dataset
The training dataset consists of 68 solar energy FAQ Q&A pairs covering topics such as:
- Bill of Materials (BOM) and procurement
- Production Planning & Control (PPC)
- Solar panel manufacturing processes
- Quality control and internal audits
- Company operations and workflows

Data format: Llama-3 chat template, one Q&A pair per record:
```json
{"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n[system]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[question]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[answer]<|eot_id|>"}
```
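
A minimal sketch of producing such records from plain question/answer pairs and splitting them 62/6 as in the training table. It assumes MLX-LM's `lora` command reads `train.jsonl` and `valid.jsonl` from the directory passed to `--data` (check the docs for your installed version). `to_record`, `write_split`, the random split and the fixed seed are illustrative choices; the card does not say how the validation pairs were picked.

```python
import json
import random
from pathlib import Path
from typing import List, Tuple

SYSTEM = "You are a knowledgeable assistant for a solar energy company."


def to_record(question: str, answer: str, system: str = SYSTEM) -> dict:
    """One Q&A pair -> one {"text": ...} record in the Llama-3 layout above."""
    text = (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{answer}<|eot_id|>"
    )
    return {"text": text}


def write_split(pairs: List[Tuple[str, str]], out_dir: Path,
                n_valid: int = 6, seed: int = 0) -> Tuple[int, int]:
    """Shuffle, hold out `n_valid` pairs, write train.jsonl and valid.jsonl."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)  # reproducible split
    valid, train = shuffled[:n_valid], shuffled[n_valid:]
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in (("train.jsonl", train), ("valid.jsonl", valid)):
        with open(out_dir / name, "w", encoding="utf-8") as f:
            for q, a in rows:
                f.write(json.dumps(to_record(q, a), ensure_ascii=False) + "\n")
    return len(train), len(valid)
```
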
---
## Ethics and Safety
- Model is domain-specific and not a general-purpose assistant
- Answers are based on training data — verify critical information independently
- Not intended for medical, legal, or financial advice
- Solar energy domain only — out-of-domain queries fall back to base Llama-3 behaviour
- Inherits all safety characteristics of the base `meta-llama/Meta-Llama-3.1-8B-Instruct` model
---
## Usage and Limitations
### Intended Use
- Solar energy company FAQ chatbot
- Internal knowledge base assistant
- Learning / research on domain-specific LoRA fine-tuning with MLX
### Out-of-Scope Use
- General-purpose assistant (use base Llama-3.1 instead)
- Medical, legal, or financial advice
- Real-time data retrieval (model has no internet access)
- Languages other than English
### Known Limitations
- Small dataset (68 pairs), so it may not generalize to all solar topics
- English only
- Float16 weights need ~15 GB of disk and roughly 16 GB of VRAM or RAM to load; 4-bit loading brings this down to ~5–6 GB of VRAM
- Apple Silicon only for MLX inference (use transformers on other platforms)
---
## License
This model is derived from Meta Llama 3.1, which is licensed under the
[Meta Llama 3 Community License](https://llama.meta.com/llama3/license/).
Use is subject to Meta's acceptable use policy.