Files
DigitalForensicsText2SQLite/README.md
ModelHub XC 27f12a6312 初始化项目,由ModelHub XC社区提供模型
Model: pawlaszc/DigitalForensicsText2SQLite
Source: Original Platform
2026-05-07 07:34:52 +08:00

395 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- sql
- forensics
- text-to-sql
- llama
- fine-tuned
base_model: unsloth/Llama-3.2-3B-Instruct
datasets:
- pawlaszc/mobile-forensics-sql
metrics:
- accuracy
model-index:
- name: ForensicSQL-Llama-3.2-3B
results:
- task:
type: text-to-sql
name: Text-to-SQL Generation
dataset:
type: mobile-forensics
name: Mobile Forensics SQL Dataset
metrics:
- type: accuracy
value: 91.0
name: Overall Accuracy
- type: accuracy
value: 95.1
name: Easy Queries Accuracy
- type: accuracy
value: 87.5
name: Medium Queries Accuracy
- type: accuracy
value: 88.9
name: Hard Queries Accuracy
---
# ForensicSQL-Llama-3.2-3B
## Model Description
**ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized
for generating SQLite queries from natural language requests against mobile forensic databases.
The model converts investigative questions into executable SQL queries across a wide range of
forensic artefact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
Instagram, blockchain wallets, and many more.
This model was developed as part of a research project and accompanying journal paper
investigating LLM fine-tuning for forensic database analysis, and is integrated into
[FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic
analysis tool.
> **Key result:** 93.0% execution accuracy on a 100-example held-out test set — within
> 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions
> (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
> with no internet connectivity required.
## Model Details
| Property | Value |
|---|---|
| **Base Model** | meta-llama/Llama-3.2-3B-Instruct |
| **Fine-tuning Method** | Full fine-tune (bf16) |
| **Training Dataset** | SQLiteDS — 800 training examples, 191 forensic artifact categories |
| **Training Framework** | Hugging Face Transformers |
| **Best Val Loss** | 0.3043 (7 epochs) |
| **Model Size (bf16)** | ~6 GB |
| **Hardware Required** | 16 GB unified memory (Apple M-series) or equivalent GPU |
## Performance
### Overall Results (fixed dataset, n=100, best configuration)
| Metric | Value |
|---|---|
| **Overall Accuracy** | **93.0%** (93/100) |
| 95% CI (Wilson) | [86.3%, 96.6%] |
| Executable Queries | 94/100 |
| GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) |
| Base Model (no fine-tuning) | 35.0% |
| Improvement over base | +56 pp |
### Accuracy by Query Difficulty
| Difficulty | Accuracy | n | 95% CI | vs. GPT-4o |
|---|---|---|---|---|
| Easy (single-table) | **95.1%** | 39/41 | [83.9%, 98.7%] | 0.0 pp |
| Medium (joins, aggregation) | **87.5%** | 28/32 | [71.9%, 95.0%] | 0.0 pp |
| Hard (CTEs, window functions) | **88.9%** | 24/27 | [71.9%, 96.1%] | 3.7 pp |
ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap
is concentrated on Hard queries (complex CTEs, window functions, multi-table joins).
### Accuracy by Forensic Domain
| Domain | Accuracy | n | 95% CI |
|---|---|---|---|
| Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] |
| Android Artifacts | **100.0%** | 17/18 | [74.2%, 99.0%] |
| Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] |
| iOS CoreData | **92.0%** | 21/25 | [65.3%, 93.6%] |
| Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] |
### Prompt Configuration Ablation
| Configuration | Overall | Easy | Medium | Hard | iOS |
|---|---|---|---|---|---|
| **WITHOUT App Name** ★ | **93.0%** | **95.1%** | 87.5% | **88.9%** | 92.0% |
| WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** |
★ Primary configuration — omitting the application name from the prompt yields
3 pp higher overall accuracy. Interestingly, including the app name helps iOS
CoreData schemas (+4 pp) but hurts Hard queries (7.4 pp); the primary
configuration without app name is recommended for general use.
### Post-Processing Pipeline Contribution
| Component | Queries saved |
|---|---|
| Execution feedback (retry) | 7 |
| Alias normalization | 18 |
| Column corrections (Levenshtein) | 2 |
### Training Progression
| Configuration | Val Loss | Accuracy | Δ |
|---|---|---|---|
| Base model (no fine-tuning) | — | 35.0% | — |
| Fine-tuned, no augmentation | — | 68.0% | +33 pp |
| + Data augmentation (2.4×) | — | 74.0% | +6 pp |
| + Extended training (7 epochs) | 0.3617 | 92.0% | +10 pp |
| + Post-processing pipeline | 0.3617 | 87.0% | +3 pp |
| + Execution feedback | 0.3617 | 90.0% | +3 pp |
| + Corrected training dataset (v5) | **0.3043** | **93.0%** | +1 pp |
## Intended Use
### Primary Use Cases
- Mobile forensics investigations: automated SQL query drafting against seized device databases
- Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows)
- Research in domain-specific Text-to-SQL
- Educational use for learning forensic database analysis
### Important: This Model is a Drafting Assistant
> **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate queries
> that require review by a practitioner with sufficient SQL knowledge before any reliance
> is placed on their results. The 93.0% accuracy means approximately **1 in 14 queries
> contains an error**. In court-admissible or case-critical work, all outputs must be
> independently validated.
### Out-of-Scope Use
- Autonomous forensic decision-making without human review
- General-purpose SQL generation outside the forensic domain
- Non-SQLite databases (PostgreSQL, MySQL, etc.)
## How to Use
### Quick Start (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model.eval()
schema = """
CREATE TABLE message (
ROWID INTEGER PRIMARY KEY,
text TEXT,
handle_id INTEGER,
date INTEGER,
is_from_me INTEGER,
cache_has_attachments INTEGER
);
CREATE TABLE handle (
ROWID INTEGER PRIMARY KEY,
id TEXT,
service TEXT
);
"""
request = "Find all messages received in the last 7 days that contain attachments"
# Note: do NOT use apply_chat_template — use plain-text prompt
prompt = f"""Generate a valid SQLite query for this forensic database request.
Database Schema:
{schema}
Request: {request}
SQLite Query:
"""
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=300,
do_sample=False, # greedy decoding — do not change
)
input_length = inputs['input_ids'].shape[1]
sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(sql.strip())
```
> **Important:** Use plain-text tokenization (do **not** call `apply_chat_template`).
> The model was trained and evaluated with a plain-text prompt format.
> Use `do_sample=False` (greedy decoding) for reproducible results.
### Python Helper Class
```python
class ForensicSQLGenerator:
def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
self.model.eval()
def generate_sql(self, schema: str, request: str) -> str:
prompt = (
"Generate a valid SQLite query for this forensic database request.\n\n"
f"Database Schema:\n{schema}\n\n"
f"Request: {request}\n\n"
"SQLite Query:\n"
)
inputs = self.tokenizer(
prompt, return_tensors="pt", truncation=True, max_length=4096
)
inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
input_length = inputs["input_ids"].shape[1]
with torch.no_grad():
outputs = self.model.generate(
**inputs, max_new_tokens=300, do_sample=False
)
sql = self.tokenizer.decode(
outputs[0][input_length:], skip_special_tokens=True
)
# Return first statement only, normalized
return sql.strip().split("\n")[0].strip().rstrip(";") + ";"
# Usage
generator = ForensicSQLGenerator()
sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours")
print(sql)
```
### With Ollama / llama.cpp (GGUF)
```bash
# With llama.cpp
./llama-cli -m forensic-sql-q4_k_m.gguf \
--temp 0 \
-p "Generate a valid SQLite query for this forensic database request.
Database Schema:
CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);
Request: Find all messages sent after midnight
SQLite Query:"
# With Ollama — create a Modelfile
cat > Modelfile << 'EOF'
FROM ./forensic-sql-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_predict 300
EOF
ollama create forensic-sql -f Modelfile
ollama run forensic-sql
```
## Training Details
### Dataset — SQLiteDS
- **Total examples:** 1,000 (800 train / 100 val / 100 test), fixed random seed 42
- **Forensic artifact categories:** 191
- **Reference query validation:** All 1,000 reference queries validated for execution
correctness against in-memory SQLite; 50 queries (5%) corrected before final training
- **Augmentation:** 3.4× expansion via instruction paraphrasing, WHERE clause reordering,
and LIMIT injection — augmented examples confined to training split only
- **Dataset:** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
- **License:** CC BY 4.0
### Hyperparameters
| Parameter | Value |
|---|---|
| Training method | Full fine-tune (no LoRA) |
| Precision | bfloat16 |
| Epochs | 7 |
| Learning rate | 2e-5 (peak) |
| LR scheduler | Cosine with warmup |
| Batch size | 1 + gradient accumulation 4 |
| Max sequence length | 4096 |
| Optimizer | AdamW |
| Hardware | Apple M-series, 16 GB unified memory |
| Training time | ~17.6 hours |
| Best val loss | 0.3043 (epoch 7) |
## Limitations
### Known Issues
1. **iOS CoreData Schemas (92.0%):** The Z-prefix column naming convention
(e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column
names alone, making these schemas harder to reason about.
2. **Hard Queries — 3.7 pp gap to GPT-4o:** Complex CTEs, recursive queries,
and window functions are the primary remaining challenge.
3. **Finance & Crypto (81.8%, n=11):** Small test set; confidence intervals are
wide. Interpret with caution.
4. **~1 in 11 error rate:** Approximately 9% of generated queries will contain
errors. Expert review of all outputs is required before use in investigations.
### When Human Review is Especially Important
- Complex multi-table queries with CTEs or window functions
- Case-critical or court-admissible investigations
- Any query that will be used to draw conclusions about a suspect
- Queries involving rare or unusual forensic artifact schemas
## Evaluation
- **Test set:** 100 examples, held-out, seed=42, non-augmented
- **Metric:** Execution accuracy — query is correct iff it executes without error
AND returns a result set identical to the reference query
- **Reference validation:** All reference queries validated for execution correctness
before evaluation; 5 broken queries in the test set were corrected
- **Evaluation script:** Available in the dataset repository on Zenodo ([DOI])
## Citation
If you use this model or the SQLiteDS dataset in your research, please cite:
```bibtex
@article{pawlaszczyk2026forsqlitelm,
author = {Dirk Pawlaszczyk},
title = {AI-Based Automated SQL Query Generation for SQLite Databases
in Mobile Forensics},
journal = {Forensic Science International: Digital Investigation},
year = {2026},
note = {FSIDI-D-26-00029}
}
```
## License
Apache 2.0 — following the base Llama 3.2 license terms.
## Acknowledgments
- Base model: Meta's Llama 3.2-3B-Instruct
- Training framework: Hugging Face Transformers
- Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite)
- Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses)
## Additional Resources
- **Dataset (Zenodo):** [SQLiteDS — DOI to be added on publication]
- **Dataset (HuggingFace):** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
- **FQLite integration:** [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite)
- **Paper:** FSIDI-D-26-00029 (under review)
---
**Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use.
All generated SQL queries must be reviewed by a qualified practitioner before
execution in live forensic investigations. The authors accept no liability for
incorrect conclusions drawn from unvalidated model outputs.