---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- legal
- immigration
- fine-tuned
- llama
- united-states
- lora
datasets:
- nshportun/usa-immigration-law-qa
pipeline_tag: text-generation
---

# USA Immigration Law — Llama 3.2 3B

Fine-tuned from [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
on the [nshportun/usa-immigration-law-qa](https://huggingface.co/datasets/nshportun/usa-immigration-law-qa)
dataset — **17,058 source-grounded Q&A pairs** covering all major U.S. immigration subdomains.

## Training Details

| Setting | Value |
|---------|-------|
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24 GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | **2.47** |

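The eval perplexity in the table is consistent with the eval loss, assuming the loss is the mean per-token cross-entropy in nats. That is the usual convention for causal LM training, but this card does not state it. A quick check under that assumption:

```python
import math

# Eval loss from the Training Details table. Assumed (not stated in the
# card) to be the mean per-token cross-entropy in nats.
eval_loss = 0.903

# Perplexity is the exponential of the mean cross-entropy.
eval_perplexity = math.exp(eval_loss)

print(f"{eval_perplexity:.2f}")
```
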
## Benchmark Results

Evaluated on a stratified random sample of **101 questions** from the held-out eval set,
spanning the 13 immigration subdomains and a General category. An LLM judge
(Claude Sonnet 4.6) scored each answer from 0 to 3 against a reference answer from official sources.

**Scoring scale:** 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct

- **Evaluation date:** 2026-05-17
- **Judge model:** us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
- **Eval set source:** nshportun/usa-immigration-law-qa, split=eval, seed=42
- **Fine-tuned model inference:** local CPU (transformers 5.8.1, bfloat16, device_map=cpu)

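The two headline metrics reported for each model, mean score and percentage fully correct, follow directly from the 0 to 3 judge scores. The sketch below shows one way to compute them; the record fields `model` and `score` are hypothetical and may not match the actual layout of the benchmark output files.

```python
from collections import defaultdict

def summarize(records):
    """Aggregate 0-3 judge scores into per-model mean and % fully correct."""
    scores_by_model = defaultdict(list)
    for rec in records:
        scores_by_model[rec["model"]].append(rec["score"])

    summary = {}
    for model, scores in scores_by_model.items():
        n = len(scores)
        summary[model] = {
            "mean": round(sum(scores) / n, 2),
            # Only a score of 3 counts as fully correct.
            "pct_fully_correct": round(100 * sum(s == 3 for s in scores) / n, 1),
            "n": n,
        }
    return summary

# Toy records, made up for illustration (not benchmark data).
toy = [
    {"model": "demo", "score": 3},
    {"model": "demo", "score": 1},
    {"model": "demo", "score": 0},
    {"model": "demo", "score": 0},
]
print(summarize(toy))
```
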
### Overall Scores

| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
|-------|-----------------|--------------------------|---|
| **Llama 3.2 3B fine-tuned (this model)** | **0.68** | **7.9%** | **101** |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (same model family) | 0.80 | 2.0% | 101 |

**Why baselines matter:** Claude Sonnet 4.6 is a frontier model, assumed here to be on the
order of 100x larger than this 3B model. Llama 3 8B zero-shot gets only 2 of 101 questions
(2.0%) fully correct, which shows how difficult these domain-specific questions are. The
fine-tuned 3B model gets 8 of 101 (7.9%) fully correct, outperforming the zero-shot 8B
baseline on that metric despite being about 2.7x smaller. On mean score, however, it trails
that baseline (0.68 vs. 0.80).

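The fully-correct rates rest on small counts (8 and 2 out of 101), so it can help to look at their sampling uncertainty. The sketch below computes a 95% Wilson score interval for each rate. The choice of interval is an assumption of this example and is not part of the benchmark scripts.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts implied by the Overall Scores table: 7.9% and 2.0% of 101 questions.
ft_lo, ft_hi = wilson_interval(8, 101)      # fine-tuned 3B
base_lo, base_hi = wilson_interval(2, 101)  # Llama 3 8B zero-shot

print(f"fine-tuned 3B: {ft_lo:.3f} to {ft_hi:.3f}")
print(f"Llama 3 8B:    {base_lo:.3f} to {base_hi:.3f}")
```

Under these assumptions the two intervals overlap, so the gap in fully-correct rate is suggestive rather than conclusive at this sample size.
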
### By Subdomain — Llama 3.2 3B Fine-tuned (this model)

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |

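The per-subdomain rows for the fine-tuned model can be cross-checked against the overall scores. Because every judge score is an integer, each subdomain's score total is a whole number, so rounding `mean * N` recovers it from the two-decimal means. A minimal sketch of that check, with the rows copied from the fine-tuned table:

```python
# (subdomain, mean score, % fully correct, N) for the fine-tuned model.
rows = [
    ("Travel documents", 1.83, 33.3, 6),
    ("Naturalization", 1.13, 25.0, 8),
    ("Statistics", 1.13, 12.5, 8),
    ("Appeals", 1.00, 0.0, 3),
    ("Nonimmigrant visas", 0.88, 12.5, 8),
    ("Adjustment of status", 0.75, 0.0, 8),
    ("Employment authorization", 0.75, 12.5, 8),
    ("Asylum", 0.50, 12.5, 8),
    ("Admissibility", 0.38, 0.0, 8),
    ("Family-based immigration", 0.38, 0.0, 8),
    ("Humanitarian", 0.38, 0.0, 8),
    ("Removal", 0.38, 0.0, 8),
    ("General", 0.25, 0.0, 8),
    ("Employment-based (EB)", 0.00, 0.0, 4),
]

# Recover integer totals per subdomain, undoing the rounding in the table.
total_n = sum(n for _, _, _, n in rows)
total_score = sum(round(mean * n) for _, mean, _, n in rows)
total_full = sum(round(pct / 100 * n) for _, _, pct, n in rows)

overall_mean = round(total_score / total_n, 2)
overall_pct_full = round(100 * total_full / total_n, 1)
print(total_n, overall_mean, overall_pct_full)
```

This gives 101 questions, a mean of 0.68, and 7.9% fully correct, matching the Overall Scores row for this model.
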
### By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |

### By Subdomain — Llama 3 8B Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |

### Key Observations

- **The task is genuinely hard:** Even Claude Sonnet 4.6 (a frontier model) scores
  only 1.47/3.0 mean and 25.7% fully correct. This reflects the highly specific,
  citation-level precision required by immigration procedural questions.
- **Fine-tuning boosts fully-correct rate:** The fine-tuned 3B model achieves 7.9%
  fully correct (8 of 101) vs. 2.0% (2 of 101) for zero-shot Llama 3 8B, roughly a 4x
  improvement on exact correctness despite being about 2.7x smaller and having only
  1 epoch of domain training. With counts this small, the ratio is indicative rather than precise.
- **Strongest subdomains for fine-tuned model:** travel documents (1.83), naturalization
  (1.13), statistics (1.13). Of these, only naturalization is heavily represented in the
  training data (~2,670 pairs); travel documents (~109) and statistics (~141) are among
  the smallest subdomains, so training-set size alone does not explain these scores.
- **Weakest subdomains:** employment-based (0.00), general (0.25), and a four-way tie at
  0.38 (admissibility, family-based immigration, humanitarian, removal). These topics
  require cross-referencing multiple USCIS form instructions or policy details.
- **Room for improvement:** The fine-tuned model's mean (0.68) is below that of zero-shot
  Llama 3 8B (0.80), suggesting that either one epoch of training is insufficient or the
  model needs more specific instruction tuning rather than completion-style fine-tuning.

### Reproducing the Benchmark

```bash
# Clone the repo and install dependencies
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)
```

The benchmark script supports resuming: it skips questions that have already been scored.
`random.seed(42)` ensures the same 101-question sample is selected on each run.

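The reproducible, stratified draw described above can be sketched as follows. This is an illustration only: the field name `subdomain` and the cap of 8 questions per subdomain are assumptions (the cap is consistent with the N column in the result tables, but the script's actual logic may differ), and the sketch uses a local `random.Random(42)` rather than the global `random.seed(42)` the script is described as using.

```python
import random
from collections import defaultdict

def stratified_sample(items, per_group, seed=42):
    """Pick up to `per_group` items from each subdomain, reproducibly."""
    rng = random.Random(seed)  # local RNG; global random state is untouched
    groups = defaultdict(list)
    for item in items:
        groups[item["subdomain"]].append(item)

    sample = []
    for name in sorted(groups):  # fixed iteration order keeps the draw stable
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Toy eval set (hypothetical records): one large and one small subdomain.
toy = [{"id": i, "subdomain": "asylum"} for i in range(20)]
toy += [{"id": 100 + i, "subdomain": "appeals"} for i in range(3)]

first = stratified_sample(toy, per_group=8)
second = stratified_sample(toy, per_group=8)
print(len(first), first == second)
```

Subdomains with fewer than `per_group` questions contribute everything they have, which is one plausible reason some rows in the tables have N below 8.
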
## Immigration Subdomains Covered

| Subdomain | QA Pairs |
|-----------|----------|
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# `dtype` is the argument name in recent transformers releases;
# older releases use `torch_dtype` instead.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]

# Render the chat template as text, ending with the assistant header.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The template already includes the BOS token, so do not add it a second time.
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
# Greedy decoding for deterministic output.
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Data Sources

- **USCIS Policy Manual** — primary_official
- **USCIS Forms & Instructions** (I-130, I-485, I-765, N-400, I-589...) — primary_official
- **8 CFR / INA statute text** — primary_official
- **BIA Precedent Decisions** — primary_official
- **harshitha008/US-immigration-laws** (Apache 2.0) — secondary_reputable
- **Law StackExchange immigration posts** — community

## Intended Use

- RAG-based immigration legal assistants
- Domain-specific LLM benchmarking
- Immigration law Q&A research

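For the RAG use case, retrieved passages can be placed in the user turn ahead of the question. The helper below is a hypothetical sketch: `build_rag_messages` and the prompt wording are illustrative and not part of this repository, and the retriever is left out entirely. The system prompt is the one from the Usage example.

```python
SYSTEM_PROMPT = (
    "You are an expert on U.S. immigration law. "
    "Answer accurately based on USCIS, 8 CFR, and BIA sources."
)

def build_rag_messages(question, passages):
    """Build chat messages that place numbered passages before the question."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    user_content = (
        "Use only the numbered passages below to answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

# Placeholder passages; a real retriever would supply source text here.
messages = build_rag_messages(
    "Which form is used to apply for adjustment of status?",
    ["<retrieved passage 1>", "<retrieved passage 2>"],
)
print(messages[1]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` as in the Usage example. Since training used a maximum input length of 512 tokens, keeping the retrieved context short is probably advisable.
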
## Disclaimer

This model is for **research and educational purposes only**.
It does not constitute legal advice. Immigration law is complex and
changes frequently — always consult a licensed immigration attorney.