---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- legal
- immigration
- fine-tuned
- llama
- united-states
- lora
datasets:
- nshportun/usa-immigration-law-qa
pipeline_tag: text-generation
---

# USA Immigration Law — Llama 3.2 3B

Fine-tuned from [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
on the [nshportun/usa-immigration-law-qa](https://huggingface.co/datasets/nshportun/usa-immigration-law-qa)
dataset — **17,058 source-grounded Q&A pairs** covering all major U.S. immigration subdomains.

## Training Details

| Setting | Value |
|---------|-------|
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | **2.47** |

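Two values in the table can be cross-checked with simple arithmetic: perplexity is the exponential of the mean cross-entropy loss, and LoRA scales its low-rank update by alpha / r. The sketch below recomputes both. It assumes the reported eval loss is a per-token mean in nats and that standard (not rank-stabilized) LoRA scaling was used; the table states neither explicitly.

```python
import math

# Values copied from the Training Details table.
eval_loss = 0.903   # assumed: mean per-token cross-entropy, in nats
lora_r = 8
lora_alpha = 32

# Perplexity is exp(mean cross-entropy).
perplexity = math.exp(eval_loss)

# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scale = lora_alpha / lora_r

print(f"perplexity ~ {perplexity:.2f}")
print(f"LoRA scale = {lora_scale}")
```
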
## Benchmark Results

Evaluated on a stratified random sample of **101 questions** from the held-out eval set,
spanning the 13 immigration subdomains plus a General category (14 rows in the
per-subdomain tables below). Answers were scored 0–3 by an LLM judge
(Claude Sonnet 4.6) against reference answers from official sources.

**Scoring scale:** 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct
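
The two headline metrics reported below (mean score and percent fully correct) follow directly from this scale. A minimal sketch of the aggregation, assuming each answer receives exactly one integer score; `summarize` is a hypothetical helper and the scores are made up for illustration, not taken from the benchmark:

```python
from statistics import mean

def summarize(scores):
    """Return (mean score, percent fully correct) for a list of 0-3 judge scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("each score must be an integer from 0 to 3")
    # "Fully correct" means the judge assigned the top score of 3.
    pct_full = 100.0 * sum(s == 3 for s in scores) / len(scores)
    return mean(scores), pct_full

# Made-up scores for illustration only.
example_scores = [3, 0, 1, 2, 0, 3, 1, 0]
avg, pct = summarize(example_scores)
print(f"mean={avg:.2f}  fully correct={pct:.1f}%")
```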

- **Evaluation date:** 2026-05-17
- **Judge model:** us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
- **Eval set source:** nshportun/usa-immigration-law-qa, split=eval, seed=42
- **Fine-tuned model inference:** local CPU (transformers 5.8.1, bfloat16, device_map=cpu)

### Overall Scores

| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
|-------|-----------------|--------------------------|---|
| **Llama 3.2 3B fine-tuned (this model)** | **0.68** | **7.9%** | **101** |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |

**Why baselines matter:** Claude Sonnet 4.6 is a frontier model whose parameter
count is not public but is presumably far larger than this 3B model. Llama 3 8B
zero-shot achieves only 2.0% fully correct on these domain-specific questions,
which shows how difficult the task is. The fine-tuned 3B model achieves 7.9%
fully correct, outperforming the zero-shot 8B baseline on that metric despite
being about 2.7x smaller.
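
Because the sample holds only 101 questions, it helps to convert the percentages back into question counts before comparing them. The sketch below does that arithmetic; the parameter counts (8B and 3B) are nominal figures taken from the model names, which is an assumption about the exact sizes.

```python
n_questions = 101

# Percent fully correct, copied from the Overall Scores table.
pct_finetuned_3b = 7.9
pct_zero_shot_8b = 2.0

# Convert percentages back to whole-question counts.
count_3b = round(pct_finetuned_3b / 100 * n_questions)
count_8b = round(pct_zero_shot_8b / 100 * n_questions)

# Nominal parameter counts in billions, read off the model names.
size_ratio = 8 / 3

print(f"fully correct: {count_3b} vs. {count_8b} questions")
print(f"fully-correct ratio: {count_3b / count_8b:.1f}x")
print(f"size ratio: {size_ratio:.1f}x")
```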

### By Subdomain — Llama 3.2 3B Fine-tuned (this model)

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |

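The overall figures for this model can be recovered from the per-subdomain rows, which is a useful consistency check when the table is updated. This is a minimal sketch: it relies on every answer receiving an integer score, so rounding `mean * N` recovers each subdomain's score total despite the two-decimal rounding in the table.

```python
# (subdomain, mean score, % fully correct, N) copied from the table above.
rows = [
    ("Travel documents", 1.83, 33.3, 6),
    ("Naturalization", 1.13, 25.0, 8),
    ("Statistics", 1.13, 12.5, 8),
    ("Appeals", 1.00, 0.0, 3),
    ("Nonimmigrant visas", 0.88, 12.5, 8),
    ("Adjustment of status", 0.75, 0.0, 8),
    ("Employment authorization", 0.75, 12.5, 8),
    ("Asylum", 0.50, 12.5, 8),
    ("Admissibility", 0.38, 0.0, 8),
    ("Family-based immigration", 0.38, 0.0, 8),
    ("Humanitarian", 0.38, 0.0, 8),
    ("Removal", 0.38, 0.0, 8),
    ("General", 0.25, 0.0, 8),
    ("Employment-based (EB)", 0.00, 0.0, 4),
]

# Integer score totals and fully-correct counts per subdomain, then summed.
total_score = sum(round(m * n) for _, m, _, n in rows)
total_full = sum(round(p / 100 * n) for _, _, p, n in rows)
total_n = sum(n for *_, n in rows)

print(f"N = {total_n}")
print(f"mean score = {total_score / total_n:.2f}")
print(f"fully correct = {100 * total_full / total_n:.1f}%")
```
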
### By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |

### By Subdomain — Llama 3 8B Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |

### Key Observations

- **The task is genuinely hard:** Even Claude Sonnet 4.6 (a frontier model) scores
  only 1.47/3.0 mean and 25.7% fully correct. This reflects the citation-level
  precision that immigration procedural questions require.
- **Fine-tuning boosts the fully-correct rate:** The fine-tuned 3B model achieves 7.9%
  fully correct (8 of 101 questions) vs. 2.0% (2 of 101) for the zero-shot Llama 3 8B
  baseline. That is roughly a 4x improvement on exact correctness after 1 epoch of
  domain training, despite the model being about 2.7x smaller. With counts this small,
  the ratio is a rough estimate.
- **Strongest subdomains for the fine-tuned model:** travel documents (1.83), naturalization
  (1.13), statistics (1.13). Of these, only naturalization is heavily represented in the
  training data (~2,670 pairs); travel documents (~109) and statistics (~141) are among
  the smallest categories, so training-set size alone does not explain the ranking.
- **Weakest subdomains:** employment-based (0.00), general (0.25), and four subdomains
  tied at 0.38 (admissibility, family-based immigration, humanitarian, removal). These
  topics may require cross-referencing multiple USCIS form instructions or policy details.
- **Room for improvement:** The fine-tuned model's mean (0.68) is below the zero-shot
  Llama 3 8B baseline (0.80), suggesting that either 1-epoch training is insufficient or
  the model needs more specific instruction tuning rather than completion-style fine-tuning.

### Reproducing the Benchmark

```bash
# Clone repo, enter it, and install deps
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)
```

The benchmark script supports resume — it skips already-scored questions.
`random.seed(42)` ensures the same 101-question sample is selected each run.
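
The two behaviors described above, a seeded stratified draw and resume-by-skipping, can be sketched as follows. This is not the repository's code: the field names `id` and `subdomain` are hypothetical, and the sketch uses a local `random.Random(42)` instance instead of the global `random.seed(42)` so that unrelated code cannot change the draw.

```python
import json
import random
from collections import defaultdict
from pathlib import Path

def stratified_sample(items, per_group, seed=42):
    """Draw up to `per_group` items from each subdomain, reproducibly."""
    rng = random.Random(seed)  # local RNG, isolated from global random state
    groups = defaultdict(list)
    for item in items:
        groups[item["subdomain"]].append(item)
    sample = []
    for name in sorted(groups):  # fixed iteration order keeps the draw stable
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

def load_scored_ids(results_path):
    """Collect IDs already present in a JSONL results file (empty set if none)."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}

def pending(sample, results_path):
    """Return the sampled questions that still need to be scored."""
    done = load_scored_ids(results_path)
    return [q for q in sample if q["id"] not in done]

# Synthetic demo data; the real eval split and its field names may differ.
demo = [{"id": i, "subdomain": "asylum" if i % 2 else "removal"} for i in range(20)]
picked = stratified_sample(demo, per_group=3)
print(len(picked), "questions sampled")
```

Sorting the group names before sampling matters: without a fixed iteration order, the same seed could yield a different sample if the input rows arrive in a different order.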

## Immigration Subdomains Covered

| Subdomain | QA Pairs |
|-----------|----------|
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Data Sources

- **USCIS Policy Manual** — primary_official
- **USCIS Forms & Instructions** (I-130, I-485, I-765, N-400, I-589...) — primary_official
- **8 CFR / INA statute text** — primary_official
- **BIA Precedent Decisions** — primary_official
- **harshitha008/US-immigration-laws** (Apache 2.0) — secondary_reputable
- **Law StackExchange immigration posts** — community

## Intended Use

- RAG-based immigration legal assistants
- Domain-specific LLM benchmarking
- Immigration law Q&A research

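For the RAG use case, retrieved passages are typically placed in the prompt ahead of the question. The sketch below shows one way to build such chat messages; `build_rag_messages` and the prompt layout are hypothetical, not a format this model was trained on. Since training used a maximum input length of 512 tokens, keeping the number and length of passages small is likely a sensible default.

```python
SYSTEM_PROMPT = (
    "You are an expert on U.S. immigration law. "
    "Answer accurately based on USCIS, 8 CFR, and BIA sources."
)

def build_rag_messages(question, passages, max_passages=3):
    """Build chat messages that place retrieved passages ahead of the question."""
    # Number the passages so an answer can point back to them.
    context = "\n\n".join(
        f"[{i}] {text.strip()}"
        for i, text in enumerate(passages[:max_passages], start=1)
    )
    # Fall back to the bare question when nothing was retrieved.
    user_content = (
        f"Context:\n{context}\n\nQuestion: {question}" if context else question
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

# Placeholder passages; a real retriever would supply source text here.
messages = build_rag_messages(
    "What is the filing fee for Form I-485?",
    ["(retrieved passage 1)", "(retrieved passage 2)"],
)
print(messages[1]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` in the same way as the `messages` list in the Usage section.
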
## Disclaimer

This model is for **research and educational purposes only**.
It does not constitute legal advice. Immigration law is complex and
changes frequently — always consult a licensed immigration attorney.