Model: nshportun/usa-immigration-llama-3.2-3b Source: Original Platform
| language | license | base_model | library_name | tags | datasets | pipeline_tag |
|---|---|---|---|---|---|---|
| | llama3.2 | meta-llama/Llama-3.2-3B-Instruct | transformers | | | text-generation |
USA Immigration Law — Llama 3.2 3B
Fine-tuned from meta-llama/Llama-3.2-3B-Instruct on the nshportun/usa-immigration-law-qa dataset — 17,058 source-grounded Q&A pairs covering all major U.S. immigration subdomains.
Training Details
| Setting | Value |
|---|---|
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | 2.47 |
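The figures in this table are internally consistent, which a few lines of arithmetic can confirm. The sketch below uses only values from the table. The variable names are illustrative, and it assumes that the reported perplexity is the exponential of the eval loss and that the LoRA update is scaled by alpha / r, as in the standard LoRA formulation; the actual training script is not shown here.

```python
import math

# Values copied from the training details table.
train_pairs = 16_065
eval_pairs = 993
eval_loss = 0.903
lora_r, lora_alpha = 8, 32

# Train and eval pairs together make up the full dataset.
total_pairs = train_pairs + eval_pairs

# Perplexity is exp(mean cross-entropy loss, measured in nats).
eval_perplexity = math.exp(eval_loss)

# Standard LoRA scales the low-rank update B @ A by alpha / r before it is
# merged into the base weight: W' = W + (alpha / r) * B @ A.
lora_scale = lora_alpha / lora_r

print(total_pairs)                # 17058
print(round(eval_perplexity, 2))  # 2.47
print(lora_scale)                 # 4.0
```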
Benchmark Results
Evaluated on a stratified random sample of 101 questions from the held-out eval set, spanning the 14 categories listed in the tables below (the 13 immigration subdomains plus General). Answers were scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference answers from official sources.
Scoring scale: 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct
Evaluation date: 2026-05-17
Judge model: us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
Eval set source: nshportun/usa-immigration-law-qa, split=eval, seed=42
Fine-tuned model inference: local CPU (transformers 5.8.1, bfloat16, device_map=cpu)
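Both reported metrics follow directly from the per-question judge scores. The helper below is a minimal sketch of that aggregation, not the benchmark script itself; the function name `summarize` and the toy scores are made up for illustration.

```python
from statistics import mean


def summarize(scores):
    """Return (mean score, % fully correct) for a list of 0-3 judge scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("each score must be 0, 1, 2, or 3")
    # Only a score of 3 counts as fully correct.
    fully_correct = sum(1 for s in scores if s == 3)
    return round(mean(scores), 2), round(100 * fully_correct / len(scores), 1)


# Toy scores for illustration only, not benchmark data.
toy_scores = [3, 2, 0, 1, 3]
print(summarize(toy_scores))  # (1.8, 40.0)
```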
Overall Scores
| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
|---|---|---|---|
| Llama 3.2 3B fine-tuned (this model) | 0.68 | 7.9% | 101 |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |
Why baselines matter: Claude Sonnet 4.6 is a frontier model far larger than this 3B model. Llama 3 8B zero-shot achieves only 2.0% fully correct on these domain-specific questions, which establishes the difficulty of the task. The fine-tuned 3B model achieves 7.9% fully correct, outperforming the zero-shot 8B baseline on that metric despite being about 2.7x smaller, although its mean score is lower (0.68 vs. 0.80).
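With a sample of 101 questions, the percentages above correspond to small absolute counts. The short calculation below converts them back to question counts and derives the ratios quoted in this card. It uses only figures from the overall scores table; the variable names are illustrative.

```python
n_questions = 101

# Percent fully correct, from the overall scores table.
pct_finetuned_3b = 7.9
pct_zero_shot_8b = 2.0

# Convert percentages back to whole-question counts.
count_finetuned_3b = round(pct_finetuned_3b / 100 * n_questions)
count_zero_shot_8b = round(pct_zero_shot_8b / 100 * n_questions)

# Ratio of fully-correct answers, and ratio of nominal parameter counts.
fully_correct_ratio = count_finetuned_3b / count_zero_shot_8b
size_ratio = 8 / 3

print(count_finetuned_3b, count_zero_shot_8b)  # 8 2
print(fully_correct_ratio)                     # 4.0
print(round(size_ratio, 2))                    # 2.67
```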
By Subdomain — Llama 3.2 3B Fine-tuned (this model)
| Subdomain | Mean Score | % Fully Correct | N |
|---|---|---|---|
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |
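The per-subdomain rows can be checked against the overall figures. The sketch below recovers integer point totals from the rounded means in the table above and recomputes the overall mean and fully-correct rate. The data is copied from the table; the rounding step is an assumption about how the published values were produced.

```python
# (subdomain, mean score, % fully correct, N) copied from the table above.
rows = [
    ("Travel documents", 1.83, 33.3, 6),
    ("Naturalization", 1.13, 25.0, 8),
    ("Statistics", 1.13, 12.5, 8),
    ("Appeals", 1.00, 0.0, 3),
    ("Nonimmigrant visas", 0.88, 12.5, 8),
    ("Adjustment of status", 0.75, 0.0, 8),
    ("Employment authorization", 0.75, 12.5, 8),
    ("Asylum", 0.50, 12.5, 8),
    ("Admissibility", 0.38, 0.0, 8),
    ("Family-based immigration", 0.38, 0.0, 8),
    ("Humanitarian", 0.38, 0.0, 8),
    ("Removal", 0.38, 0.0, 8),
    ("General", 0.25, 0.0, 8),
    ("Employment-based (EB)", 0.00, 0.0, 4),
]

total_n = sum(n for *_, n in rows)

# Means are published to 2 decimals, so round mean * N to get whole points.
total_points = sum(round(m * n) for _, m, _, n in rows)
total_fully_correct = sum(round(p / 100 * n) for _, _, p, n in rows)

overall_mean = round(total_points / total_n, 2)
overall_pct = round(100 * total_fully_correct / total_n, 1)

print(total_n, overall_mean, overall_pct)  # 101 0.68 7.9
```

The same check applied to the two baseline tables gives 1.47 and 0.80, matching the overall scores table.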
By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline
| Subdomain | Mean Score | % Fully Correct | N |
|---|---|---|---|
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |
By Subdomain — Llama 3 8B Zero-Shot Baseline
| Subdomain | Mean Score | % Fully Correct | N |
|---|---|---|---|
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |
Key Observations
- The task is genuinely hard: Even Claude Sonnet 4.6 (a frontier model) scores only 1.47/3.0 mean and 25.7% fully-correct. This reflects the highly specific, citation-level precision required by immigration procedural questions.
- Fine-tuning boosts fully-correct rate: The 3B fine-tuned model achieves 7.9% fully correct (8 of 101) vs. 2.0% (2 of 101) for zero-shot Llama 3 8B, roughly a 4x improvement on exact correctness after 1 epoch of domain training, despite the model being about 2.7x smaller.
- Strongest subdomains for fine-tuned model: travel documents (1.83), naturalization (1.13), statistics (1.13). Of these, only naturalization is among the larger categories in the dataset (~2,670 pairs); travel documents (~109) and statistics (~141) are among the smallest, so training-set size alone does not explain the ranking.
- Weakest subdomains: employment-based (0.00) and general (0.25), followed by a four-way tie at 0.38 (admissibility, family-based immigration, humanitarian, removal). These topics often require cross-referencing multiple USCIS form instructions or policy details.
- Room for improvement: The fine-tuned model's mean (0.68) is below that of zero-shot Llama 3 8B (0.80), suggesting either that 1 epoch of training is insufficient or that the model needs more specific instruction tuning rather than completion-style fine-tuning.
Reproducing the Benchmark
```bash
# Clone repo and install deps
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)
```
The benchmark script supports resume — it skips already-scored questions.
random.seed(42) ensures the same 101-question sample is selected each run.
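The two behaviours described above, seeded sampling and resume, can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the field names `id` and `subdomain`, the cap of 8 questions per subdomain (inferred from the N column of the results tables), and the use of a local `random.Random(42)` in place of the global `random.seed(42)` are all assumptions.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def stratified_sample(questions, per_subdomain=8, seed=42):
    """Pick up to `per_subdomain` questions from each subdomain, reproducibly."""
    rng = random.Random(seed)  # local RNG: same seed and input give same sample
    by_subdomain = defaultdict(list)
    for q in questions:
        by_subdomain[q["subdomain"]].append(q)
    sample = []
    for name in sorted(by_subdomain):  # fixed iteration order
        pool = by_subdomain[name]
        # Small subdomains contribute everything they have.
        sample.extend(rng.sample(pool, min(per_subdomain, len(pool))))
    return sample


def load_scored_ids(results_path):
    """Return the IDs already present in the per-question results file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def pending(sample, results_path):
    """Return the sampled questions that still need to be scored."""
    done = load_scored_ids(results_path)
    return [q for q in sample if q["id"] not in done]
```

Because the sample is deterministic, an interrupted run can recompute it, call `pending(sample, "data_local/benchmark/results.jsonl")`, and score only the questions that are still missing.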
Immigration Subdomains Covered
| Subdomain | QA Pairs |
|---|---|
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]

# Render the chat template, then generate greedily.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
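For repeated queries, the message list can be built by a small helper. The sketch below reuses the system prompt from the example above and optionally prepends retrieved passages, which may suit a retrieval-augmented setup. The helper name `build_messages` and the `Context:` / `Question:` layout are hypothetical; nothing in this card indicates the model was trained on that layout, so treat it as a starting point to evaluate.

```python
SYSTEM_PROMPT = (
    "You are an expert on U.S. immigration law. "
    "Answer accurately based on USCIS, 8 CFR, and BIA sources."
)


def build_messages(question, passages=None):
    """Build chat messages; `passages` is an optional list of retrieved snippets."""
    user_content = question.strip()
    if passages:
        # Number each passage so an answer can refer back to it.
        context = "\n\n".join(
            f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
        )
        user_content = f"Context:\n{context}\n\nQuestion: {user_content}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as `messages` is in the example above.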
Data Sources
- USCIS Policy Manual — primary_official
- USCIS Forms & Instructions (I-130, I-485, I-765, N-400, I-589...) — primary_official
- 8 CFR / INA statute text — primary_official
- BIA Precedent Decisions — primary_official
- harshitha008/US-immigration-laws (Apache 2.0) — secondary_reputable
- Law StackExchange immigration posts — community
Intended Use
- RAG-based immigration legal assistants
- Domain-specific LLM benchmarking
- Immigration law Q&A research
Disclaimer
This model is for research and educational purposes only. It does not constitute legal advice. Immigration law is complex and changes frequently — always consult a licensed immigration attorney.