Evaluated on a stratified random sample of 101 questions from the held-out eval
set, covering the 13 immigration subdomains plus a General category. Answers were
scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference answers from
official sources.
Evaluation date: 2026-05-17
Judge model: us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
Eval set source: nshportun/usa-immigration-law-qa, split=eval, seed=42
Fine-tuned model inference: local CPU (transformers 5.8.1, bfloat16, device_map=cpu)
Overall Scores
| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
| --- | --- | --- | --- |
| Llama 3.2 3B fine-tuned (this model) | 0.68 | 7.9% | 101 |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |
Why baselines matter: Claude Sonnet 4.6 is a frontier model far larger than
this 3B model. Llama 3 8B zero-shot reaches only 2.0% fully correct (2 of 101) on
these domain-specific questions, which indicates how difficult the task is. The
fine-tuned 3B model reaches 7.9% fully correct (8 of 101), outperforming the
zero-shot 8B baseline on that metric despite being about 2.7x smaller.
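The percentages and ratios quoted here follow from simple arithmetic. A minimal sketch, assuming fully-correct counts of 8, 26, and 2 out of 101 (these counts are implied by the reported percentages and the per-subdomain tables; the variable names are illustrative only):

```python
# Fully-correct counts implied by the reported percentages (assumption).
N = 101
fully_correct = {
    "Llama 3.2 3B fine-tuned": 8,
    "Claude Sonnet 4.6 zero-shot": 26,
    "Llama 3 8B zero-shot": 2,
}


def pct(count, n=N):
    """Percentage of questions scored 3, rounded to one decimal place."""
    return round(100 * count / n, 1)


rates = {model: pct(count) for model, count in fully_correct.items()}

# Relative gain of the fine-tuned 3B model over the zero-shot 8B baseline.
improvement = (
    fully_correct["Llama 3.2 3B fine-tuned"] / fully_correct["Llama 3 8B zero-shot"]
)

# Nominal parameter-count ratio between the 8B baseline and the 3B model.
size_ratio = 8 / 3

for model, rate in rates.items():
    print(f"{model}: {rate}% fully correct")
print(f"improvement: {improvement:.1f}x, size ratio: {size_ratio:.1f}x")
```

With only 101 questions, each additional fully-correct answer moves the rate by roughly one percentage point, so the 4x figure rests on a difference of six questions.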
By Subdomain — Llama 3.2 3B Fine-tuned (this model)
| Subdomain | Mean Score | % Fully Correct | N |
| --- | --- | --- | --- |
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |
By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline
| Subdomain | Mean Score | % Fully Correct | N |
| --- | --- | --- | --- |
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |
By Subdomain — Llama 3 8B Zero-Shot Baseline
| Subdomain | Mean Score | % Fully Correct | N |
| --- | --- | --- | --- |
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |
Key Observations
The task is genuinely hard: Even Claude Sonnet 4.6 (a frontier model) scores
only 1.47/3.0 mean and 25.7% fully-correct. This reflects the highly specific,
citation-level precision required by immigration procedural questions.
Fine-tuning boosts fully-correct rate: The 3B fine-tuned model achieves 7.9%
fully correct vs. 2.0% for the zero-shot Llama 3 8B baseline, roughly a 4x
improvement on exact correctness (8 vs. 2 of 101 questions), despite being about
2.7x smaller and having only 1 epoch of domain training.
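Strongest subdomains for fine-tuned model: travel documents (1.83), naturalization
(1.13), statistics (1.13). Naturalization is well represented in the QA data
(~2,670 pairs), but the counts listed under Immigration Subdomains Covered show
travel documents (~109) and statistics (~141) among the smallest subdomains, so
data volume alone does not explain these scores.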
Weakest subdomains: employment-based (0.00), general (0.25), and four subdomains
tied at 0.38 (admissibility, family-based immigration, humanitarian, removal).
These topics often require cross-referencing multiple USCIS form instructions or
policy details.
Room for improvement: The fine-tuned model's mean (0.68) is below that of the
zero-shot Llama 3 8B baseline (0.80), suggesting either that 1-epoch training is
insufficient or that the model needs more specific instruction tuning rather than
completion-style fine-tuning.
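Because the sample has only 101 questions, the fully-correct rates carry wide uncertainty. One rough way to gauge it is a Wilson score interval on each rate. This is an illustrative check, not part of the original benchmark; it assumes counts of 8 and 2 fully-correct answers out of 101, and it ignores the fact that both models answered the same questions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 8/101 for the fine-tuned 3B model, 2/101 for zero-shot 8B.
ft_low, ft_high = wilson_interval(8, 101)
base_low, base_high = wilson_interval(2, 101)

# Overlapping intervals mean the gap should be read with caution.
intervals_overlap = ft_low <= base_high and base_low <= ft_high

print(f"fine-tuned 3B: {ft_low:.3f} to {ft_high:.3f}")
print(f"zero-shot 8B:  {base_low:.3f} to {base_high:.3f}")
print(f"intervals overlap: {intervals_overlap}")
```

A paired test on the per-question scores would be a more appropriate comparison, since both models were evaluated on the same sample.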
Reproducing the Benchmark
    # Clone repo and install deps
    git clone https://github.com/nshportun/usa-immigration
    cd usa-immigration
    pip install -r requirements.txt

    # Set environment variables (AWS Bedrock for baseline models + judge)
    export ACCOUNT2_AWS_ACCESS_KEY_ID=...
    export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

    # Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
    python scripts/benchmark/run_benchmark.py

    # Run fine-tuned model inference on CPU (requires model artifacts locally)
    # Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
    python scripts/benchmark/run_local_finetuned.py

    # Results written to:
    #   data_local/benchmark/results.jsonl  (per-question scores)
    #   data_local/benchmark/summary.json   (aggregate table)
The benchmark script supports resume — it skips already-scored questions.
random.seed(42) ensures the same 101-question sample is selected each run.
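A seeded stratified sample along these lines can be sketched as follows. This is not the repository's code: the `subdomain` and `id` field names, the cap of 8 questions per subdomain (inferred from the N column of the results tables), and the use of a local `random.Random(seed)` instead of the global `random.seed(42)` are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(questions, per_stratum=8, seed=42):
    """Pick up to `per_stratum` questions from each subdomain, reproducibly."""
    rng = random.Random(seed)  # local RNG seeded once, so reruns match
    by_subdomain = defaultdict(list)
    for question in questions:
        by_subdomain[question["subdomain"]].append(question)

    sample = []
    # Sorted keys give a fixed iteration order, independent of input order
    # across subdomains.
    for name in sorted(by_subdomain):
        items = by_subdomain[name]
        # Small subdomains contribute everything they have.
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Toy eval set (hypothetical): one large subdomain and two small ones.
questions = [
    {"id": f"{name}-{i}", "subdomain": name}
    for name, count in [("Asylum", 20), ("Appeals", 3), ("Travel documents", 6)]
    for i in range(count)
]

first_run = stratified_sample(questions)
second_run = stratified_sample(questions)
per_subdomain = Counter(q["subdomain"] for q in first_run)

print(per_subdomain)
print("identical across runs:", first_run == second_run)
```

Fixing both the seed and the iteration order is what makes the selection repeatable; changing the order of questions within a subdomain would still change which ones are drawn.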
Immigration Subdomains Covered
| Subdomain | QA Pairs |
| --- | --- |
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |
Usage
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    model_id = "nshportun/usa-immigration-llama-3.2-3b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, dtype=torch.bfloat16, device_map="auto"
    )

    messages = [
        {
            "role": "system",
            "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources.",
        },
        {"role": "user", "content": "What is the filing fee for Form I-485?"},
    ]

    # Render the chat template to text, then tokenize and move to the model's device.
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # Greedy decoding; print only the newly generated tokens.
    out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
This model is for research and educational purposes only.
It does not constitute legal advice. Immigration law is complex and
changes frequently — always consult a licensed immigration attorney.