Model: nshportun/usa-immigration-llama-3.2-3b
Source: Original Platform

2026-05-24 09:25:17 +08:00


USA Immigration Law — Llama 3.2 3B

Fine-tuned from meta-llama/Llama-3.2-3B-Instruct on the nshportun/usa-immigration-law-qa dataset — 17,058 source-grounded Q&A pairs covering all major U.S. immigration subdomains.

Training Details

Setting	Value
Base model	Llama 3.2 3B Instruct
Method	LoRA (r=8, alpha=32, merged into base weights)
Training pairs	16,065
Eval pairs	993 (stratified across 13 subdomains)
Epochs	1
Batch size	1 per device (int8 quantization)
Learning rate	1e-4
Max input length	512 tokens
Infrastructure	AWS SageMaker ml.g5.2xlarge (24GB VRAM)
Train loss	0.894
Eval loss	0.903
Eval perplexity	2.47
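
As a sanity check on the table above, the reported perplexity matches the eval loss if perplexity is defined as the exponential of the mean cross-entropy loss. That definition is the usual convention but is an assumption here, since the card does not state how perplexity was computed. The snippet also confirms that the train and eval pair counts add up to the 17,058 pairs in the dataset.

```python
import math

# Values copied from the Training Details table.
eval_loss = 0.903
train_pairs = 16_065
eval_pairs = 993

# Assumption: perplexity = exp(mean cross-entropy loss, in nats).
perplexity = math.exp(eval_loss)
total_pairs = train_pairs + eval_pairs

print(f"perplexity ~ {perplexity:.2f}")   # perplexity ~ 2.47
print(f"total pairs = {total_pairs:,}")   # total pairs = 17,058
```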

Benchmark Results

Evaluated on a stratified random sample of 101 questions from the held-out eval set, spanning the 13 immigration subdomains plus a General category (14 rows in the tables below). Answers were scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference answers from official sources.

Scoring scale: 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct

Evaluation date: 2026-05-17
Judge model: us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
Eval set source: nshportun/usa-immigration-law-qa, split=eval, seed=42
Fine-tuned model inference: local CPU (transformers 5.8.1, bfloat16, device_map=cpu)
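
The two summary metrics used in the tables below follow directly from the 0–3 scoring scale. The sketch here shows one way to compute them; the function name `summarize` and the example scores are hypothetical and are not taken from the benchmark scripts or results.

```python
def summarize(scores):
    """Return (mean score, percent fully correct) for a list of 0-3 judge scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("each score must be an integer from 0 to 3")
    mean_score = sum(scores) / len(scores)
    # "Fully correct" means the judge assigned the top score of 3.
    pct_fully_correct = 100 * sum(1 for s in scores if s == 3) / len(scores)
    return mean_score, pct_fully_correct


# Hypothetical scores for eight questions (not benchmark data).
example_scores = [3, 0, 1, 2, 0, 0, 3, 1]
mean_score, pct_full = summarize(example_scores)
print(f"mean={mean_score:.2f}, fully correct={pct_full:.1f}%")  # mean=1.25, fully correct=25.0%
```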

Overall Scores

Model	Mean Score (0–3)	% Fully Correct (score=3)	N
Llama 3.2 3B fine-tuned (this model)	0.68	7.9%	101
Claude Sonnet 4.6 zero-shot	1.47	25.7%	101
Llama 3 8B zero-shot (base family)	0.80	2.0%	101

Why baselines matter: Claude Sonnet 4.6 is a frontier model far larger than this 3B model, so it is best read as an upper reference point rather than a peer. Llama 3 8B zero-shot reaches only 2.0% fully correct on these domain-specific questions, which shows how hard the task is. The fine-tuned 3B model reaches 7.9% fully correct, outperforming the zero-shot 8B baseline on that metric despite being roughly 2.7x smaller, although its mean score (0.68) remains below the 8B baseline's (0.80).

By Subdomain — Llama 3.2 3B Fine-tuned (this model)

Subdomain	Mean Score	% Fully Correct	N
Travel documents	1.83	33.3%	6
Naturalization	1.13	25.0%	8
Statistics	1.13	12.5%	8
Appeals	1.00	0.0%	3
Nonimmigrant visas	0.88	12.5%	8
Adjustment of status	0.75	0.0%	8
Employment authorization	0.75	12.5%	8
Asylum	0.50	12.5%	8
Admissibility	0.38	0.0%	8
Family-based immigration	0.38	0.0%	8
Humanitarian	0.38	0.0%	8
Removal	0.38	0.0%	8
General	0.25	0.0%	8
Employment-based (EB)	0.00	0.0%	4
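
The per-subdomain rows can be reconciled with the overall scores by recovering integer totals from the rounded means. This is a consistency check on the published numbers, not part of the benchmark scripts. It relies on the fact that judge scores are integers, so each subdomain's point total must be a whole number.

```python
# (subdomain, mean score, % fully correct, N) copied from the table above.
rows = [
    ("Travel documents", 1.83, 33.3, 6),
    ("Naturalization", 1.13, 25.0, 8),
    ("Statistics", 1.13, 12.5, 8),
    ("Appeals", 1.00, 0.0, 3),
    ("Nonimmigrant visas", 0.88, 12.5, 8),
    ("Adjustment of status", 0.75, 0.0, 8),
    ("Employment authorization", 0.75, 12.5, 8),
    ("Asylum", 0.50, 12.5, 8),
    ("Admissibility", 0.38, 0.0, 8),
    ("Family-based immigration", 0.38, 0.0, 8),
    ("Humanitarian", 0.38, 0.0, 8),
    ("Removal", 0.38, 0.0, 8),
    ("General", 0.25, 0.0, 8),
    ("Employment-based (EB)", 0.00, 0.0, 4),
]

# Rounding mean * N undoes the two-decimal rounding of the published means.
total_n = sum(n for _, _, _, n in rows)
total_points = sum(round(mean * n) for _, mean, _, n in rows)
total_full = sum(round(pct / 100 * n) for _, _, pct, n in rows)

overall_mean = total_points / total_n
overall_pct_full = 100 * total_full / total_n
print(total_n, total_points, total_full)                # 101 69 8
print(f"{overall_mean:.2f}, {overall_pct_full:.1f}%")   # 0.68, 7.9%
```

The recovered totals (69 points and 8 fully correct answers over 101 questions) agree with the 0.68 mean and 7.9% reported in the Overall Scores table.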

By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline

Subdomain	Mean Score	% Fully Correct	N
Travel documents	2.33	33.3%	6
Adjustment of status	2.25	62.5%	8
Humanitarian	2.13	50.0%	8
Asylum	2.00	50.0%	8
Admissibility	1.50	25.0%	8
Naturalization	1.50	25.0%	8
Nonimmigrant visas	1.50	25.0%	8
Removal	1.25	12.5%	8
Statistics	1.25	12.5%	8
Family-based immigration	1.13	12.5%	8
Appeals	1.00	0.0%	3
Employment authorization	0.75	12.5%	8
Employment-based (EB)	0.75	25.0%	4
General	0.75	0.0%	8

By Subdomain — Llama 3 8B Zero-Shot Baseline

Subdomain	Mean Score	% Fully Correct	N
Adjustment of status	1.25	0.0%	8
Travel documents	1.17	0.0%	6
Asylum	1.13	12.5%	8
Removal	0.88	0.0%	8
Statistics	0.88	0.0%	8
Humanitarian	0.75	12.5%	8
Naturalization	0.75	0.0%	8
Admissibility	0.75	0.0%	8
Nonimmigrant visas	0.75	0.0%	8
Employment authorization	0.63	0.0%	8
General	0.63	0.0%	8
Employment-based (EB)	0.50	0.0%	4
Family-based immigration	0.50	0.0%	8
Appeals	0.33	0.0%	3

Key Observations

The task is genuinely hard: Even Claude Sonnet 4.6 (a frontier model) scores only 1.47/3.0 mean and 25.7% fully-correct. This reflects the highly specific, citation-level precision required by immigration procedural questions.
Fine-tuning boosts fully-correct rate: The fine-tuned 3B model achieves 7.9% fully correct (8 of 101) vs. 2.0% (2 of 101) for the zero-shot Llama 3 8B baseline, roughly a 4x improvement on exact correctness from a model about 2.7x smaller, after 1 epoch of domain training. The absolute counts are small, so the ratio should be read with caution.
Strongest subdomains for fine-tuned model: travel documents (1.83), naturalization (1.13), statistics (1.13). Naturalization is well represented in the training data (~2,670 pairs), but travel documents (~109) and statistics (~141) are among the smallest subdomains, so training volume alone does not explain these scores.
Weakest subdomains: employment-based (0.00), general (0.25), and four subdomains tied at 0.38 (admissibility, family-based immigration, humanitarian, removal). These topics tend to require cross-referencing multiple USCIS form instructions or policy details.
Room for improvement: The fine-tuned model's mean (0.68) is below the zero-shot Llama 3 8B baseline (0.80), suggesting that either 1 epoch of training is insufficient or the model needs more targeted instruction tuning rather than completion-style fine-tuning.
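
The ratios quoted in these observations can be reproduced with a few lines of arithmetic. The counts of 8 and 2 fully correct answers are derived from the published percentages over 101 questions, and the size ratio uses the nominal parameter counts in the model names (3B and 8B), which is an approximation.

```python
n_questions = 101

# Fully correct answers implied by the Overall Scores table.
fine_tuned_full = round(0.079 * n_questions)    # 7.9% of 101
baseline_8b_full = round(0.020 * n_questions)   # 2.0% of 101

# Ratio of fully-correct counts (the "4x" claim).
rate_ratio = fine_tuned_full / baseline_8b_full

# Nominal size ratio from the model names (the "2.7x smaller" claim).
size_ratio = 8 / 3

# Gap in mean score, where the 8B baseline is ahead.
mean_gap = round(0.80 - 0.68, 2)

print(fine_tuned_full, baseline_8b_full)   # 8 2
print(f"{rate_ratio:.1f}x fully correct, {size_ratio:.1f}x size, mean gap {mean_gap}")
```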

Reproducing the Benchmark

# Clone repo and install deps
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)

The benchmark script supports resume — it skips already-scored questions. random.seed(42) ensures the same 101-question sample is selected each run.
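
The seeded sampling and resume behaviour described above could be implemented along the following lines. This is a minimal sketch, not the code in `run_benchmark.py`: the function names, the `id` and `subdomain` field names, and the cap of 8 questions per subdomain are all assumptions made for illustration. It also uses a private `random.Random(seed)` instance instead of the global `random.seed(42)` mentioned in the card, so that other uses of the `random` module cannot shift the sample.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def stratified_sample(pool, per_group=8, seed=42):
    """Pick up to per_group questions from each subdomain, reproducibly."""
    rng = random.Random(seed)  # private RNG, independent of global state
    groups = defaultdict(list)
    for question in pool:
        groups[question["subdomain"]].append(question)
    sample = []
    for name in sorted(groups):  # fixed iteration order keeps draws stable
        items = groups[name]
        sample.extend(rng.sample(items, min(per_group, len(items))))
    return sample


def already_scored_ids(results_path):
    """Read the ids of questions that already have a line in the JSONL results file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def pending_questions(sample, results_path):
    """Return the sampled questions that still need scoring (resume support)."""
    done = already_scored_ids(results_path)
    return [q for q in sample if q["id"] not in done]
```

With this shape, an interrupted run can be restarted safely: the same seed reproduces the same sample, and only questions missing from the results file are scored again.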

Immigration Subdomains Covered

Subdomain	QA Pairs
Family-based immigration	~3,987
Naturalization	~2,670
Asylum	~2,094
Adjustment of status	~1,727
Removal	~1,277
Humanitarian	~894
Employment authorization	~832
Admissibility	~553
Nonimmigrant visas	~548
Statistics	~141
Travel documents	~109
Employment-based (EB)	~74
Appeals	~66
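
The counts in this table are approximate and cover 13 subdomains. Summing them and comparing against the 17,058 total stated at the top of the card shows how many pairs are not attributed to a listed subdomain. The remainder may correspond to the General category that appears in the benchmark tables, but the card does not say so, and that reading is an assumption.

```python
dataset_total = 17_058  # total Q&A pairs stated at the top of the card

# Approximate per-subdomain counts copied from the table above.
approx_counts = {
    "Family-based immigration": 3987,
    "Naturalization": 2670,
    "Asylum": 2094,
    "Adjustment of status": 1727,
    "Removal": 1277,
    "Humanitarian": 894,
    "Employment authorization": 832,
    "Admissibility": 553,
    "Nonimmigrant visas": 548,
    "Statistics": 141,
    "Travel documents": 109,
    "Employment-based (EB)": 74,
    "Appeals": 66,
}

listed_total = sum(approx_counts.values())
unlisted = dataset_total - listed_total
shares = {name: 100 * n / dataset_total for name, n in approx_counts.items()}

print(listed_total, unlisted)  # 14972 2086
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s}{share:5.1f}%")
```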

Usage

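from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# `dtype` is the argument name in recent transformers releases; older 4.x releases use `torch_dtype`.
# device_map="auto" places the weights on a GPU when one is available.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]
# Render the conversation with the model's chat template and append the assistant header.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The Llama 3 chat template normally emits the begin-of-text token already, so avoid adding it twice.
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
# Greedy decoding keeps the answer reproducible.
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))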

Data Sources

USCIS Policy Manual — primary_official
USCIS Forms & Instructions (I-130, I-485, I-765, N-400, I-589...) — primary_official
8 CFR / INA statute text — primary_official
BIA Precedent Decisions — primary_official
harshitha008/US-immigration-laws (Apache 2.0) — secondary_reputable
Law StackExchange immigration posts — community

Intended Use

RAG-based immigration legal assistants
Domain-specific LLM benchmarking
Immigration law Q&A research

Disclaimer

This model is for research and educational purposes only. It does not constitute legal advice. Immigration law is complex and changes frequently — always consult a licensed immigration attorney.
