---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- legal
- immigration
- fine-tuned
- llama
- united-states
- lora
datasets:
- nshportun/usa-immigration-law-qa
pipeline_tag: text-generation
---

# USA Immigration Law: Llama 3.2 3B

Fine-tuned from [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on the [nshportun/usa-immigration-law-qa](https://huggingface.co/datasets/nshportun/usa-immigration-law-qa) dataset: **17,058 source-grounded Q&A pairs** covering all major U.S. immigration subdomains.

## Training Details

| Setting | Value |
|---------|-------|
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | **2.47** |

## Benchmark Results

Evaluated on a stratified random sample of **101 questions** across all 13 immigration subdomains from the held-out eval set. Answers were scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference answers from official sources.

**Scoring scale:** 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct

- **Evaluation date:** 2026-05-17
- **Judge model:** us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
- **Eval set source:** nshportun/usa-immigration-law-qa, split=eval, seed=42
- **Fine-tuned model inference:** local CPU (transformers 5.8.1, bfloat16, device_map=cpu)

### Overall Scores

| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
|-------|-----------------|--------------------------|---|
| **Llama 3.2 3B fine-tuned (this model)** | **0.68** | **7.9%** | **101** |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |

**Why baselines matter:** Claude Sonnet 4.6 is a frontier model far larger than this 3B model. Llama 3 8B zero-shot achieves only 2.0% fully correct on these domain-specific questions, which indicates how difficult the task is. The fine-tuned 3B model achieves 7.9% fully correct, outperforming the zero-shot 8B baseline on that metric despite being about 2.7x smaller, although its mean score (0.68) is lower than the 8B baseline's (0.80).

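The two metric columns in the table above can be derived from per-question judge scores. The following is a minimal sketch of that arithmetic, not the repository's benchmark code: the function name `summarize` is hypothetical, and the counts 8 and 2 are back-computed from the reported 7.9% and 2.0% at N = 101 rather than taken from the results file.

```python
from statistics import mean


def summarize(scores):
    """Return (mean score, percent fully correct) for a list of 0-3 judge scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    fully_correct = sum(1 for s in scores if s == 3)
    return mean(scores), round(100 * fully_correct / len(scores), 1)


# Assumption: with N = 101, 8 and 2 fully correct answers are the integer
# counts that reproduce the reported 7.9% and 2.0%.
finetuned_pct = round(100 * 8 / 101, 1)  # 7.9
baseline_pct = round(100 * 2 / 101, 1)   # 2.0

# Ratio behind the "about 4x" comparison on the fully-correct metric.
improvement = finetuned_pct / baseline_pct

# Size ratio behind the "2.7x smaller" comparison (8B vs. 3B parameters).
size_ratio = 8 / 3
```

Calling `summarize` on the `score` values for one model in `results.jsonl` should give that model's row, assuming the file stores one 0–3 score per question.
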
### By Subdomain: Llama 3.2 3B Fine-tuned (this model)

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |

### By Subdomain: Claude Sonnet 4.6 Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |

### By Subdomain: Llama 3 8B Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |

### Key Observations

- **The task is genuinely hard:** Even Claude Sonnet 4.6 (a frontier model) scores only 1.47/3.0 mean and 25.7% fully correct. This reflects the citation-level precision that immigration procedural questions require.
- **Fine-tuning boosts the fully-correct rate:** The 3B fine-tuned model achieves 7.9% fully correct vs. 2.0% for the zero-shot 8B baseline (8 vs. 2 of the 101 questions), roughly a 4x improvement on exact correctness despite being about 2.7x smaller and trained for only 1 epoch.
- **Strongest subdomains for the fine-tuned model:** travel documents (1.83), naturalization (1.13), statistics (1.13). Naturalization is well represented in the training data (~2,670 pairs), whereas travel documents (~109) and statistics (~141) are among the smallest subdomains, so training-data volume alone does not explain the ranking.
- **Weakest subdomains:** employment-based (0.00), general (0.25), and removal (0.38, tied with admissibility, family-based immigration, and humanitarian). These topics require cross-referencing multiple USCIS form instructions or policy details.
- **Room for improvement:** The fine-tuned model's mean (0.68) is below the zero-shot 8B baseline (0.80), suggesting either that 1-epoch training is insufficient or that the model needs more specific instruction tuning rather than completion-style fine-tuning.

### Reproducing the Benchmark

```bash
# Clone the repo and install dependencies
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)
```

The benchmark script supports resume: it skips already-scored questions.

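The resume behaviour described above could be implemented along the following lines. This is a hedged sketch, not the code in `run_benchmark.py`: the helper names and the `question_id` and `score` field names are assumptions about how `results.jsonl` is laid out.

```python
import json
from pathlib import Path


def load_scored_ids(results_path):
    """Collect IDs of questions that already have a score in a JSONL results file."""
    path = Path(results_path)
    scored = set()
    if not path.exists():
        # First run: nothing has been scored yet.
        return scored
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # "question_id" and "score" are assumed field names.
            # A score of 0 is valid, so compare against None explicitly.
            if record.get("score") is not None:
                scored.add(record["question_id"])
    return scored


def pending_questions(questions, results_path):
    """Return only the questions that still need to be scored, in input order."""
    done = load_scored_ids(results_path)
    return [q for q in questions if q["question_id"] not in done]
```

Appending one JSON line per scored question keeps an interrupted run cheap to restart, because only the unscored remainder is sent to the judge again.
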
`random.seed(42)` ensures the same 101-question sample is selected each run.

## Immigration Subdomains Covered

| Subdomain | QA Pairs |
|-----------|----------|
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# `dtype` is the current argument name; older transformers releases use `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]

# Render the chat template to a prompt string that ends with the assistant header.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The chat template already inserts the BOS token, so do not add it a second time.
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)

# Greedy decoding for reproducible answers.
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Data Sources

- **USCIS Policy Manual**: primary_official
- **USCIS Forms & Instructions** (I-130, I-485, I-765, N-400, I-589...): primary_official
- **8 CFR / INA statute text**: primary_official
- **BIA Precedent Decisions**: primary_official
- **harshitha008/US-immigration-laws** (Apache 2.0): secondary_reputable
- **Law StackExchange immigration posts**: community

## Intended Use

- RAG-based immigration legal assistants
- Domain-specific LLM benchmarking
- Immigration law Q&A research

## Disclaimer

This model is for **research and educational purposes only**. It does not constitute legal advice. Immigration law is complex and changes frequently, so always consult a licensed immigration attorney.