ModelHub XC 0b17fb2b41 Initialize project; model provided by the ModelHub XC community
Model: nshportun/usa-immigration-llama-3.2-3b
Source: Original Platform
2026-05-24 09:25:17 +08:00

---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- legal
- immigration
- fine-tuned
- llama
- united-states
- lora
datasets:
- nshportun/usa-immigration-law-qa
pipeline_tag: text-generation
---
# USA Immigration Law — Llama 3.2 3B
Fine-tuned from [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
on the [nshportun/usa-immigration-law-qa](https://huggingface.co/datasets/nshportun/usa-immigration-law-qa)
dataset — **17,058 source-grounded Q&A pairs** covering all major U.S. immigration subdomains.
## Training Details
| Setting | Value |
|---------|-------|
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | **2.47** |
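
The eval perplexity follows directly from the eval loss. A minimal check, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training, but an assumption here):

```python
import math

eval_loss = 0.903  # eval loss from the table above, assumed to be in nats per token
perplexity = math.exp(eval_loss)  # perplexity = e^(mean cross-entropy)

print(f"{perplexity:.2f}")  # 2.47
```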
## Benchmark Results
Evaluated on a stratified random sample of **101 questions** from the held-out eval set,
spanning the 14 categories listed in the tables below (the 13 immigration subdomains plus a General category).
Answers were scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference answers from official sources.
**Scoring scale:** 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct
**Evaluation date:** 2026-05-17
**Judge model:** us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
**Eval set source:** nshportun/usa-immigration-law-qa, split=eval, seed=42
**Fine-tuned model inference:** local CPU (transformers 5.8.1, bfloat16, device_map=cpu)
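
The two headline metrics can be derived from raw judge replies roughly as follows. This is a sketch only: the reply strings, `parse_judge_score`, and `summarize` are hypothetical and do not reflect the actual judge prompt or parsing used in the benchmark scripts.

```python
import re
from statistics import mean


def parse_judge_score(reply: str) -> int:
    """Return the first standalone digit 0-3 found in a judge reply (hypothetical format)."""
    match = re.search(r"\b([0-3])\b", reply)
    if match is None:
        raise ValueError(f"no 0-3 score found in: {reply!r}")
    return int(match.group(1))


def summarize(scores: list[int]) -> dict:
    """Compute the mean score and the share of answers scored 3 (fully correct)."""
    return {
        "n": len(scores),
        "mean_score": round(mean(scores), 2),
        "pct_fully_correct": round(100 * sum(s == 3 for s in scores) / len(scores), 1),
    }


# Made-up replies for illustration only
replies = ["Score: 3", "2 - mostly correct", "Score: 0 (hallucinated)", "1"]
scores = [parse_judge_score(r) for r in replies]
print(summarize(scores))
```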
### Overall Scores
| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
|-------|-----------------|--------------------------|---|
| **Llama 3.2 3B fine-tuned (this model)** | **0.68** | **7.9%** | **101** |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |
**Why baselines matter:** Claude Sonnet 4.6 is a frontier model far larger than
this 3B model. Llama 3 8B zero-shot reaches only 2.0% fully correct (2 of 101) on these
domain-specific questions, which shows how difficult the task is. The fine-tuned
3B model reaches 7.9% fully correct (8 of 101), outperforming the zero-shot 8B baseline on
that metric with roughly 2.7x fewer parameters. On mean score, however, it trails the 8B baseline (0.68 vs. 0.80).
### By Subdomain — Llama 3.2 3B Fine-tuned (this model)
| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |
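
As a consistency check, the overall row for this model can be recovered from the subdomain rows above. Because judge scores are integers, each subdomain's score total is obtained by rounding `mean * N`. The sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# (subdomain, mean score, % fully correct, N) copied from the table above
rows = [
    ("Travel documents", 1.83, 33.3, 6),
    ("Naturalization", 1.13, 25.0, 8),
    ("Statistics", 1.13, 12.5, 8),
    ("Appeals", 1.00, 0.0, 3),
    ("Nonimmigrant visas", 0.88, 12.5, 8),
    ("Adjustment of status", 0.75, 0.0, 8),
    ("Employment authorization", 0.75, 12.5, 8),
    ("Asylum", 0.50, 12.5, 8),
    ("Admissibility", 0.38, 0.0, 8),
    ("Family-based immigration", 0.38, 0.0, 8),
    ("Humanitarian", 0.38, 0.0, 8),
    ("Removal", 0.38, 0.0, 8),
    ("General", 0.25, 0.0, 8),
    ("Employment-based (EB)", 0.00, 0.0, 4),
]

total_n = sum(n for *_, n in rows)
# Undo the two-decimal rounding of each mean by snapping mean * N to an integer
total_score = sum(round(m * n) for _, m, _, n in rows)
total_full = sum(round(p / 100 * n) for _, _, p, n in rows)

print(total_n)                                  # 101
print(round(total_score / total_n, 2))          # 0.68
print(round(100 * total_full / total_n, 1))     # 7.9
```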
### By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline
| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |
### By Subdomain — Llama 3 8B Zero-Shot Baseline
| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |
### Key Observations
- **The task is genuinely hard:** Even Claude Sonnet 4.6 (a frontier model) scores
only 1.47/3.0 mean and 25.7% fully-correct. This reflects the highly specific,
citation-level precision required by immigration procedural questions.
- **Fine-tuning boosts fully-correct rate:** The 3B fine-tuned model achieves 7.9%
fully correct (8 of 101) vs. 2.0% (2 of 101) for the zero-shot Llama 3 8B baseline, roughly a 4x improvement on exact
correctness with about 2.7x fewer parameters and 1 epoch of domain training.
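- **Strongest subdomains for fine-tuned model:** travel documents (1.83), naturalization
(1.13), statistics (1.13). Naturalization is well represented in the training data (~2,670 pairs),
but travel documents (~109) and statistics (~141) are not, so data volume alone does not explain these scores.
- **Weakest subdomains:** employment-based (0.00), general (0.25), and four subdomains tied at 0.38
(admissibility, family-based immigration, humanitarian, removal). These are
topics requiring cross-referencing multiple USCIS form instructions or policy details.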
- **Room for improvement:** The fine-tuned model's mean (0.68) is below the zero-shot
Llama 3 8B baseline (0.80), suggesting that either 1 epoch of training is insufficient or the model needs
more specific instruction tuning rather than completion-style fine-tuning.
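
The gap between the fully-correct rate and the mean score is easier to see per subdomain. The sketch below compares the fine-tuned model against the Llama 3 8B zero-shot baseline using the mean scores from the two tables above; the dictionary names are illustrative.

```python
# Mean scores per subdomain, copied from the tables above
fine_tuned = {
    "Travel documents": 1.83, "Naturalization": 1.13, "Statistics": 1.13,
    "Appeals": 1.00, "Nonimmigrant visas": 0.88, "Adjustment of status": 0.75,
    "Employment authorization": 0.75, "Asylum": 0.50, "Admissibility": 0.38,
    "Family-based immigration": 0.38, "Humanitarian": 0.38, "Removal": 0.38,
    "General": 0.25, "Employment-based (EB)": 0.00,
}
llama3_8b = {
    "Adjustment of status": 1.25, "Travel documents": 1.17, "Asylum": 1.13,
    "Removal": 0.88, "Statistics": 0.88, "Humanitarian": 0.75,
    "Naturalization": 0.75, "Admissibility": 0.75, "Nonimmigrant visas": 0.75,
    "Employment authorization": 0.63, "General": 0.63,
    "Employment-based (EB)": 0.50, "Family-based immigration": 0.50, "Appeals": 0.33,
}

# Positive delta: the fine-tuned 3B model beats the 8B zero-shot baseline
deltas = {name: round(fine_tuned[name] - llama3_8b[name], 2) for name in fine_tuned}
wins = [name for name, d in deltas.items() if d > 0]
losses = [name for name, d in deltas.items() if d < 0]

for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {d:+.2f}")
print(f"{len(wins)} subdomains higher, {len(losses)} lower")
```

On these figures the fine-tuned model scores higher in 6 subdomains and lower in 8, which is consistent with its lower overall mean.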
### Reproducing the Benchmark
```bash
# Clone repo and install deps
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt
# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...
# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py
# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py
# Results written to:
# data_local/benchmark/results.jsonl (per-question scores)
# data_local/benchmark/summary.json (aggregate table)
```
The benchmark script supports resume — it skips already-scored questions.
`random.seed(42)` ensures the same 101-question sample is selected each run.
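
The seeded sampling and resume behaviour can be sketched as below. This is not the code in `run_benchmark.py`. The `id` and `subdomain` field names, the cap of 8 questions per subdomain (inferred from the N column of the tables), and the helper names are all assumptions. The sketch also uses a local `random.Random(42)` instead of the global `random.seed(42)`, so that unrelated random calls cannot shift the sample.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def stratified_sample(questions, per_subdomain=8, seed=42):
    """Pick up to `per_subdomain` questions from each subdomain, reproducibly."""
    rng = random.Random(seed)
    by_subdomain = defaultdict(list)
    for q in questions:
        by_subdomain[q["subdomain"]].append(q)
    sample = []
    for name in sorted(by_subdomain):  # fixed iteration order keeps the draw stable
        pool = by_subdomain[name]
        sample.extend(rng.sample(pool, min(per_subdomain, len(pool))))
    return sample


def load_scored_ids(results_path):
    """Read the ids already present in a results.jsonl file (empty set if missing)."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}


def pending(sample, results_path):
    """Return only the sampled questions that have not been scored yet."""
    done = load_scored_ids(results_path)
    return [q for q in sample if q["id"] not in done]
```

Calling `pending(stratified_sample(eval_questions), "data_local/benchmark/results.jsonl")` on a rerun would then return only the unscored part of the same sample, where `eval_questions` stands for the loaded eval split.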
## Immigration Subdomains Covered
| Subdomain | QA Pairs |
|-----------|----------|
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# `dtype` is the argument name in recent transformers releases; older ones use `torch_dtype`
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]
# Render the chat template to text (it already includes the begin-of-text token)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# add_special_tokens=False avoids prepending a second begin-of-text token
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Data Sources
- **USCIS Policy Manual** — primary_official
- **USCIS Forms & Instructions** (I-130, I-485, I-765, N-400, I-589...) — primary_official
- **8 CFR / INA statute text** — primary_official
- **BIA Precedent Decisions** — primary_official
- **harshitha008/US-immigration-laws** (Apache 2.0) — secondary_reputable
- **Law StackExchange immigration posts** — community
## Intended Use
- RAG-based immigration legal assistants
- Domain-specific LLM benchmarking
- Immigration law Q&A research
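
For the RAG use case, retrieved passages can be placed in the system message before generation. The sketch below only builds the `messages` list, and the retriever itself is out of scope. `build_rag_messages`, the passage fields `source` and `text`, and the instruction wording are hypothetical. Keeping the number of passages small is a deliberate choice here, since the model was trained with a 512-token maximum input length.

```python
SYSTEM_PROMPT = (
    "You are an expert on U.S. immigration law. "
    "Answer accurately based on USCIS, 8 CFR, and BIA sources."
)


def build_rag_messages(question, passages, max_passages=3):
    """Build chat messages with numbered retrieved passages in the system prompt."""
    context = "\n\n".join(
        f"[{i}] {p['source']}: {p['text']}"
        for i, p in enumerate(passages[:max_passages], start=1)
    )
    system = (
        f"{SYSTEM_PROMPT}\n\n"
        "Use only the numbered passages below and cite them by number.\n\n"
        f"{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


# Placeholder passages; a real retriever would supply source names and text
passages = [{"source": "<source name>", "text": "<retrieved passage text>"}] * 5
messages = build_rag_messages("What is the filing fee for Form I-485?", passages)
print(messages[0]["content"])
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in the same way as in the Usage section.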
## Disclaimer
This model is for **research and educational purposes only**.
It does not constitute legal advice. Immigration law is complex and
changes frequently — always consult a licensed immigration attorney.