Initialize project; model provided by the ModelHub XC community
Model: nshportun/usa-immigration-llama-3.2-3b Source: Original Platform
225
README.md
Normal file
@@ -0,0 +1,225 @@
---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- legal
- immigration
- fine-tuned
- llama
- united-states
- lora
datasets:
- nshportun/usa-immigration-law-qa
pipeline_tag: text-generation
---

# USA Immigration Law — Llama 3.2 3B

Fine-tuned from [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
on the [nshportun/usa-immigration-law-qa](https://huggingface.co/datasets/nshportun/usa-immigration-law-qa)
dataset: **17,058 source-grounded Q&A pairs** covering all major U.S. immigration subdomains.

## Training Details

| Setting | Value |
|---------|-------|
| Base model | Llama 3.2 3B Instruct |
| Method | LoRA (r=8, alpha=32, merged into base weights) |
| Training pairs | 16,065 |
| Eval pairs | 993 (stratified across 13 subdomains) |
| Epochs | 1 |
| Batch size | 1 per device (int8 quantization) |
| Learning rate | 1e-4 |
| Max input length | 512 tokens |
| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
| Train loss | 0.894 |
| Eval loss | 0.903 |
| Eval perplexity | **2.47** |

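The loss and perplexity rows are linked by the usual definition, perplexity = exp(mean cross-entropy loss), and the two pair counts should add up to the 17,058 pairs quoted in the introduction. The following is a minimal consistency check, assuming the reported eval loss is a mean token-level cross-entropy in nats (the table does not say so explicitly) and that the adapter used the standard LoRA scaling factor alpha / r:

```python
import math

# Values copied from the Training Details table.
train_pairs, eval_pairs = 16_065, 993
eval_loss = 0.903
lora_r, lora_alpha = 8, 32

# Assumption: eval loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexity = math.exp(eval_loss)

# Assumption: standard LoRA scaling, where the low-rank update is
# multiplied by alpha / r before being added to the base weights.
lora_scaling = lora_alpha / lora_r

total_pairs = train_pairs + eval_pairs

print(f"total pairs: {total_pairs}")
print(f"perplexity:  {perplexity:.2f}")
print(f"LoRA scale:  {lora_scaling}")
```

Under these assumptions the recomputed perplexity matches the reported 2.47 and the split sizes match the dataset total.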
## Benchmark Results

Evaluated on a stratified random sample of **101 questions** from the held-out eval set,
covering all 13 immigration subdomains plus a General category (the 14 rows in the
per-subdomain tables). Answers were scored 0–3 by an LLM judge (Claude Sonnet 4.6)
against reference answers from official sources.

- **Scoring scale:** 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct
- **Evaluation date:** 2026-05-17
- **Judge model:** us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)
- **Eval set source:** nshportun/usa-immigration-law-qa, split=eval, seed=42
- **Fine-tuned model inference:** local CPU (transformers 5.8.1, bfloat16, device_map=cpu)

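The two headline metrics reported for each model (mean score and percentage fully correct) follow directly from the per-question judge scores. The sketch below shows one way to aggregate them; the function name `summarize` and the example scores are hypothetical and are not taken from the benchmark scripts.

```python
from statistics import mean


def summarize(scores):
    """Aggregate 0-3 judge scores into the two headline metrics."""
    if not scores:
        raise ValueError("no scores to summarize")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("scores must be integers in 0..3")
    fully_correct = sum(1 for s in scores if s == 3)
    return {
        "n": len(scores),
        "mean_score": round(mean(scores), 2),
        "pct_fully_correct": round(100 * fully_correct / len(scores), 1),
    }


# Hypothetical scores for illustration only, not benchmark data.
example_scores = [3, 0, 1, 2, 0, 1, 0, 3]
summary = summarize(example_scores)
print(summary)
```

Only a score of 3 counts toward the fully-correct rate, so a model can have a higher mean score and still a lower fully-correct percentage.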
### Overall Scores

| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
|-------|-----------------|--------------------------|---|
| **Llama 3.2 3B fine-tuned (this model)** | **0.68** | **7.9%** | **101** |
| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |

**Why baselines matter:** Claude Sonnet 4.6 is a frontier model far larger than
this 3B model. Llama 3 8B zero-shot achieves only 2.0% fully correct on these
domain-specific questions, which establishes the difficulty of the task. The fine-tuned
3B model achieves 7.9% fully correct, outperforming the zero-shot 8B baseline on
that metric despite being about 2.7x smaller.

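The ratios quoted in this comparison can be recomputed from the table. The short check below uses the nominal parameter counts (3B and 8B) implied by the model names, which is an assumption about exact model size, and converts the percentages back into question counts for N = 101.

```python
# Figures copied from the Overall Scores table.
finetuned_pct_full = 7.9   # fine-tuned Llama 3.2 3B
baseline_pct_full = 2.0    # zero-shot Llama 3 8B
n_questions = 101

# Assumption: nominal sizes in billions of parameters, taken from the model names.
finetuned_params_b = 3
baseline_params_b = 8

improvement = finetuned_pct_full / baseline_pct_full
size_ratio = baseline_params_b / finetuned_params_b

# Percentages map back to whole questions in the 101-question sample.
finetuned_count = round(finetuned_pct_full / 100 * n_questions)
baseline_count = round(baseline_pct_full / 100 * n_questions)

print(f"fully-correct improvement: {improvement:.2f}x")
print(f"size ratio (8B / 3B):      {size_ratio:.1f}x")
print(f"fully-correct questions:   {finetuned_count} vs {baseline_count}")
```

In absolute terms the gap in fully-correct answers is 8 questions versus 2 out of 101.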
### By Subdomain — Llama 3.2 3B Fine-tuned (this model)

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 1.83 | 33.3% | 6 |
| Naturalization | 1.13 | 25.0% | 8 |
| Statistics | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Nonimmigrant visas | 0.88 | 12.5% | 8 |
| Adjustment of status | 0.75 | 0.0% | 8 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Asylum | 0.50 | 12.5% | 8 |
| Admissibility | 0.38 | 0.0% | 8 |
| Family-based immigration | 0.38 | 0.0% | 8 |
| Humanitarian | 0.38 | 0.0% | 8 |
| Removal | 0.38 | 0.0% | 8 |
| General | 0.25 | 0.0% | 8 |
| Employment-based (EB) | 0.00 | 0.0% | 4 |

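The overall figures for the fine-tuned model can be reconstructed from the per-subdomain rows as an N-weighted average. The sketch below recovers integer point totals by rounding `mean * N`, which assumes the per-row means were rounded from integer score sums.

```python
# (subdomain, mean score, % fully correct, N) copied from the table above.
rows = [
    ("Travel documents", 1.83, 33.3, 6),
    ("Naturalization", 1.13, 25.0, 8),
    ("Statistics", 1.13, 12.5, 8),
    ("Appeals", 1.00, 0.0, 3),
    ("Nonimmigrant visas", 0.88, 12.5, 8),
    ("Adjustment of status", 0.75, 0.0, 8),
    ("Employment authorization", 0.75, 12.5, 8),
    ("Asylum", 0.50, 12.5, 8),
    ("Admissibility", 0.38, 0.0, 8),
    ("Family-based immigration", 0.38, 0.0, 8),
    ("Humanitarian", 0.38, 0.0, 8),
    ("Removal", 0.38, 0.0, 8),
    ("General", 0.25, 0.0, 8),
    ("Employment-based (EB)", 0.00, 0.0, 4),
]

# Per-row values are rounded, so recover whole-number totals per subdomain.
total_n = sum(n for *_, n in rows)
total_points = sum(round(m * n) for _, m, _, n in rows)
total_full = sum(round(p / 100 * n) for _, _, p, n in rows)

overall_mean = total_points / total_n
overall_pct_full = 100 * total_full / total_n

print(f"N = {total_n}, points = {total_points}, fully correct = {total_full}")
print(f"overall mean = {overall_mean:.2f}, fully correct = {overall_pct_full:.1f}%")
```

The reconstructed values agree with the 0.68 mean and 7.9% fully-correct rate in the Overall Scores table.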
### By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Travel documents | 2.33 | 33.3% | 6 |
| Adjustment of status | 2.25 | 62.5% | 8 |
| Humanitarian | 2.13 | 50.0% | 8 |
| Asylum | 2.00 | 50.0% | 8 |
| Admissibility | 1.50 | 25.0% | 8 |
| Naturalization | 1.50 | 25.0% | 8 |
| Nonimmigrant visas | 1.50 | 25.0% | 8 |
| Removal | 1.25 | 12.5% | 8 |
| Statistics | 1.25 | 12.5% | 8 |
| Family-based immigration | 1.13 | 12.5% | 8 |
| Appeals | 1.00 | 0.0% | 3 |
| Employment authorization | 0.75 | 12.5% | 8 |
| Employment-based (EB) | 0.75 | 25.0% | 4 |
| General | 0.75 | 0.0% | 8 |

### By Subdomain — Llama 3 8B Zero-Shot Baseline

| Subdomain | Mean Score | % Fully Correct | N |
|-----------|-----------|----------------|---|
| Adjustment of status | 1.25 | 0.0% | 8 |
| Travel documents | 1.17 | 0.0% | 6 |
| Asylum | 1.13 | 12.5% | 8 |
| Removal | 0.88 | 0.0% | 8 |
| Statistics | 0.88 | 0.0% | 8 |
| Humanitarian | 0.75 | 12.5% | 8 |
| Naturalization | 0.75 | 0.0% | 8 |
| Admissibility | 0.75 | 0.0% | 8 |
| Nonimmigrant visas | 0.75 | 0.0% | 8 |
| Employment authorization | 0.63 | 0.0% | 8 |
| General | 0.63 | 0.0% | 8 |
| Employment-based (EB) | 0.50 | 0.0% | 4 |
| Family-based immigration | 0.50 | 0.0% | 8 |
| Appeals | 0.33 | 0.0% | 3 |

### Key Observations

- **The task is genuinely hard:** Even Claude Sonnet 4.6 (a frontier model) scores
  only 1.47/3.0 mean and 25.7% fully correct. This reflects the highly specific,
  citation-level precision required by immigration procedural questions.
- **Fine-tuning boosts the fully-correct rate:** The fine-tuned 3B model achieves 7.9%
  fully correct vs. 2.0% for the zero-shot Llama 3 8B baseline, roughly a 4x improvement
  on exact correctness despite being about 2.7x smaller and trained for only 1 epoch.
- **Strongest subdomains for the fine-tuned model:** travel documents (1.83), naturalization
  (1.13), statistics (1.13). Naturalization is well represented in the training data
  (~2,670 pairs), but travel documents (~109) and statistics (~141) are among the smallest
  subdomains, so training volume alone does not explain these scores.
- **Weakest subdomains:** employment-based (0.00), general (0.25), and a four-way tie at
  0.38 (admissibility, family-based immigration, humanitarian, removal). These are
  topics requiring cross-referencing multiple USCIS form instructions or policy details.
- **Room for improvement:** The fine-tuned model's mean score (0.68) is below that of the
  zero-shot Llama 3 8B baseline (0.80), suggesting that either 1 epoch of training is
  insufficient or the model needs more specific instruction tuning rather than
  completion-style fine-tuning.

### Reproducing the Benchmark

```bash
# Clone the repo and install dependencies
git clone https://github.com/nshportun/usa-immigration
cd usa-immigration
pip install -r requirements.txt

# Set environment variables (AWS Bedrock for baseline models + judge)
export ACCOUNT2_AWS_ACCESS_KEY_ID=...
export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...

# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
python scripts/benchmark/run_benchmark.py

# Run fine-tuned model inference on CPU (requires model artifacts locally)
# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
python scripts/benchmark/run_local_finetuned.py

# Results written to:
#   data_local/benchmark/results.jsonl  (per-question scores)
#   data_local/benchmark/summary.json   (aggregate table)
```

The benchmark script supports resume: it skips already-scored questions.
`random.seed(42)` ensures the same 101-question sample is selected on each run.

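The two properties described here, a seeded reproducible sample and resume by skipping scored questions, can be illustrated with a small standalone sketch. It is not the code in `run_benchmark.py`: the helper names, the `id` and `subdomain` fields, and the use of a local `random.Random(42)` instead of the global `random.seed(42)` are all assumptions made for illustration.

```python
import json
import random
from collections import defaultdict


def stratified_sample(rows, per_group, seed=42):
    """Pick up to `per_group` rows per subdomain, reproducibly."""
    rng = random.Random(seed)  # local RNG: same seed gives the same sample
    groups = defaultdict(list)
    for row in rows:
        groups[row["subdomain"]].append(row)
    sample = []
    for name in sorted(groups):  # fixed iteration order across runs
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def pending(sample, results_lines):
    """Drop questions whose id already appears in the results JSONL lines."""
    done = {json.loads(line)["id"] for line in results_lines if line.strip()}
    return [row for row in sample if row["id"] not in done]


# Hypothetical rows; the real eval set fields may differ.
rows = [{"id": i, "subdomain": f"topic-{i % 3}"} for i in range(30)]
first = stratified_sample(rows, per_group=4)
second = stratified_sample(rows, per_group=4)

# Pretend one question was scored in an earlier, interrupted run.
already_scored = [json.dumps({"id": first[0]["id"], "score": 2})]
todo = pending(first, already_scored)

print(len(first), first == second, len(todo))
```

Capping each group with `min(per_group, len(members))` is one way a stratified sample could end up with fewer questions in small subdomains, as seen in the N column of the per-subdomain tables.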
## Immigration Subdomains Covered

| Subdomain | QA Pairs |
|-----------|----------|
| Family-based immigration | ~3,987 |
| Naturalization | ~2,670 |
| Asylum | ~2,094 |
| Adjustment of status | ~1,727 |
| Removal | ~1,277 |
| Humanitarian | ~894 |
| Employment authorization | ~832 |
| Admissibility | ~553 |
| Nonimmigrant visas | ~548 |
| Statistics | ~141 |
| Travel documents | ~109 |
| Employment-based (EB) | ~74 |
| Appeals | ~66 |

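The table makes the class imbalance easier to see when the counts are expressed as shares of the full dataset. The sketch below computes those shares from the approximate counts and compares their sum to the 17,058 pairs quoted in the introduction. Attributing the remainder to the General category that appears in the benchmark tables is only a guess; the source does not state it.

```python
# Approximate pair counts copied from the table above.
pairs = {
    "Family-based immigration": 3987,
    "Naturalization": 2670,
    "Asylum": 2094,
    "Adjustment of status": 1727,
    "Removal": 1277,
    "Humanitarian": 894,
    "Employment authorization": 832,
    "Admissibility": 553,
    "Nonimmigrant visas": 548,
    "Statistics": 141,
    "Travel documents": 109,
    "Employment-based (EB)": 74,
    "Appeals": 66,
}
dataset_total = 17_058  # total quoted in the introduction

listed_total = sum(pairs.values())
# Pairs not accounted for by the 13 listed subdomains (possibly "General";
# that attribution is an assumption, not something the source states).
unlisted = dataset_total - listed_total

shares = {name: round(100 * count / dataset_total, 1) for name, count in pairs.items()}

for name, share in shares.items():
    print(f"{name:<28}{share:>5.1f}%")
print(f"listed: {listed_total}, unlisted: {unlisted}")
```

By this count the largest subdomain holds roughly a quarter of the data, while employment-based immigration and appeals each account for well under 1%.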
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nshportun/usa-immigration-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# `dtype` is the argument name in recent transformers releases (older ones use
# `torch_dtype`); device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
    {"role": "user", "content": "What is the filing fee for Form I-485?"},
]
# Render the chat template to a prompt string that ends with the assistant header.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The rendered template already contains the special tokens, so avoid adding a second BOS.
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
# Greedy decoding; slice off the prompt tokens before decoding the answer.
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Data Sources

- **USCIS Policy Manual**: primary_official
- **USCIS Forms & Instructions** (I-130, I-485, I-765, N-400, I-589...): primary_official
- **8 CFR / INA statute text**: primary_official
- **BIA Precedent Decisions**: primary_official
- **harshitha008/US-immigration-laws** (Apache 2.0): secondary_reputable
- **Law StackExchange immigration posts**: community

## Intended Use

- RAG-based immigration legal assistants
- Domain-specific LLM benchmarking
- Immigration law Q&A research

## Disclaimer

This model is for **research and educational purposes only**.
It does not constitute legal advice. Immigration law is complex and
changes frequently; always consult a licensed immigration attorney.