Initialize project; model provided by the ModelHub XC community

Model: nshportun/usa-immigration-llama-3.2-3b Source: Original Platform
2026-05-24 09:25:17 +08:00
commit 0b17fb2b41
10 changed files with 413258 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,225 @@
+---
+language:
+- en
+license: llama3.2
+base_model: meta-llama/Llama-3.2-3B-Instruct
+library_name: transformers
+tags:
+- legal
+- immigration
+- fine-tuned
+- llama
+- united-states
+- lora
+datasets:
+- nshportun/usa-immigration-law-qa
+pipeline_tag: text-generation
+---
+
+# USA Immigration Law — Llama 3.2 3B
+
+Fine-tuned from [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
+on the [nshportun/usa-immigration-law-qa](https://huggingface.co/datasets/nshportun/usa-immigration-law-qa)
+dataset — **17,058 source-grounded Q&A pairs** covering all major U.S. immigration subdomains.
+
+## Training Details
+
+| Setting | Value |
+|---------|-------|
+| Base model | Llama 3.2 3B Instruct |
+| Method | LoRA (r=8, alpha=32, merged into base weights) |
+| Training pairs | 16,065 |
+| Eval pairs | 993 (stratified across 13 subdomains) |
+| Epochs | 1 |
+| Batch size | 1 per device |
+| Quantization | int8 |
+| Learning rate | 1e-4 |
+| Max input length | 512 tokens |
+| Infrastructure | AWS SageMaker ml.g5.2xlarge (24GB VRAM) |
+| Train loss | 0.894 |
+| Eval loss | 0.903 |
+| Eval perplexity | **2.47** |
+
+## Benchmark Results
+
+Evaluated on a stratified random sample of **101 questions** from the held-out eval set,
+spanning the 13 immigration subdomains plus a General category (14 groups in the tables
+below). Answers were scored 0–3 by an LLM judge (Claude Sonnet 4.6) against reference
+answers from official sources.
+
+**Scoring scale:** 0 = wrong/hallucinated · 1 = partially correct · 2 = mostly correct · 3 = fully correct
+
+**Evaluation date:** 2026-05-17  
+**Judge model:** us.anthropic.claude-sonnet-4-6 (Amazon Bedrock)  
+**Eval set source:** nshportun/usa-immigration-law-qa, split=eval, seed=42  
+**Fine-tuned model inference:** local CPU (transformers 5.8.1, bfloat16, device_map=cpu)
+
+### Overall Scores
+
+| Model | Mean Score (0–3) | % Fully Correct (score=3) | N |
+|-------|-----------------|--------------------------|---|
+| **Llama 3.2 3B fine-tuned (this model)** | **0.68** | **7.9%** | **101** |
+| Claude Sonnet 4.6 zero-shot | 1.47 | 25.7% | 101 |
+| Llama 3 8B zero-shot (base family) | 0.80 | 2.0% | 101 |
+
+**Why baselines matter:** Claude Sonnet 4.6 is a frontier model far larger than
+this 3B model. Llama 3 8B zero-shot achieves only 2.0% fully-correct (2 of 101) on
+these domain-specific questions, which shows how hard the task is. The fine-tuned
+3B model achieves 7.9% fully-correct (8 of 101), ahead of the zero-shot 8B baseline
+on that metric despite being about 2.7x smaller, although its mean score (0.68) is
+lower than the 8B baseline's (0.80).
+
+### By Subdomain — Llama 3.2 3B Fine-tuned (this model)
+
+| Subdomain | Mean Score | % Fully Correct | N |
+|-----------|-----------|----------------|---|
+| Travel documents | 1.83 | 33.3% | 6 |
+| Naturalization | 1.13 | 25.0% | 8 |
+| Statistics | 1.13 | 12.5% | 8 |
+| Appeals | 1.00 | 0.0% | 3 |
+| Nonimmigrant visas | 0.88 | 12.5% | 8 |
+| Adjustment of status | 0.75 | 0.0% | 8 |
+| Employment authorization | 0.75 | 12.5% | 8 |
+| Asylum | 0.50 | 12.5% | 8 |
+| Admissibility | 0.38 | 0.0% | 8 |
+| Family-based immigration | 0.38 | 0.0% | 8 |
+| Humanitarian | 0.38 | 0.0% | 8 |
+| Removal | 0.38 | 0.0% | 8 |
+| General | 0.25 | 0.0% | 8 |
+| Employment-based (EB) | 0.00 | 0.0% | 4 |
+
+### By Subdomain — Claude Sonnet 4.6 Zero-Shot Baseline
+
+| Subdomain | Mean Score | % Fully Correct | N |
+|-----------|-----------|----------------|---|
+| Travel documents | 2.33 | 33.3% | 6 |
+| Adjustment of status | 2.25 | 62.5% | 8 |
+| Humanitarian | 2.13 | 50.0% | 8 |
+| Asylum | 2.00 | 50.0% | 8 |
+| Admissibility | 1.50 | 25.0% | 8 |
+| Naturalization | 1.50 | 25.0% | 8 |
+| Nonimmigrant visas | 1.50 | 25.0% | 8 |
+| Removal | 1.25 | 12.5% | 8 |
+| Statistics | 1.25 | 12.5% | 8 |
+| Family-based immigration | 1.13 | 12.5% | 8 |
+| Appeals | 1.00 | 0.0% | 3 |
+| Employment authorization | 0.75 | 12.5% | 8 |
+| Employment-based (EB) | 0.75 | 25.0% | 4 |
+| General | 0.75 | 0.0% | 8 |
+
+### By Subdomain — Llama 3 8B Zero-Shot Baseline
+
+| Subdomain | Mean Score | % Fully Correct | N |
+|-----------|-----------|----------------|---|
+| Adjustment of status | 1.25 | 0.0% | 8 |
+| Travel documents | 1.17 | 0.0% | 6 |
+| Asylum | 1.13 | 12.5% | 8 |
+| Removal | 0.88 | 0.0% | 8 |
+| Statistics | 0.88 | 0.0% | 8 |
+| Humanitarian | 0.75 | 12.5% | 8 |
+| Naturalization | 0.75 | 0.0% | 8 |
+| Admissibility | 0.75 | 0.0% | 8 |
+| Nonimmigrant visas | 0.75 | 0.0% | 8 |
+| Employment authorization | 0.63 | 0.0% | 8 |
+| General | 0.63 | 0.0% | 8 |
+| Employment-based (EB) | 0.50 | 0.0% | 4 |
+| Family-based immigration | 0.50 | 0.0% | 8 |
+| Appeals | 0.33 | 0.0% | 3 |
+
+### Key Observations
+
+- **The task is genuinely hard:** Even Claude Sonnet 4.6 (a frontier model) scores
+  only 1.47/3.0 mean and 25.7% fully-correct. This reflects the highly specific,
+  citation-level precision required by immigration procedural questions.
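+- **Higher fully-correct rate than the 8B baseline:** The 3B fine-tuned model achieves
+  7.9% fully-correct vs. 2.0% for the zero-shot Llama 3 8B baseline, roughly a 4x higher
+  rate of score-3 answers despite being about 2.7x smaller, after 1 epoch of domain
+  training. No zero-shot Llama 3.2 3B Instruct run is reported, so the gain cannot be
+  attributed to fine-tuning alone.
+- **Strongest subdomains for fine-tuned model:** travel documents (1.83), naturalization
+  (1.13), statistics (1.13). Of these, only naturalization is heavily represented in the
+  training data (~2,670 pairs); travel documents (~109) and statistics (~141) are among
+  the smallest subdomains, so data volume alone does not explain the ranking.
+- **Weakest subdomains:** employment-based (0.00) and general (0.25), followed by
+  admissibility, family-based immigration, humanitarian, and removal (0.38 each). These
+  topics require cross-referencing multiple USCIS form instructions or policy details.
+- **Room for improvement:** The fine-tuned model's mean (0.68) is below the zero-shot
+  Llama 3 8B baseline (0.80), suggesting either that 1 epoch of training is insufficient
+  or that the model needs more specific instruction tuning rather than completion-style
+  fine-tuning.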
+
+### Reproducing the Benchmark
+
+```bash
+# Clone repo and install deps
+git clone https://github.com/nshportun/usa-immigration
+cd usa-immigration
+pip install -r requirements.txt
+
+# Set environment variables (AWS Bedrock for baseline models + judge)
+export ACCOUNT2_AWS_ACCESS_KEY_ID=...
+export ACCOUNT2_AWS_SECRET_ACCESS_KEY=...
+
+# Run baseline benchmark (Claude Sonnet + Llama 3 8B via Bedrock)
+python scripts/benchmark/run_benchmark.py
+
+# Run fine-tuned model inference on CPU (requires model artifacts locally)
+# Download from: https://huggingface.co/nshportun/usa-immigration-llama-3.2-3b
+python scripts/benchmark/run_local_finetuned.py
+
+# Results written to:
+#   data_local/benchmark/results.jsonl  (per-question scores)
+#   data_local/benchmark/summary.json   (aggregate table)
+```
+
+The benchmark script supports resume — it skips already-scored questions.
+`random.seed(42)` ensures the same 101-question sample is selected each run.
+
+## Immigration Subdomains Covered
+
+| Subdomain | QA Pairs |
+|-----------|----------|
+| Family-based immigration | ~3,987 |
+| Naturalization | ~2,670 |
+| Asylum | ~2,094 |
+| Adjustment of status | ~1,727 |
+| Removal | ~1,277 |
+| Humanitarian | ~894 |
+| Employment authorization | ~832 |
+| Admissibility | ~553 |
+| Nonimmigrant visas | ~548 |
+| Statistics | ~141 |
+| Travel documents | ~109 |
+| Employment-based (EB) | ~74 |
+| Appeals | ~66 |
+
+## Usage
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "nshportun/usa-immigration-llama-3.2-3b"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# `dtype` is the keyword in recent transformers releases; older 4.x releases used `torch_dtype`.
+model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
+
+messages = [
+    {"role": "system", "content": "You are an expert on U.S. immigration law. Answer accurately based on USCIS, 8 CFR, and BIA sources."},
+    {"role": "user", "content": "What is the filing fee for Form I-485?"},
+]
+# Render the conversation with the model's chat template and append the assistant header.
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+# Greedy decoding (do_sample=False) keeps the output deterministic.
+out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
+# Decode only the newly generated tokens, skipping the prompt.
+print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```
+
+## Data Sources
+
+- **USCIS Policy Manual** — primary_official
+- **USCIS Forms & Instructions** (I-130, I-485, I-765, N-400, I-589...) — primary_official
+- **8 CFR / INA statute text** — primary_official
+- **BIA Precedent Decisions** — primary_official
+- **harshitha008/US-immigration-laws** (Apache 2.0) — secondary_reputable
+- **Law StackExchange immigration posts** — community
+
+## Intended Use
+
+- RAG-based immigration legal assistants
+- Domain-specific LLM benchmarking
+- Immigration law Q&A research
+
+## Disclaimer
+
+This model is for **research and educational purposes only**.
+It does not constitute legal advice. Immigration law is complex and
+changes frequently — always consult a licensed immigration attorney.
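
The aggregates reported in the README can be cross-checked against its own tables. The sketch below rests on two assumptions that the README does not state outright: that eval perplexity is the exponential of the eval loss, and that every answer received an integer score from 0 to 3, so each subdomain's point total can be recovered by rounding mean × N. The names `EVAL_LOSS`, `FINETUNED_ROWS`, and `aggregate` are illustrative and do not come from the repository.

```python
import math

# Eval loss as reported in the README's Training Details table.
EVAL_LOSS = 0.903

# (mean score, % fully correct, N) per subdomain for the fine-tuned model,
# copied from the README's per-subdomain table.
FINETUNED_ROWS = [
    (1.83, 33.3, 6),  # Travel documents
    (1.13, 25.0, 8),  # Naturalization
    (1.13, 12.5, 8),  # Statistics
    (1.00, 0.0, 3),   # Appeals
    (0.88, 12.5, 8),  # Nonimmigrant visas
    (0.75, 0.0, 8),   # Adjustment of status
    (0.75, 12.5, 8),  # Employment authorization
    (0.50, 12.5, 8),  # Asylum
    (0.38, 0.0, 8),   # Admissibility
    (0.38, 0.0, 8),   # Family-based immigration
    (0.38, 0.0, 8),   # Humanitarian
    (0.38, 0.0, 8),   # Removal
    (0.25, 0.0, 8),   # General
    (0.00, 0.0, 4),   # Employment-based (EB)
]


def aggregate(rows):
    """Rebuild the overall mean and fully-correct rate from per-subdomain rows.

    Assumes integer scores, so mean * N is rounded back to a whole number of
    points to undo the two-decimal rounding used in the table.
    """
    total_n = sum(n for _, _, n in rows)
    total_points = sum(round(mean * n) for mean, _, n in rows)
    total_full = sum(round(pct * n / 100) for _, pct, n in rows)
    return total_points / total_n, 100 * total_full / total_n, total_n


perplexity = math.exp(EVAL_LOSS)
mean_score, pct_full, n = aggregate(FINETUNED_ROWS)
print(f"perplexity={perplexity:.2f} mean={mean_score:.2f} full={pct_full:.1f}% N={n}")
```

Under those assumptions the recomputed values agree with the reported perplexity (2.47), overall mean (0.68), fully-correct rate (7.9%), and sample size (101). Averaging the rounded per-subdomain means directly, without recovering integer totals, drifts slightly because of the table's two-decimal rounding.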