---
license: llama3.1
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- credential-verification
- document-extraction
- fine-tuned
- arkova
- nessie
datasets:
- custom
language:
- en
pipeline_tag: text-generation
model-index:
- name: nessie-v5-llama-3.1-8b
  results:
  - task:
      type: text-generation
      name: Credential Metadata Extraction
    metrics:
    - type: weighted-f1
      value: 87.2
      name: Weighted F1
    - type: macro-f1
      value: 75.7
      name: Macro F1
---

# Nessie v5 (Llama 3.1 8B Fine-tune)

**Nessie** is Arkova's credential metadata extraction model, fine-tuned from Meta Llama 3.1 8B Instruct for structured extraction of credential metadata from PII-stripped document text.

## Model Details

- **Base model:** meta-llama/Meta-Llama-3.1-8B-Instruct
- **Fine-tuning:** Together AI (job ft-b8594db6-80f9)
- **Training data:** 1,903 train + 211 validation examples
- **Precision:** float16
- **Context length:** 32,768 tokens
- **Training mix:** 75% domain-specific + 25% general credential data

## Evaluation Results (v5)

| Metric | Value |
|--------|-------|
| Weighted F1 | 87.2% |
| Macro F1 | 75.7% |
| Mean Confidence | 72.5% |
| Mean Accuracy | 83.5% |
| Confidence Correlation (r) | 0.539 |
| Mean Latency | 1,543 ms |

### Per-Type Performance (Top 10)

| Type | Weighted F1 | Sample Size |
|------|------------|-------------|
| FINANCIAL | 100.0% | n=2 |
| TRANSCRIPT | 100.0% | n=2 |
| RESUME | 100.0% | n=2 |
| DEGREE | 98.5% | n=11 |
| PATENT | 97.1% | n=4 |
| LICENSE | 96.6% | n=10 |
| PROFESSIONAL | 95.8% | n=7 |
| INSURANCE | 93.3% | n=4 |
| LEGAL | 92.9% | n=3 |
| CLE | 91.1% | n=2 |

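To make the difference between the two headline averages concrete, here is a minimal sketch that computes a macro (unweighted) and a support-weighted mean over the ten rows of the table above. It assumes each row's score can be treated as that type's F1 and that the sample size is its support. The function names are illustrative and this is not the evaluation harness that produced the reported numbers.

```python
# Per-type score (%) and sample count, copied from the Top 10 table.
per_type = {
    "FINANCIAL": (100.0, 2),
    "TRANSCRIPT": (100.0, 2),
    "RESUME": (100.0, 2),
    "DEGREE": (98.5, 11),
    "PATENT": (97.1, 4),
    "LICENSE": (96.6, 10),
    "PROFESSIONAL": (95.8, 7),
    "INSURANCE": (93.3, 4),
    "LEGAL": (92.9, 3),
    "CLE": (91.1, 2),
}


def macro_f1(scores):
    """Unweighted mean: every type counts equally, however few samples it has."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Support-weighted mean: each type counts in proportion to its sample size."""
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total


print(f"macro    F1 over the top 10 types: {macro_f1(per_type):.1f}%")
print(f"weighted F1 over the top 10 types: {weighted_f1(per_type):.1f}%")
```

Both averages over these ten rows are well above the overall 87.2% and 75.7%, which is expected because the table lists only the best-scoring types. The overall figures also include the remaining, lower-scoring types.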
## Intended Use

Nessie extracts structured metadata from PII-stripped credential text. Input is pre-processed to remove personally identifiable information before it reaches the model.

**Important:** This model must be used with the condensed prompt it was trained on (~1.5K chars). Using the full extraction prompt (58K chars) yields 0% F1 because of the prompt template mismatch.

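Because the wrong prompt fails silently (the model still answers, only with 0% F1), a caller may want a cheap guard before sending a request. The sketch below is one way to do that, assuming prompt length is a good enough proxy for "condensed vs. full". The two sizes come from this card, while the tolerance and the `check_prompt` helper are hypothetical.

```python
# Sizes quoted in this model card.
CONDENSED_PROMPT_CHARS = 1_500
FULL_PROMPT_CHARS = 58_000

# Hypothetical tolerance: allow up to twice the condensed prompt's size.
MAX_PROMPT_CHARS = 2 * CONDENSED_PROMPT_CHARS


def check_prompt(system_prompt: str) -> str:
    """Return the prompt unchanged, or raise if it is far larger than the condensed prompt."""
    if len(system_prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"Prompt is {len(system_prompt)} chars; Nessie expects its condensed "
            f"prompt (~{CONDENSED_PROMPT_CHARS} chars), not the full extraction prompt."
        )
    return system_prompt
```

A length check cannot confirm that the text is the trained prompt, only that it is not the full one. Comparing a hash of the prompt against a known value would be a stricter alternative.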
## Credential Types Supported

DEGREE, LICENSE, CERTIFICATE, BADGE, SEC_FILING, LEGAL, REGULATION, PATENT, PUBLICATION, ATTESTATION, INSURANCE, FINANCIAL, MILITARY, CLE, RESUME, MEDICAL, IDENTITY, TRANSCRIPT, PROFESSIONAL, OTHER

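Downstream code can validate the predicted type against this closed list. The following is a minimal sketch under two assumptions that this card does not confirm: that the model returns a JSON object, and that the type is stored under a key named `credentialType` (hypothetical, chosen to match the camelCase style of `fraudSignals`).

```python
import json

# The 20 credential types listed in this model card.
SUPPORTED_TYPES = frozenset({
    "DEGREE", "LICENSE", "CERTIFICATE", "BADGE", "SEC_FILING", "LEGAL",
    "REGULATION", "PATENT", "PUBLICATION", "ATTESTATION", "INSURANCE",
    "FINANCIAL", "MILITARY", "CLE", "RESUME", "MEDICAL", "IDENTITY",
    "TRANSCRIPT", "PROFESSIONAL", "OTHER",
})


def parse_credential_type(raw_output: str) -> str:
    """Return a supported credential type, falling back to "OTHER".

    The fallback covers invalid JSON, a non-object payload, a missing key,
    and a value outside the supported list.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "OTHER"
    if not isinstance(data, dict):
        return "OTHER"
    # "credentialType" is an assumed key name, not a documented one.
    value = str(data.get("credentialType", "")).strip().upper()
    return value if value in SUPPORTED_TYPES else "OTHER"
```

For example, `parse_credential_type('{"credentialType": "degree"}')` returns `"DEGREE"`, while malformed output falls back to `"OTHER"`.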
## Domain-Specific Adapters

Nessie v5 includes domain-specific LoRA adapters trained on specialized corpora:

- **SEC** (45K examples): SEC filings, financial disclosures
- **Academic** (45K examples): Degrees, transcripts, publications
- **Legal** (13K examples): Legal documents, bar admissions, CLE
- **Regulatory** (13K examples): Licenses, regulations, compliance

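One way to use the adapters is to route each document to an adapter by its credential type. The mapping below is a sketch inferred from the adapter descriptions above, not a documented routing table. The adapter identifiers (`"sec"`, `"academic"`, `"legal"`, `"regulatory"`) and the `select_adapter` helper are hypothetical.

```python
from typing import Optional

# Assumed mapping from credential type to adapter, inferred from the
# adapter descriptions in this card. Types not listed use the base model.
ADAPTER_BY_TYPE = {
    "SEC_FILING": "sec",
    "FINANCIAL": "sec",
    "DEGREE": "academic",
    "TRANSCRIPT": "academic",
    "PUBLICATION": "academic",
    "LEGAL": "legal",
    "CLE": "legal",
    "LICENSE": "regulatory",
    "REGULATION": "regulatory",
}


def select_adapter(credential_type: str) -> Optional[str]:
    """Return the adapter name for a credential type, or None for the base model."""
    return ADAPTER_BY_TYPE.get(credential_type.strip().upper())
```

Returning `None` for unmapped types such as RESUME or MEDICAL keeps the fallback explicit, so the caller decides whether to run the base fine-tune or skip the document.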
## Limitations

- Only processes PII-stripped text (by design)
- Small sample sizes for some credential types (FINANCIAL, TRANSCRIPT, and RESUME each have n=2)
- The `fraudSignals` field has 0% F1 (known limitation, under improvement)
- Expected calibration error (ECE) of 11% for confidence scores (recalibrated via a piecewise linear function)

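For readers unfamiliar with the calibration terms in the last item, here is a minimal sketch of a standard equal-width-bin ECE and a piecewise linear recalibration map. The bin count and the knot values in `KNOTS` are made up for illustration. They are not the parameters used for Nessie, which this card does not publish.

```python
from bisect import bisect_right


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical (raw confidence, calibrated confidence) knots, sorted by raw value.
KNOTS = [(0.0, 0.0), (0.5, 0.6), (0.8, 0.85), (1.0, 1.0)]


def recalibrate(conf, knots=KNOTS):
    """Map a raw confidence through linear interpolation between neighbouring knots."""
    conf = min(max(conf, 0.0), 1.0)
    xs = [x for x, _ in knots]
    i = max(1, min(bisect_right(xs, conf), len(knots) - 1))
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (y1 - y0) * (conf - x0) / (x1 - x0)
```

In practice the knots would be fitted on a held-out set so that the recalibrated confidence in each range tracks the accuracy observed there.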
## Citation

```
@software{nessie-v5,
  title={Nessie v5: Credential Metadata Extraction Model},
  author={Arkova},
  year={2026},
  url={https://arkova.ai}
}
```

## License

This model is released under the Llama 3.1 Community License. See Meta's license for details.