---
license: mit
language:
- en
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- phi3
- lora
- peft
- clinical-ai
- model-audit
- text-generation
- fine-tuned
- healthcare
- safetensors
---

# 🏥 phi3-auditor-merged

**Phi-3-mini fine-tuned for clinical AI model auditing.**

This model takes a JSON object of ML performance metrics (AUC, ECE, drift, label shift, etc.) and returns a structured health classification label plus a detailed explanation, helping teams audit deployed clinical models for drift, calibration failure, class imbalance, and other issues.

---

## Model Details

| Property | Value |
|---|---|
| **Base Model** | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) via PEFT |
| **Training Precision** | 8-bit quantized (BitsAndBytesConfig) |
| **Merged Precision** | FP16 (float16 safetensors) |
| **Parameters** | ~3.8B |
| **Model Size** | 7.65 GB (2 safetensors shards) |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| **Task Type** | Causal Language Modeling |
| **PEFT Version** | 0.18.0 |
| **Training Epochs** | 3 |
| **Final Loss** | ~0.41 |

---

## Intended Use

### What this model does

Given a JSON report of clinical ML model performance metrics, the model:

1. Assigns a **Category** label (e.g. `Calibration Failure`, `Major Drift`, `Class Imbalance Problem`, `Healthy`)
2. Generates a concise **Explanation** with observations and recommendations

### Intended users

- ML engineers monitoring deployed clinical models
- Healthcare data science teams running periodic model audits
- Researchers studying automated model health assessment

### Out-of-scope use

- Not suitable for direct clinical decision-making or patient diagnosis
- Not a replacement for domain expert review of model performance
- Not designed for non-clinical ML tasks
- Should not be used on data types outside its training distribution (non-tabular metrics, images, etc.)

---

## How to Use

### Requirements

```bash
pip install transformers torch accelerate
```

### Basic inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PhantomAjusshi/phi3-auditor-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,  # Required for custom Phi-3 modeling files
)

# Metrics report passed to the model as a JSON string
report = """{
  "auc": 0.863,
  "accuracy": 0.83,
  "precision": 0.79,
  "recall": 0.69,
  "f1": 0.79,
  "ece": 0.278,
  "brier": 0.263,
  "drift": 0.03,
  "missing_rate": 0.003,
  "label_shift": 0.06,
  "pos_rate": 0.10,
  "data_integrity_issues": 0
}"""

prompt = (
    "<|system|>\nYou are a clinical AI auditor model.\n"
    f"<|user|>\nInstruction: Analyze the clinical model report and classify its health.\n\nReport:\n{report}\n"
    "<|assistant|>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
    )

# Keep only the assistant's reply by decoding the newly generated tokens.
# skip_special_tokens=True can strip the "<|assistant|>" marker, so splitting
# the full decoded text on that marker is unreliable.
prompt_length = inputs["input_ids"].shape[-1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(reply)
```

### Expected output format

```
Category: Calibration Failure
Explanation: High calibration error (ECE 0.278) despite reasonable discrimination (AUC 0.863).
The model's probability outputs are poorly aligned with actual outcomes. Recommend
recalibration using Platt scaling or isotonic regression, and threshold review.
```
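
Downstream code usually needs the two fields separately. The following is a minimal parsing sketch, assuming the reply follows the `Category:` / `Explanation:` layout shown above; the names `AuditResult` and `parse_reply` are illustrative and not part of this repository.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditResult:
    category: str
    explanation: str


# Assumes "Category:" appears before "Explanation:" in the reply.
_REPLY_PATTERN = re.compile(
    r"Category:\s*(?P<category>[^\n]+?)\s*Explanation:\s*(?P<explanation>.+)",
    re.DOTALL,
)


def parse_reply(reply: str) -> Optional[AuditResult]:
    """Split a generated reply into its category and explanation.

    Returns None when the reply does not contain both fields, so the
    caller can decide how to handle malformed generations.
    """
    match = _REPLY_PATTERN.search(reply)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces inside the explanation.
    explanation = " ".join(match.group("explanation").split())
    return AuditResult(match.group("category").strip(), explanation)
```

Returning `None` instead of raising keeps the caller in control: with sampling enabled, an occasional reply may not follow the format, and such cases are probably worth logging rather than crashing on.
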

### Input metrics reference

| Metric | Description |
|---|---|
| `auc` | Area Under the ROC Curve |
| `accuracy` | Overall classification accuracy |
| `precision` | Positive predictive value |
| `recall` | Sensitivity / true positive rate |
| `f1` | Harmonic mean of precision and recall |
| `ece` | Expected Calibration Error |
| `brier` | Brier score (probabilistic accuracy) |
| `drift` | Feature distribution drift score |
| `missing_rate` | Rate of missing input features |
| `label_shift` | Output label distribution shift |
| `pos_rate` | Positive prediction rate |
| `data_integrity_issues` | Count of detected data quality issues |
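
Before sending a report to the model, it may help to check that it matches the table above. The sketch below assumes that all twelve keys are expected and that every value is numeric; the card does not say how the model behaves with missing or extra keys, and `find_report_problems` is a hypothetical helper.

```python
from numbers import Real

# Keys taken from the metrics reference table.
REQUIRED_METRICS = (
    "auc", "accuracy", "precision", "recall", "f1", "ece", "brier",
    "drift", "missing_rate", "label_shift", "pos_rate",
    "data_integrity_issues",
)


def find_report_problems(metrics: dict) -> list[str]:
    """Return human-readable problems; an empty list means the report looks usable."""
    problems = []
    for key in REQUIRED_METRICS:
        if key not in metrics:
            problems.append(f"missing metric: {key}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(metrics[key], bool) or not isinstance(metrics[key], Real):
            problems.append(f"non-numeric metric: {key}")
    for key in metrics:
        if key not in REQUIRED_METRICS:
            problems.append(f"unexpected metric: {key}")
    return problems
```
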

---

## Training Details

### Dataset

- **Name:** Custom synthetic clinical audit dataset (`audit_dataset_v2_5000.json`)
- **Size:** 5,000 labeled samples
- **Split:** 80% train (4,000) / 20% test (1,000)
- **Format:** JSONL; each record has `instruction`, `input` (metrics JSON), and `output` (category + explanation)
- **Generation date:** November 17, 2025

Each sample pairs a set of synthetic model performance metrics with a human-written audit label and explanation covering categories such as:
- Healthy / Passing
- Calibration Failure
- Major Drift / Potential Drift
- Class Imbalance Problem
- Data Integrity Issue
- Needs Review / Critical Failure

### Training procedure

The base model was loaded in 8-bit using `BitsAndBytesConfig` and adapted with LoRA targeting the attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`). After training, the LoRA adapter was merged into the base model weights with PEFT's `merge_and_unload()` method and saved as full FP16 safetensors.
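
The procedure described above could look roughly like the sketch below. It is not the author's training script: the LoRA values come from the Model Details table, while the function names, the `adapter_dir` / `out_dir` paths, the call to `prepare_model_for_kbit_training`, and the choice to reload the base model in FP16 before merging are all assumptions.

```python
# LoRA settings as listed in the Model Details table.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

BASE_MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"


def build_lora_model(base_id: str = BASE_MODEL_ID):
    """Load the base model in 8-bit and wrap it with a LoRA adapter."""
    # Heavy imports are kept local so the settings above can be inspected
    # without transformers, peft, or bitsandbytes installed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    # Assumption: the source does not say whether this preparation step was used.
    base = prepare_model_for_kbit_training(base)
    return get_peft_model(base, LoraConfig(**LORA_SETTINGS))


def merge_adapter(adapter_dir: str, out_dir: str, base_id: str = BASE_MODEL_ID):
    """Merge a trained adapter into FP16 base weights and save safetensors.

    adapter_dir and out_dir are hypothetical local paths.
    """
    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Assumption: reload the base in FP16 so the merged weights are FP16.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir, safe_serialization=True)
    return merged
```
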

**Prompt format used during training:**

```
<|system|>
You are an AI auditor analyzing clinical model performance reports.
<|user|>
Instruction: Analyze the clinical model report and classify its health.

Report:
{ ...metrics JSON... }
<|assistant|>
Category: <label>
Explanation: <explanation>
```
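
To keep inference prompts aligned with this template, the prompt can be assembled programmatically. This is a small sketch under the assumption that the metrics are serialized with `json.dumps`; the exact whitespace used when the training data was generated is not documented, and `build_prompt` is an illustrative name.

```python
import json

# System prompt and instruction copied from the training template above.
TRAINING_SYSTEM_PROMPT = (
    "You are an AI auditor analyzing clinical model performance reports."
)
INSTRUCTION = "Analyze the clinical model report and classify its health."


def build_prompt(metrics: dict, system_prompt: str = TRAINING_SYSTEM_PROMPT) -> str:
    """Render a metrics dict into the prompt layout used during training."""
    report = json.dumps(metrics, indent=2)  # assumption: indented JSON
    return (
        f"<|system|>\n{system_prompt}\n"
        f"<|user|>\nInstruction: {INSTRUCTION}\n\nReport:\n{report}\n"
        "<|assistant|>\n"
    )
```

The returned string ends with the `<|assistant|>` marker, so generation starts at the `Category:` line.
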

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Learning rate | 1e-4 |
| Warmup ratio | 0.1 |
| Max sequence length | 512 |
| Optimizer | AdamW (default) |
| Precision | FP16 (mixed) |
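
The derived quantities in this table can be checked with a little arithmetic. The sketch assumes single-device training (so the effective batch size is batch size times accumulation steps), takes the total of 675 optimizer steps from the last row of the training-loss table, and assumes the warmup length is rounded up.

```python
import math

batch_size = 4
gradient_accumulation_steps = 4
epochs = 3
warmup_ratio = 0.1
total_optimizer_steps = 675  # final logged step in the training-loss table

# Assumption: one device, so no extra multiplier for data parallelism.
effective_batch_size = batch_size * gradient_accumulation_steps
steps_per_epoch = total_optimizer_steps // epochs
# Assumption: warmup steps are rounded up to a whole step.
warmup_steps = math.ceil(warmup_ratio * total_optimizer_steps)

print(f"effective batch size: {effective_batch_size}")  # 16
print(f"optimizer steps per epoch: {steps_per_epoch}")  # 225
print(f"warmup steps: {warmup_steps}")                  # 68
```
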

### Training loss

| Step | Epoch | Loss |
|---|---|---|
| 50 | 0.22 | 1.623 |
| 100 | 0.44 | 0.657 |
| 150 | 0.67 | 0.444 |
| 200 | 0.89 | 0.420 |
| 300 | 1.33 | 0.413 |
| 450 | 2.00 | 0.412 |
| 600 | 2.67 | 0.408 |
| 675 | 3.00 | ~0.410 |

Loss fell sharply over the first 150 steps (from 1.623 at step 50 to 0.444 at step 150), then stabilized around 0.41 for the remainder of training.

---

## Evaluation

The model was evaluated on a held-out test set of 1,000 samples. The `Category:` field is extracted from each generated output and compared with the ground-truth label to compute accuracy and weighted precision, recall, and F1.
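
The scoring step described above can be sketched in plain Python as follows. This is an illustration and not the author's evaluation script: the `UNPARSED` sentinel for replies without a `Category:` line and the function names are assumptions, and "weighted" is taken to mean per-class scores averaged by each class's share of the ground-truth labels.

```python
import re
from collections import Counter

CATEGORY_RE = re.compile(r"^\s*Category:\s*(.+?)\s*$", re.MULTILINE)


def extract_category(generated: str) -> str:
    """Return the label on the first 'Category:' line, or a sentinel."""
    match = CATEGORY_RE.search(generated)
    return match.group(1) if match else "UNPARSED"


def weighted_scores(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)    # ground-truth count per class
    predicted = Counter(y_pred)  # prediction count per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Counting an unparseable reply as a wrong prediction, rather than dropping it, keeps the denominator at the full test-set size.
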

> Formal evaluation metrics will be added here once a full benchmark run is completed.

---

## Limitations & Bias

- **Synthetic training data:** The model was trained entirely on synthetically generated audit reports. Real-world clinical model metrics may follow different distributions or contain edge cases not represented in training.
- **Label sensitivity:** The model may be sensitive to metric combinations near decision boundaries between categories.
- **No temporal reasoning:** The model does not reason about metric trends over time; each inference is based on a single snapshot of metrics.
- **English only:** All training data is in English.
- **Not a substitute for expert review:** Outputs should be treated as decision support, not a final audit verdict.

---

## Repository & Related Work

- **Training code:** [Hospital-Audit-Trained-Model (GitHub)](https://github.com/PhantomAjusshi/Hospital-Audit-Trained-Model)
- **Web application:** [Hospital-Model-Audit-Website (GitHub)](https://github.com/PhantomAjusshi/Hospital-Model-Audit-Website), a full-stack Next.js + FastAPI interface that uses this model via llama.cpp

---

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{phi3-auditor-merged,
  author    = {PhantomAjusshi},
  title     = {phi3-auditor-merged: Phi-3-mini fine-tuned for clinical AI model auditing},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/PhantomAjusshi/phi3-auditor-merged}
}
```

---

## License

This model is released under the **MIT License**.

The base model ([microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)) is subject to Microsoft's Phi-3 license. Please review it before use in commercial or production settings.