| license | language | base_model | pipeline_tag | library_name | tags |
|---|---|---|---|---|---|
| mit | | | text-generation | transformers | |
🏥 phi3-auditor-merged
Phi-3-mini fine-tuned for clinical AI model auditing.
This model takes a JSON object of ML performance metrics (AUC, ECE, drift, label shift, etc.) and returns a structured health classification label plus a detailed explanation — helping teams audit deployed clinical models for drift, calibration failure, class imbalance, and other issues.
Model Details
| Property | Value |
|---|---|
| Base Model | microsoft/Phi-3-mini-4k-instruct |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) via PEFT |
| Training Precision | 8-bit quantized (BitsAndBytesConfig) |
| Merged Precision | FP16 (float16 safetensors) |
| Parameters | ~3.8B |
| Model Size | 7.65 GB (2 safetensor shards) |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Task Type | Causal Language Modeling |
| PEFT Version | 0.18.0 |
| Training Epochs | 3 |
| Final Loss | ~0.41 |
Intended Use
What this model does
Given a JSON report of clinical ML model performance metrics, the model:
- Assigns a Category label (e.g. Calibration Failure, Major Drift, Class Imbalance Problem, Healthy)
- Generates a concise Explanation with observations and recommendations
Intended users
- ML engineers monitoring deployed clinical models
- Healthcare data science teams running periodic model audits
- Researchers studying automated model health assessment
Out-of-scope use
- Not suitable for direct clinical decision-making or patient diagnosis
- Not a replacement for domain expert review of model performance
- Not designed for non-clinical ML tasks
- Should not be used on data types outside its training distribution (non-tabular metrics, images, etc.)
How to Use
Requirements
pip install transformers torch accelerate
Basic inference
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PhantomAjusshi/phi3-auditor-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,  # Required for custom Phi-3 modeling files
)

# Metrics of the clinical model under audit, passed to the auditor as a JSON string
report = """{
  "auc": 0.863,
  "accuracy": 0.83,
  "precision": 0.79,
  "recall": 0.69,
  "f1": 0.79,
  "ece": 0.278,
  "brier": 0.263,
  "drift": 0.03,
  "missing_rate": 0.003,
  "label_shift": 0.06,
  "pos_rate": 0.10,
  "data_integrity_issues": 0
}"""

prompt = (
    f"<|system|>\nYou are a clinical AI auditor model.\n"
    f"<|user|>\nInstruction: Analyze the clinical model report and classify its health.\n\nReport:\n{report}\n"
    f"<|assistant|>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
    )

# Decode only the newly generated tokens. The Phi-3 tokenizer treats the role
# markers as special tokens, so skip_special_tokens=True can strip them, which
# makes splitting the decoded text on "<|assistant|>" unreliable.
prompt_len = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
print(reply)
Expected output format
Category: Calibration Failure
Explanation: High calibration error (ECE 0.278) despite reasonable discrimination (AUC 0.863).
The model's probability outputs are poorly aligned with actual outcomes. Recommend
recalibration using Platt scaling or isotonic regression, and threshold review.
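Downstream code usually needs the label and the explanation as separate fields. The following is a minimal sketch of one way to split a reply in the format above; `AuditResult` and `parse_reply` are hypothetical names introduced here, and the sketch assumes the reply contains a single `Category:` line followed by an `Explanation:` section.

```python
import re
from typing import NamedTuple, Optional


class AuditResult(NamedTuple):
    category: Optional[str]
    explanation: Optional[str]


def parse_reply(reply: str) -> AuditResult:
    """Split a generated reply into its Category and Explanation parts."""
    # The label is whatever follows "Category:" up to the end of that line.
    cat = re.search(r"Category:\s*(.+)", reply)
    # The explanation may wrap across lines, so capture to the end of the text.
    exp = re.search(r"Explanation:\s*(.+)", reply, flags=re.DOTALL)
    return AuditResult(
        category=cat.group(1).strip() if cat else None,
        # Collapse line wraps and repeated whitespace into single spaces.
        explanation=" ".join(exp.group(1).split()) if exp else None,
    )


example = (
    "Category: Calibration Failure\n"
    "Explanation: High calibration error (ECE 0.278) despite reasonable\n"
    "discrimination (AUC 0.863)."
)
result = parse_reply(example)
print(result.category)
print(result.explanation)
```

Returning `None` for a missing field, rather than raising, lets the caller decide how to handle replies that drift from the expected format.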
Input metrics reference
| Metric | Description |
|---|---|
| auc | Area Under the ROC Curve |
| accuracy | Overall classification accuracy |
| precision | Positive predictive value |
| recall | Sensitivity / true positive rate |
| f1 | Harmonic mean of precision and recall |
| ece | Expected Calibration Error |
| brier | Brier score (probabilistic accuracy) |
| drift | Feature distribution drift score |
| missing_rate | Rate of missing input features |
| label_shift | Output label distribution shift |
| pos_rate | Positive prediction rate |
| data_integrity_issues | Count of detected data quality issues |
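For readers less familiar with the two calibration metrics, here is a rough sketch of how `ece` and `brier` are commonly computed for a binary classifier. The source does not state which binning scheme produced its values, so the 10 equal-width bins and the positive-class formulation are assumptions, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |observed positive rate - mean predicted prob|."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / len(probs)) * abs(pos_rate - mean_prob)
    return ece


probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 0, 0, 0]
print(round(brier_score(probs, labels), 3))
print(round(expected_calibration_error(probs, labels), 3))
```

Lower is better for both metrics; a model can rank cases well (high `auc`) and still score poorly here if its probabilities are systematically too high or too low.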
Training Details
Dataset
- Name: Custom synthetic clinical audit dataset (audit_dataset_v2_5000.json)
- Size: 5,000 labeled samples
- Split: 80% train (4,000) / 20% test (1,000)
- Format: JSONL; each record has instruction, input (metrics JSON), output (category + explanation)
- Generation date: November 17, 2025
Each sample pairs a set of synthetic model performance metrics with a human-written audit label and explanation covering categories such as:
- Healthy / Passing
- Calibration Failure
- Major Drift / Potential Drift
- Class Imbalance Problem
- Data Integrity Issue
- Needs Review / Critical Failure
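As an illustration of the record layout and the 80/20 split described above, here is a small sketch. The source does not say how the split was produced or whether `input` is stored as a JSON string or a nested object, so the seeded shuffle, the string encoding, and the helper names `parse_jsonl` and `split_records` are all assumptions.

```python
import json
import random
from typing import Iterable, List, Tuple


def parse_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def split_records(
    records: List[dict], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[dict], List[dict]]:
    """Shuffle with a fixed seed, then hold out `test_fraction` as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative record shape only; the metrics are stored as a JSON string here (assumption).
sample_line = json.dumps({
    "instruction": "Analyze the clinical model report and classify its health.",
    "input": json.dumps({"auc": 0.863, "ece": 0.278, "drift": 0.03}),
    "output": "Category: Calibration Failure\nExplanation: ...",
})

records = parse_jsonl([sample_line] * 5000)
train, test = split_records(records)
print(len(train), len(test))
```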
Training procedure
The base model was loaded in 8-bit using BitsAndBytesConfig and adapted with LoRA targeting the attention projection layers (q_proj, k_proj, v_proj, o_proj). After training, the LoRA adapter was merged into the base model weights using PEFT's merge_and_unload() method, and the result was saved as full FP16 safetensors.
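The LoRA settings listed in Model Details (rank 16, alpha 32) can be read through the standard LoRA update. The expression below assumes PEFT's default scaling of alpha / r; the source does not say whether a variant such as rank-stabilized scaling was used. The symbols W, A, B, d, and k are notation introduced here for illustration.

```latex
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},
\qquad r = 16,\ \alpha = 32 \;\Rightarrow\; \frac{\alpha}{r} = 2
```

Merging folds the scaled product BA into each targeted projection matrix, so the merged checkpoint can be loaded without PEFT at inference time.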
Prompt format used during training:
<|system|>
You are an AI auditor analyzing clinical model performance reports.
<|user|>
Instruction: Analyze the clinical model report and classify its health.
Report:
{ ...metrics JSON... }
<|assistant|>
Category: <label>
Explanation: <explanation>
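A small helper can keep inference prompts aligned with this training format. The sketch below is one possible way to build the prompt from a metrics dictionary; `build_prompt` and `REQUIRED_KEYS` are hypothetical names, treating all twelve metrics as required is an assumption, and the blank line before `Report:` follows the inference snippet rather than anything stated about the training data.

```python
import json

SYSTEM_PROMPT = "You are an AI auditor analyzing clinical model performance reports."
INSTRUCTION = "Analyze the clinical model report and classify its health."

# Metric keys from the input metrics reference (assumed to all be required).
REQUIRED_KEYS = (
    "auc", "accuracy", "precision", "recall", "f1", "ece", "brier",
    "drift", "missing_rate", "label_shift", "pos_rate", "data_integrity_issues",
)


def build_prompt(metrics: dict) -> str:
    """Render a metrics dict into the system/user/assistant prompt layout."""
    missing = [key for key in REQUIRED_KEYS if key not in metrics]
    if missing:
        raise ValueError(f"report is missing metrics: {missing}")
    report = json.dumps(metrics, indent=2)
    return (
        f"<|system|>\n{SYSTEM_PROMPT}\n"
        f"<|user|>\nInstruction: {INSTRUCTION}\n\nReport:\n{report}\n"
        f"<|assistant|>\n"
    )


metrics = {
    "auc": 0.863, "accuracy": 0.83, "precision": 0.79, "recall": 0.69,
    "f1": 0.79, "ece": 0.278, "brier": 0.263, "drift": 0.03,
    "missing_rate": 0.003, "label_shift": 0.06, "pos_rate": 0.10,
    "data_integrity_issues": 0,
}
prompt = build_prompt(metrics)
print(prompt)
```

Failing fast on missing keys avoids sending the model a report shape it never saw during fine-tuning.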
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Learning rate | 1e-4 |
| Warmup ratio | 0.1 |
| Max sequence length | 512 |
| Optimizer | AdamW (default) |
| Precision | FP16 (mixed) |
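The effective batch size and the resulting number of optimizer steps follow from simple arithmetic, sketched below. It assumes a single device and one optimizer step per accumulated batch; `optimizer_steps` is a hypothetical helper.

```python
import math

per_device_batch = 4
grad_accum_steps = 4
epochs = 3

# Gradients from 4 micro-batches of 4 samples are accumulated before each update.
effective_batch = per_device_batch * grad_accum_steps


def optimizer_steps(n_train: int) -> int:
    """Total optimizer updates for a training set of n_train samples."""
    return math.ceil(n_train / effective_batch) * epochs


print(effective_batch)
print(optimizer_steps(4000))
print(optimizer_steps(3600))
```

With the 4,000-sample training split this gives 250 steps per epoch, or 750 in total. The training loss table instead reports epoch 3.00 at step 675, which is 225 steps per epoch and corresponds to about 3,600 samples. The source does not explain the difference; one possibility is that part of the training split was held out for validation.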
Training loss
| Step | Epoch | Loss |
|---|---|---|
| 50 | 0.22 | 1.623 |
| 100 | 0.44 | 0.657 |
| 150 | 0.67 | 0.444 |
| 200 | 0.89 | 0.420 |
| 300 | 1.33 | 0.413 |
| 450 | 2.00 | 0.412 |
| 600 | 2.67 | 0.408 |
| 675 | 3.00 | ~0.410 |
Loss fell rapidly over the first 150 steps (from 1.623 at step 50 to 0.444 at step 150), then stabilized around 0.41 for the remainder of training.
Evaluation
The evaluation protocol uses the held-out test set of 1,000 samples. The Category: field is extracted from each generated output and compared with the ground-truth label, and weighted precision, recall, F1, and accuracy are computed from those comparisons.
Formal evaluation metrics will be added here once a full benchmark run is completed.
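The scoring procedure described above can be sketched in plain Python. This is an illustrative reimplementation, not the project's evaluation script: `extract_category` and `weighted_metrics` are hypothetical names, the `UNPARSED` placeholder for outputs without a label is an assumption, and "weighted" is taken to mean per-class scores averaged by true-label support.

```python
import re
from collections import Counter
from typing import Dict, List


def extract_category(text: str) -> str:
    """Pull the label from the 'Category:' line, or a placeholder if it is absent."""
    match = re.search(r"Category:\s*(.+)", text)
    return match.group(1).strip() if match else "UNPARSED"


def weighted_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / n
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = n / total  # each class counts in proportion to its true frequency
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


generated = [
    "Category: Healthy\nExplanation: Metrics within expected ranges.",
    "Category: Major Drift\nExplanation: Drift score is elevated.",
    "Category: Major Drift\nExplanation: Drift score is elevated.",
    "The report looks unusual.",  # no Category line, counted as a miss
]
truth = ["Healthy", "Healthy", "Major Drift", "Calibration Failure"]

scores = weighted_metrics(truth, [extract_category(g) for g in generated])
print(scores)
```

Counting unparseable outputs as errors, rather than dropping them, keeps the reported scores from overstating performance when the model fails to follow the output format.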
Limitations & Bias
- Synthetic training data: The model was trained entirely on synthetically generated audit reports. Real-world clinical model metrics may follow different distributions or contain edge cases not represented in training.
- Label sensitivity: The model may be sensitive to metric combinations near decision boundaries between categories.
- No temporal reasoning: The model does not reason about metric trends over time — each inference is based on a single snapshot of metrics.
- English only: All training data is in English.
- Not a substitute for expert review: Outputs should be treated as decision-support, not a final audit verdict.
Repository & Related Work
- Training code: Hospital-Audit-Trained-Model (GitHub)
- Web application: Hospital-Model-Audit-Website (GitHub) — a full-stack Next.js + FastAPI interface that uses this model via llama.cpp
Citation
If you use this model in your work, please cite:
@misc{phi3-auditor-merged,
author = {PhantomAjusshi},
title = {phi3-auditor-merged: Phi-3-mini fine-tuned for clinical AI model auditing},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/PhantomAjusshi/phi3-auditor-merged}
}
License
This model is released under the MIT License.
The base model (microsoft/Phi-3-mini-4k-instruct) is subject to Microsoft's Phi-3 license. Please review it before use in commercial or production settings.