Initialize project; model provided by the ModelHub XC community
Model: PhantomAjusshi/phi3-auditor-merged Source: Original Platform
278
README.md
Normal file
@@ -0,0 +1,278 @@
---
license: mit
language:
- en
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- phi3
- lora
- peft
- clinical-ai
- model-audit
- text-generation
- fine-tuned
- healthcare
- safetensors
---

# 🏥 phi3-auditor-merged

**Phi-3-mini fine-tuned for clinical AI model auditing.**

This model takes a JSON object of ML performance metrics (AUC, ECE, drift, label shift, etc.) and returns a structured health classification label plus a detailed explanation, helping teams audit deployed clinical models for drift, calibration failure, class imbalance, and other issues.

---

## Model Details

| Property | Value |
|---|---|
| **Base Model** | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) via PEFT |
| **Training Precision** | 8-bit quantized (BitsAndBytesConfig) |
| **Merged Precision** | FP16 (float16 safetensors) |
| **Parameters** | ~3.8B |
| **Model Size** | 7.65 GB (2 safetensor shards) |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **LoRA Dropout** | 0.05 |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| **Task Type** | Causal Language Modeling |
| **PEFT Version** | 0.18.0 |
| **Training Epochs** | 3 |
| **Final Loss** | ~0.41 |

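
Two of the table values can be sanity-checked with simple arithmetic. The sketch below assumes exactly 3.8e9 parameters (the table only gives an approximate count) and the standard LoRA scaling of alpha divided by rank; it is a rough check, not a measurement of the actual checkpoint.

```python
# Rough checks on the table values (assumptions are noted inline).
params = 3.8e9           # "~3.8B" from the table; the exact count is an assumption
bytes_per_param = 2      # float16 stores each weight in 2 bytes

size_gb = params * bytes_per_param / 1e9      # decimal gigabytes
size_gib = params * bytes_per_param / 2**30   # binary gibibytes

lora_rank = 16
lora_alpha = 32
lora_scaling = lora_alpha / lora_rank         # standard LoRA scaling factor

print(f"approx. FP16 size: {size_gb:.2f} GB ({size_gib:.2f} GiB)")
print(f"LoRA scaling (alpha / r): {lora_scaling}")
```

The estimate of roughly 7.6 GB is in line with the 7.65 GB listed above; the small gap is plausibly due to the rounded parameter count.
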
---

## Intended Use

### What this model does

Given a JSON report of clinical ML model performance metrics, the model:

1. Assigns a **Category** label (e.g. `Calibration Failure`, `Major Drift`, `Class Imbalance Problem`, `Healthy`)
2. Generates a concise **Explanation** with observations and recommendations

### Intended users

- ML engineers monitoring deployed clinical models
- Healthcare data science teams running periodic model audits
- Researchers studying automated model health assessment

### Out-of-scope use

- Not suitable for direct clinical decision-making or patient diagnosis
- Not a replacement for domain expert review of model performance
- Not designed for non-clinical ML tasks
- Should not be used on data types outside its training distribution (non-tabular metrics, images, etc.)

---

## How to Use

### Requirements

```bash
pip install transformers torch accelerate
```

### Basic inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PhantomAjusshi/phi3-auditor-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,  # Required for custom Phi-3 modeling files
)

# Metrics report to audit, passed to the model as a JSON string
report = """{
  "auc": 0.863,
  "accuracy": 0.83,
  "precision": 0.79,
  "recall": 0.69,
  "f1": 0.79,
  "ece": 0.278,
  "brier": 0.263,
  "drift": 0.03,
  "missing_rate": 0.003,
  "label_shift": 0.06,
  "pos_rate": 0.10,
  "data_integrity_issues": 0
}"""

prompt = (
    "<|system|>\nYou are a clinical AI auditor model.\n"
    f"<|user|>\nInstruction: Analyze the clinical model report and classify its health.\n\nReport:\n{report}\n"
    "<|assistant|>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
    )

# Decode only the newly generated tokens. Splitting the full decoded text on
# "<|assistant|>" is unreliable because skip_special_tokens=True can strip that marker.
prompt_length = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(reply)
```

### Expected output format

```
Category: Calibration Failure
Explanation: High calibration error (ECE 0.278) despite reasonable discrimination (AUC 0.863).
The model's probability outputs are poorly aligned with actual outcomes. Recommend
recalibration using Platt scaling or isotonic regression, and threshold review.
```

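
Because the reply follows a fixed `Category:` / `Explanation:` layout, downstream code can split it into fields. The following is a minimal sketch, assuming the reply matches the format shown above; `parse_audit_reply` is a hypothetical helper name, not part of this repository.

```python
import re


def parse_audit_reply(reply: str) -> dict:
    """Split a reply into its Category label and Explanation text.

    Hypothetical helper: assumes the "Category: ..." / "Explanation: ..." layout
    shown above. Fields that cannot be found are returned as None.
    """
    category = re.search(r"^Category:[ \t]*(.+)$", reply, flags=re.MULTILINE)
    explanation = re.search(
        r"^Explanation:[ \t]*(.+)", reply, flags=re.MULTILINE | re.DOTALL
    )
    return {
        "category": category.group(1).strip() if category else None,
        # Collapse line breaks inside a multi-line explanation into single spaces.
        "explanation": " ".join(explanation.group(1).split()) if explanation else None,
    }


example_reply = (
    "Category: Calibration Failure\n"
    "Explanation: High calibration error (ECE 0.278) despite reasonable "
    "discrimination (AUC 0.863).\n"
    "The model's probability outputs are poorly aligned with actual outcomes."
)
parsed = parse_audit_reply(example_reply)
print(parsed["category"])
```

Since generation is sampled, a reply may occasionally deviate from this layout, which is why the helper returns `None` for missing fields instead of raising.
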
### Input metrics reference

| Metric | Description |
|---|---|
| `auc` | Area Under the ROC Curve |
| `accuracy` | Overall classification accuracy |
| `precision` | Positive predictive value |
| `recall` | Sensitivity / true positive rate |
| `f1` | Harmonic mean of precision and recall |
| `ece` | Expected Calibration Error |
| `brier` | Brier score (probabilistic accuracy) |
| `drift` | Feature distribution drift score |
| `missing_rate` | Rate of missing input features |
| `label_shift` | Output label distribution shift |
| `pos_rate` | Positive prediction rate |
| `data_integrity_issues` | Count of detected data quality issues |

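
When the metrics come from a monitoring pipeline rather than a hand-written string, the report can be assembled programmatically. This sketch assumes the twelve keys in the table are all expected; how the model behaves with missing or extra keys is not documented here, so the strict check is an assumption. `build_report` is a hypothetical helper.

```python
import json

# Metric keys from the reference table above, in the same order.
EXPECTED_METRICS = (
    "auc", "accuracy", "precision", "recall", "f1", "ece", "brier",
    "drift", "missing_rate", "label_shift", "pos_rate", "data_integrity_issues",
)


def build_report(metrics: dict) -> str:
    """Serialize metrics to a JSON string, failing fast on missing or unknown keys.

    Hypothetical helper: the strict key check is an assumption, not a documented
    requirement of the model.
    """
    missing = [key for key in EXPECTED_METRICS if key not in metrics]
    unknown = [key for key in metrics if key not in EXPECTED_METRICS]
    if missing or unknown:
        raise ValueError(f"missing keys: {missing}; unknown keys: {unknown}")
    # Keep the table's key order so the prompt resembles the example report.
    return json.dumps({key: metrics[key] for key in EXPECTED_METRICS}, indent=2)


report = build_report({
    "auc": 0.863, "accuracy": 0.83, "precision": 0.79, "recall": 0.69,
    "f1": 0.79, "ece": 0.278, "brier": 0.263, "drift": 0.03,
    "missing_rate": 0.003, "label_shift": 0.06, "pos_rate": 0.10,
    "data_integrity_issues": 0,
})
print(report)
```
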
---

## Training Details

### Dataset

- **Name:** Custom synthetic clinical audit dataset (`audit_dataset_v2_5000.json`)
- **Size:** 5,000 labeled samples
- **Split:** 80% train (4,000) / 20% test (1,000)
- **Format:** JSONL; each record has `instruction`, `input` (metrics JSON), and `output` (category + explanation)
- **Generation date:** November 17, 2025

Each sample pairs a set of synthetic model performance metrics with a human-written audit label and explanation covering categories such as:

- Healthy / Passing
- Calibration Failure
- Major Drift / Potential Drift
- Class Imbalance Problem
- Data Integrity Issue
- Needs Review / Critical Failure

### Training procedure

The base model was loaded in 8-bit using `BitsAndBytesConfig` and adapted with LoRA targeting the attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`). After training, the LoRA adapter was merged into the base model weights using PEFT's `merge_and_unload()` and saved as full FP16 safetensors.

**Prompt format used during training:**

```
<|system|>
You are an AI auditor analyzing clinical model performance reports.
<|user|>
Instruction: Analyze the clinical model report and classify its health.

Report:
{ ...metrics JSON... }
<|assistant|>
Category: <label>
Explanation: <explanation>
```

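
To keep inference prompts aligned with this template, the prompt can be built by a small function. This is a sketch under stated assumptions: the exact whitespace used during training is not specified, so single newlines are assumed as displayed above, and `build_prompt` is a hypothetical helper.

```python
TRAINING_SYSTEM_PROMPT = (
    "You are an AI auditor analyzing clinical model performance reports."
)


def build_prompt(report_json: str, system: str = TRAINING_SYSTEM_PROMPT) -> str:
    """Build a prompt in the training layout, ending where the model should answer.

    Hypothetical helper: newline placement follows the template as displayed,
    which is an assumption about the exact training strings.
    """
    return (
        f"<|system|>\n{system}\n"
        "<|user|>\n"
        "Instruction: Analyze the clinical model report and classify its health.\n\n"
        f"Report:\n{report_json}\n"
        "<|assistant|>\n"
    )


prompt = build_prompt('{"auc": 0.863, "ece": 0.278}')
print(prompt)
```

The prompt ends at `<|assistant|>` so that generation starts with the `Category:` line.
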
### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Learning rate | 1e-4 |
| Warmup ratio | 0.1 |
| Max sequence length | 512 |
| Optimizer | AdamW (default) |
| Precision | FP16 (mixed) |

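
The derived values in this table follow from the others. The sketch below assumes a single training device, takes the final step (675) from the training loss table as the total step count, and assumes the warmup length is the warmup ratio times the total steps, rounded up; none of these details are confirmed by the training code here.

```python
import math

per_device_batch = 4
grad_accum_steps = 4
# Samples contributing to each optimizer step (single device assumed).
effective_batch = per_device_batch * grad_accum_steps

total_steps = 675      # final step reported in the training loss table
warmup_ratio = 0.1
# Rounding up is an assumption about how the trainer converts the ratio to steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"warmup steps: {warmup_steps} of {total_steps}")
```
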
### Training loss

| Step | Epoch | Loss |
|---|---|---|
| 50 | 0.22 | 1.623 |
| 100 | 0.44 | 0.657 |
| 150 | 0.67 | 0.444 |
| 200 | 0.89 | 0.420 |
| 300 | 1.33 | 0.413 |
| 450 | 2.00 | 0.412 |
| 600 | 2.67 | 0.408 |
| 675 | 3.00 | ~0.410 |

Loss dropped rapidly over the first 150 steps (from 1.623 to 0.444), then stabilized around 0.41 for the remainder of training.

---

## Evaluation

The model was evaluated on a held-out test set of 1,000 samples. Weighted precision, recall, F1, and accuracy are computed by extracting the `Category:` field from each generated output and comparing it to the ground-truth label.

> Formal evaluation metrics will be added here once a full benchmark run is completed.

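
The evaluation procedure described above can be sketched in plain Python. This is an illustration, not the project's evaluation script: the helper names are hypothetical, outputs without a parseable `Category:` line are counted as wrong (an assumption), and the sample labels are made up for the demonstration.

```python
import re
from collections import Counter


def extract_category(text: str):
    """Return the label on the 'Category:' line, or None if there is none."""
    match = re.search(r"^Category:[ \t]*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_label = true_pos / predicted if predicted else 0.0
        r_label = true_pos / support
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = support / total  # each class is weighted by its share of y_true
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: four ground-truth labels and four generated outputs.
y_true = ["Healthy", "Healthy", "Major Drift", "Calibration Failure"]
generated = [
    "Category: Healthy\nExplanation: Metrics within expected ranges.",
    "Category: Major Drift\nExplanation: Drift score is elevated.",
    "Category: Major Drift\nExplanation: Drift score is elevated.",
    "The report looks unusual.",  # no Category line, so it counts as wrong
]
y_pred = [extract_category(text) for text in generated]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

Weighted recall always equals accuracy under this definition, so the two are worth reporting together mainly as a consistency check.
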
---

## Limitations & Bias

- **Synthetic training data:** The model was trained entirely on synthetically generated audit reports. Real-world clinical model metrics may follow different distributions or contain edge cases not represented in training.
- **Label sensitivity:** The model may be sensitive to metric combinations near decision boundaries between categories.
- **No temporal reasoning:** The model does not reason about metric trends over time; each inference is based on a single snapshot of metrics.
- **English only:** All training data is in English.
- **Not a substitute for expert review:** Outputs should be treated as decision-support, not a final audit verdict.

---

## Repository & Related Work

- **Training code:** [Hospital-Audit-Trained-Model (GitHub)](https://github.com/PhantomAjusshi/Hospital-Audit-Trained-Model)
- **Web application:** [Hospital-Model-Audit-Website (GitHub)](https://github.com/PhantomAjusshi/Hospital-Model-Audit-Website), a full-stack Next.js + FastAPI interface that uses this model via llama.cpp

---

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{phi3-auditor-merged,
  author = {PhantomAjusshi},
  title = {phi3-auditor-merged: Phi-3-mini fine-tuned for clinical AI model auditing},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/PhantomAjusshi/phi3-auditor-merged}
}
```

---

## License

This model is released under the **MIT License**.

The base model ([microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)) is subject to Microsoft's Phi-3 license. Please review it before use in commercial or production settings.