ModelHub XC 17f7551cd9 Initialize project; model provided by the ModelHub XC community

Model: PhantomAjusshi/phi3-auditor-merged
Source: Original Platform

2026-06-03 07:14:19 +08:00

8.4 KiB

🏥 phi3-auditor-merged

Phi-3-mini fine-tuned for clinical AI model auditing.

This model takes a JSON object of ML performance metrics (AUC, ECE, drift, label shift, etc.) and returns a structured health classification label plus a detailed explanation — helping teams audit deployed clinical models for drift, calibration failure, class imbalance, and other issues.

Model Details

Property	Value
Base Model	microsoft/Phi-3-mini-4k-instruct
Fine-tuning Method	LoRA (Low-Rank Adaptation) via PEFT
Training Precision	8-bit quantized (BitsAndBytesConfig)
Merged Precision	FP16 (float16 safetensors)
Parameters	~3.8B
Model Size	7.65 GB (2 safetensor shards)
LoRA Rank (r)	16
LoRA Alpha	32
LoRA Dropout	0.05
Target Modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`
Task Type	Causal Language Modeling
PEFT Version	0.18.0
Training Epochs	3
Final Loss	~0.41

Intended Use

What this model does

Given a JSON report of clinical ML model performance metrics, the model:

Assigns a Category label (e.g. Calibration Failure, Major Drift, Class Imbalance Problem, Healthy)
Generates a concise Explanation with observations and recommendations

Intended users

ML engineers monitoring deployed clinical models
Healthcare data science teams running periodic model audits
Researchers studying automated model health assessment

Out-of-scope use

Not suitable for direct clinical decision-making or patient diagnosis
Not a replacement for domain expert review of model performance
Not designed for non-clinical ML tasks
Should not be used on data types outside its training distribution (non-tabular metrics, images, etc.)

How to Use

Requirements

pip install transformers torch accelerate

Basic inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PhantomAjusshi/phi3-auditor-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,  # Required for custom Phi-3 modeling files
)

report = """{
  "auc": 0.863,
  "accuracy": 0.83,
  "precision": 0.79,
  "recall": 0.69,
  "f1": 0.79,
  "ece": 0.278,
  "brier": 0.263,
  "drift": 0.03,
  "missing_rate": 0.003,
  "label_shift": 0.06,
  "pos_rate": 0.10,
  "data_integrity_issues": 0
}"""

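# Use the same system prompt as in training (see "Prompt format used during training").
prompt = (
    "<|system|>\nYou are an AI auditor analyzing clinical model performance reports.\n"
    f"<|user|>\nInstruction: Analyze the clinical model report and classify its health.\n\nReport:\n{report}\n"
    "<|assistant|>\n"
)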

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
    )

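# Decode only the newly generated tokens. Splitting the full text on "<|assistant|>"
# is unreliable because skip_special_tokens=True can strip that marker.
prompt_length = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(reply)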

Expected output format

Category: Calibration Failure
Explanation: High calibration error (ECE 0.278) despite reasonable discrimination (AUC 0.863).
The model's probability outputs are poorly aligned with actual outcomes. Recommend
recalibration using Platt scaling or isotonic regression, and threshold review.
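
Downstream code usually needs the label and the explanation as separate fields. The following is a minimal parsing sketch, assuming replies follow the two-field format shown above; `AuditResult` and `parse_audit_reply` are hypothetical names, not part of this repository.

```python
import re
from typing import NamedTuple, Optional


class AuditResult(NamedTuple):
    category: Optional[str]
    explanation: Optional[str]


# The field names come from the expected output format; the patterns are illustrative.
_CATEGORY_RE = re.compile(r"^\s*Category:\s*(.+?)\s*$", re.MULTILINE)
_EXPLANATION_RE = re.compile(r"Explanation:\s*(.+)", re.DOTALL)


def parse_audit_reply(reply: str) -> AuditResult:
    """Split a generated reply into its category and explanation (None if absent)."""
    category = _CATEGORY_RE.search(reply)
    explanation = _EXPLANATION_RE.search(reply)
    return AuditResult(
        category=category.group(1) if category else None,
        # Collapse wrapped lines into a single paragraph.
        explanation=" ".join(explanation.group(1).split()) if explanation else None,
    )
```

Because generation is sampled, a reply may occasionally omit one of the fields; returning `None` lets the caller decide whether to retry or flag the report for manual review.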

Input metrics reference

Metric	Description
`auc`	Area Under the ROC Curve
`accuracy`	Overall classification accuracy
`precision`	Positive predictive value
`recall`	Sensitivity / true positive rate
`f1`	Harmonic mean of precision and recall
`ece`	Expected Calibration Error
`brier`	Brier score (probabilistic accuracy)
`drift`	Feature distribution drift score
`missing_rate`	Rate of missing input features
`label_shift`	Output label distribution shift
`pos_rate`	Positive prediction rate
`data_integrity_issues`	Count of detected data quality issues
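
For readers who need to produce the calibration fields themselves, here is a rough sketch of how `brier` and `ece` could be computed for a binary classifier. It assumes equal-width bins over the predicted positive-class probability; the card does not state which binning scheme or bin count was used when generating the training data, so treat `n_bins=10` and both function names as illustrative.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and binary outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned gap between mean predicted probability and observed positive rate."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of samples it contains.
        ece += len(members) / n * abs(mean_prob - positive_rate)
    return ece
```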

Training Details

Dataset

Name: Custom synthetic clinical audit dataset (audit_dataset_v2_5000.json)
Size: 5,000 labeled samples
Split: 80% train (4,000) / 20% test (1,000)
Format: JSONL — each record has instruction, input (metrics JSON), output (category + explanation)
Generation date: November 17, 2025

Each sample pairs a set of synthetic model performance metrics with a human-written audit label and explanation covering categories such as:

Healthy / Passing
Calibration Failure
Major Drift / Potential Drift
Class Imbalance Problem
Data Integrity Issue
Needs Review / Critical Failure
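
The 80/20 split can be reproduced along the following lines. This is only a sketch: the card does not say whether the split was shuffled, stratified by category, or seeded, so the `seed` value and the `split_records` helper are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_records(
    records: Sequence[dict], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[dict], List[dict]]:
    """Shuffle records with a fixed seed and return (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With 5,000 records and `test_fraction=0.2`, this yields 4,000 training and 1,000 test records.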

Training procedure

The base model was loaded in 8-bit using BitsAndBytesConfig and adapted with LoRA targeting the attention projection layers (q_proj, k_proj, v_proj, o_proj). After training, the LoRA adapter was merged into the base model weights using PEFT's merge_and_unload() method (called on the PeftModel instance) and saved as full FP16 safetensors.
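
A minimal sketch of the merge step described above, assuming the trained adapter was saved to a local directory. The function name, `adapter_dir`, and `output_dir` are hypothetical, and reloading the base model in FP16 before merging is an assumption about how the FP16 weights were produced; the card only states that `merge_and_unload()` was used.

```python
def merge_lora_adapter(
    adapter_dir: str,
    output_dir: str,
    base_model_id: str = "microsoft/Phi-3-mini-4k-instruct",
) -> None:
    """Fold a LoRA adapter into the base weights and save FP16 safetensors."""
    # Heavy dependencies are imported lazily so this module imports without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the base model is reloaded in FP16 (not 8-bit) so that the
    # merged weights are plain FP16 tensors. Add trust_remote_code=True here if
    # your transformers version needs it, as in the inference example.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_dir)

    merged = model.merge_and_unload()  # applies the LoRA deltas and drops the adapter wrappers
    merged.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
```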

Prompt format used during training:

<|system|>
You are an AI auditor analyzing clinical model performance reports.
<|user|>
Instruction: Analyze the clinical model report and classify its health.

Report:
{ ...metrics JSON... }
<|assistant|>
Category: <label>
Explanation: <explanation>
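
The template above can be filled from a dataset record roughly as follows. This sketch assumes each record's `output` field is already a string in the `Category: ... / Explanation: ...` form and that `input` is either a JSON string or a dict; `format_example` is a hypothetical helper, not the author's training code.

```python
import json

SYSTEM_PROMPT = "You are an AI auditor analyzing clinical model performance reports."


def format_example(record: dict, include_output: bool = True) -> str:
    """Render one record in the training prompt format.

    With include_output=False the text stops after the assistant marker,
    which is the form needed at inference time.
    """
    report = record["input"]
    if not isinstance(report, str):
        report = json.dumps(report, indent=2)

    text = (
        f"<|system|>\n{SYSTEM_PROMPT}\n"
        f"<|user|>\nInstruction: {record['instruction']}\n\nReport:\n{report}\n"
        "<|assistant|>\n"
    )
    if include_output:
        text += record["output"]
    return text
```

Using one function for both training text and inference prompts helps keep the two formats from drifting apart.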

Hyperparameters

Parameter	Value
Epochs	3
Batch size	4
Gradient accumulation steps	4
Effective batch size	16
Learning rate	1e-4
Warmup ratio	0.1
Max sequence length	512
Optimizer	AdamW (default)
Precision	FP16 (mixed)
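
The derived values in this table follow from simple arithmetic, sketched here. Single-device training and rounding the warmup length up are assumptions; the total step count is taken from the last row of the training loss table in this card.

```python
import math

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1                # assumption: single-GPU training
total_optimizer_steps = 675    # final step reported in the training loss table
warmup_ratio = 0.1

# Samples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Number of steps spent warming up the learning rate (assumed to be rounded up).
warmup_steps = math.ceil(total_optimizer_steps * warmup_ratio)

print(effective_batch_size, warmup_steps)  # prints "16 68"
```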

Training loss

Step	Epoch	Loss
50	0.22	1.623
100	0.44	0.657
150	0.67	0.444
200	0.89	0.420
300	1.33	0.413
450	2.00	0.412
600	2.67	0.408
675	3.00	~0.410

Loss dropped sharply over the first 150 steps (from 1.623 to 0.444) and then plateaued at about 0.41 for the remainder of training.

Evaluation

The model was evaluated on a held-out test set of 1,000 samples. The Category: field was extracted from each generated output and compared with the ground-truth label to compute accuracy and weighted precision, recall, and F1.

Formal evaluation metrics will be added here once a full benchmark run is completed.
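
The scoring procedure described above can be sketched in plain Python as follows. This is an illustration rather than the author's evaluation script: `extract_category`, `weighted_scores`, and the `"UNPARSED"` placeholder for outputs without a `Category:` line are all assumptions.

```python
import re
from collections import Counter
from typing import Dict, Sequence


def extract_category(text: str) -> str:
    """Return the label after 'Category:' or a placeholder if none is found."""
    match = re.search(r"^\s*Category:\s*(.+?)\s*$", text, re.MULTILINE)
    return match.group(1) if match else "UNPARSED"


def weighted_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n  # each class contributes in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Counting unparseable outputs as a wrong prediction, rather than dropping them, keeps the scores from being inflated by malformed generations.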

Limitations & Bias

Synthetic training data: The model was trained entirely on synthetically generated audit reports. Real-world clinical model metrics may follow different distributions or contain edge cases not represented in training.
Label sensitivity: The model may be sensitive to metric combinations near decision boundaries between categories.
No temporal reasoning: The model does not reason about metric trends over time — each inference is based on a single snapshot of metrics.
English only: All training data is in English.
Not a substitute for expert review: Outputs should be treated as decision-support, not a final audit verdict.

Training code: Hospital-Audit-Trained-Model (GitHub)
Web application: Hospital-Model-Audit-Website (GitHub) — a full-stack Next.js + FastAPI interface that uses this model via llama.cpp

Citation

If you use this model in your work, please cite:

@misc{phi3-auditor-merged,
  author       = {PhantomAjusshi},
  title        = {phi3-auditor-merged: Phi-3-mini fine-tuned for clinical AI model auditing},
  year         = {2025},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/PhantomAjusshi/phi3-auditor-merged}
}

License

This model is released under the MIT License.

The base model (microsoft/Phi-3-mini-4k-instruct) is subject to Microsoft's Phi-3 license. Please review it before use in commercial or production settings.
