ModelHub XC 17f7551cd9 Initialize project; model provided by the ModelHub XC community
Model: PhantomAjusshi/phi3-auditor-merged
Source: Original Platform
2026-06-03 07:14:19 +08:00

license: mit
language: en
base_model: microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
library_name: transformers
tags: phi3, lora, peft, clinical-ai, model-audit, text-generation, fine-tuned, healthcare, safetensors

🏥 phi3-auditor-merged

Phi-3-mini fine-tuned for clinical AI model auditing.

This model takes a JSON object of ML performance metrics (AUC, ECE, drift, label shift, etc.) and returns a structured health classification label plus a detailed explanation — helping teams audit deployed clinical models for drift, calibration failure, class imbalance, and other issues.


Model Details

Property              Value
Base Model            microsoft/Phi-3-mini-4k-instruct
Fine-tuning Method    LoRA (Low-Rank Adaptation) via PEFT
Training Precision    8-bit quantized (BitsAndBytesConfig)
Merged Precision      FP16 (float16 safetensors)
Parameters            ~3.8B
Model Size            7.65 GB (2 safetensor shards)
LoRA Rank (r)         16
LoRA Alpha            32
LoRA Dropout          0.05
Target Modules        q_proj, k_proj, v_proj, o_proj
Task Type             Causal Language Modeling
PEFT Version          0.18.0
Training Epochs       3
Final Loss            ~0.41

Intended Use

What this model does

Given a JSON report of clinical ML model performance metrics, the model:

  1. Assigns a Category label (e.g. Calibration Failure, Major Drift, Class Imbalance Problem, Healthy)
  2. Generates a concise Explanation with observations and recommendations

Intended users

  • ML engineers monitoring deployed clinical models
  • Healthcare data science teams running periodic model audits
  • Researchers studying automated model health assessment

Out-of-scope use

  • Not suitable for direct clinical decision-making or patient diagnosis
  • Not a replacement for domain expert review of model performance
  • Not designed for non-clinical ML tasks
  • Should not be used on data types outside its training distribution (non-tabular metrics, images, etc.)

How to Use

Requirements

pip install transformers torch accelerate

Basic inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PhantomAjusshi/phi3-auditor-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,  # Required for custom Phi-3 modeling files
)

report = """{
  "auc": 0.863,
  "accuracy": 0.83,
  "precision": 0.79,
  "recall": 0.69,
  "f1": 0.74,
  "ece": 0.278,
  "brier": 0.263,
  "drift": 0.03,
  "missing_rate": 0.003,
  "label_shift": 0.06,
  "pos_rate": 0.10,
  "data_integrity_issues": 0
}"""

prompt = (
    f"<|system|>\nYou are a clinical AI auditor model.\n"
    f"<|user|>\nInstruction: Analyze the clinical model report and classify its health.\n\nReport:\n{report}\n"
    f"<|assistant|>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
    )

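# Decode only the newly generated tokens. Slicing off the prompt is more robust than
# splitting on "<|assistant|>", which skip_special_tokens=True may strip from the text.
prompt_length = inputs["input_ids"].shape[-1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(reply)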

Expected output format

Category: Calibration Failure
Explanation: High calibration error (ECE 0.278) despite reasonable discrimination (AUC 0.863).
The model's probability outputs are poorly aligned with actual outcomes. Recommend
recalibration using Platt scaling or isotonic regression, and threshold review.
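
The two-field format above is easy to parse downstream. The following is a minimal sketch, assuming the reply contains a `Category:` line followed by an `Explanation:` field. `AuditResult` and `parse_audit_reply` are hypothetical names for illustration and are not part of the model repository.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditResult:
    category: str
    explanation: str


# "Category:" captures the rest of its line; "Explanation:" captures everything after it.
_REPLY_PATTERN = re.compile(
    r"Category:\s*(?P<category>[^\n]+)\s*Explanation:\s*(?P<explanation>.+)",
    re.DOTALL,
)


def parse_audit_reply(reply: str) -> Optional[AuditResult]:
    """Return the parsed fields, or None if the reply does not match the format."""
    match = _REPLY_PATTERN.search(reply)
    if match is None:
        return None
    # Collapse line wraps inside the explanation into single spaces.
    explanation = " ".join(match.group("explanation").split())
    return AuditResult(match.group("category").strip(), explanation)


example_reply = (
    "Category: Calibration Failure\n"
    "Explanation: High calibration error (ECE 0.278) despite reasonable\n"
    "discrimination (AUC 0.863)."
)
result = parse_audit_reply(example_reply)
print(result.category)
```

Returning `None` for a malformed reply keeps the caller in charge of how to handle generations that drift from the expected format.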

Input metrics reference

Metric                   Description
auc                      Area Under the ROC Curve
accuracy                 Overall classification accuracy
precision                Positive predictive value
recall                   Sensitivity / true positive rate
f1                       Harmonic mean of precision and recall
ece                      Expected Calibration Error
brier                    Brier score (probabilistic accuracy)
drift                    Feature distribution drift score
missing_rate             Rate of missing input features
label_shift              Output label distribution shift
pos_rate                 Positive prediction rate
data_integrity_issues    Count of detected data quality issues
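
Before a report is sent to the model, it can help to check that all twelve keys are present and plausible. The sketch below is one way to do that. The key names come from the reference above, while the numeric ranges are assumptions based on the usual definitions of these metrics (the source does not document valid ranges), and `validate_report` is a hypothetical helper.

```python
import json
from typing import List

# Metrics that are conventionally bounded to [0, 1] (an assumption, see above).
UNIT_INTERVAL_KEYS = [
    "auc", "accuracy", "precision", "recall", "f1",
    "ece", "brier", "missing_rate", "pos_rate",
]
# Keys whose scale the source does not specify; only their presence is checked.
OTHER_KEYS = ["drift", "label_shift", "data_integrity_issues"]


def validate_report(report_json: str) -> List[str]:
    """Return a list of problems; an empty list means the report looks usable."""
    report = json.loads(report_json)
    problems = []
    for key in UNIT_INTERVAL_KEYS + OTHER_KEYS:
        if key not in report:
            problems.append(f"missing key: {key}")
    for key in UNIT_INTERVAL_KEYS:
        value = report.get(key)
        if isinstance(value, (int, float)) and not 0.0 <= value <= 1.0:
            problems.append(f"{key}={value} is outside [0, 1]")
    issues = report.get("data_integrity_issues")
    if issues is not None and (not isinstance(issues, int) or issues < 0):
        problems.append("data_integrity_issues should be a non-negative integer")
    return problems


print(validate_report('{"auc": 1.2, "ece": 0.05}'))
```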

Training Details

Dataset

  • Name: Custom synthetic clinical audit dataset (audit_dataset_v2_5000.json)
  • Size: 5,000 labeled samples
  • Split: 80% train (4,000) / 20% test (1,000)
  • Format: JSONL — each record has instruction, input (metrics JSON), output (category + explanation)
  • Generation date: November 17, 2025
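
As an illustration of the record layout and the 80/20 split, here is a small sketch. The three field names are taken from the list above. Whether `input` is stored as a JSON string or a nested object is not stated, so the string form here is an assumption, as are the shuffle seed and the hypothetical `split_records` helper.

```python
import json
import random

# One record with the three documented fields. Storing `input` as a JSON string
# is an assumption; the source does not say whether it is a string or an object.
record = {
    "instruction": "Analyze the clinical model report and classify its health.",
    "input": json.dumps({"auc": 0.863, "ece": 0.278, "drift": 0.03}),
    "output": (
        "Category: Calibration Failure\n"
        "Explanation: High calibration error (ECE 0.278) despite reasonable "
        "discrimination (AUC 0.863)."
    ),
}
jsonl_line = json.dumps(record)  # JSONL stores one such object per line


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in integers instead of real records, just to show the split sizes.
train, test = split_records(range(5000))
print(len(train), len(test))  # 4000 1000
```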

Each sample pairs a set of synthetic model performance metrics with a human-written audit label and explanation covering categories such as:

  • Healthy / Passing
  • Calibration Failure
  • Major Drift / Potential Drift
  • Class Imbalance Problem
  • Data Integrity Issue
  • Needs Review / Critical Failure

Training procedure

The base model was loaded in 8-bit using BitsAndBytesConfig and adapted with LoRA targeting the attention projection layers (q_proj, k_proj, v_proj, o_proj). After training, the LoRA adapter was merged into the base model weights with PEFT's merge_and_unload() method (called on the PEFT-wrapped model), and the result was saved as full FP16 safetensors.
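
Conceptually, merging folds each low-rank update into its dense weight matrix: W' = W + (alpha / r) * B A. The toy sketch below shows only that arithmetic in plain Python, using the documented rank and alpha for the scaling factor. It is not PEFT's implementation, the 2x2 matrices are made up for illustration, and `matmul` and `merge_lora` are hypothetical helpers.

```python
LORA_RANK = 16   # r, from the model details
LORA_ALPHA = 32  # alpha, from the model details
SCALING = LORA_ALPHA / LORA_RANK  # standard LoRA scaling factor, here 2.0


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, scaling=SCALING):
    """Fold a low-rank update into a dense weight: W + scaling * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 update (real layers are far larger).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.0]]   # shape (out_features, rank)
A = [[1.0, 2.0]]     # shape (rank, in_features)
merged = merge_lora(W, B, A)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, inference needs only the single dense matrix, which is why the merged checkpoint loads with plain transformers and no PEFT dependency.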

Prompt format used during training:

<|system|>
You are an AI auditor analyzing clinical model performance reports.
<|user|>
Instruction: Analyze the clinical model report and classify its health.

Report:
{ ...metrics JSON... }
<|assistant|>
Category: <label>
Explanation: <explanation>
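
To keep inference prompts aligned with this layout, the template can be wrapped in a small helper. This is a sketch under assumptions: the exact whitespace between tags and the JSON indentation used in training are not documented, and `build_prompt` is a hypothetical function name.

```python
import json

SYSTEM_PROMPT = "You are an AI auditor analyzing clinical model performance reports."
INSTRUCTION = "Analyze the clinical model report and classify its health."


def build_prompt(metrics: dict) -> str:
    """Render the metrics into the training-time layout, ending at the assistant tag."""
    # indent=2 is an assumption; the source does not show the exact JSON formatting.
    report = json.dumps(metrics, indent=2)
    return (
        f"<|system|>\n{SYSTEM_PROMPT}\n"
        f"<|user|>\nInstruction: {INSTRUCTION}\n\nReport:\n{report}\n"
        f"<|assistant|>\n"
    )


prompt = build_prompt({"auc": 0.863, "ece": 0.278})
print(prompt)
```

The prompt stops right after the assistant tag so that generation begins at the `Category:` line.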

Hyperparameters

Parameter                      Value
Epochs                         3
Batch size                     4
Gradient accumulation steps    4
Effective batch size           16
Learning rate                  1e-4
Warmup ratio                   0.1
Max sequence length            512
Optimizer                      AdamW (default)
Precision                      FP16 (mixed)

Training loss

Step    Epoch    Loss
50      0.22     1.623
100     0.44     0.657
150     0.67     0.444
200     0.89     0.420
300     1.33     0.413
450     2.00     0.412
600     2.67     0.408
675     3.00     ~0.410

Loss dropped quickly, from 1.623 at step 50 to 0.444 by step 150, and then plateaued at about 0.41 for the remainder of training.
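
The hyperparameters and the step counts in the log are linked by simple arithmetic, sketched below. It assumes a single training device and that warmup steps are rounded up; neither is stated in the source. The resulting 3,600 samples per epoch is derived from the log rather than documented, and the source does not explain why it is below the 4,000-sample training split.

```python
import math

PER_DEVICE_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
EPOCHS = 3
WARMUP_RATIO = 0.1
TOTAL_STEPS = 675  # last step in the training loss log

# Assumes a single device, so there is no data-parallel multiplier.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
steps_per_epoch = TOTAL_STEPS // EPOCHS
samples_per_epoch = steps_per_epoch * effective_batch_size
# Rounding up is an assumption about how the trainer derives warmup steps.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)

print(effective_batch_size)  # 16
print(steps_per_epoch)       # 225
print(samples_per_epoch)     # 3600
print(warmup_steps)          # 68
```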


Evaluation

The model was evaluated on a held-out test set of 1,000 samples. The Category: field was extracted from each generated output and compared with the ground-truth label, and weighted precision, recall, F1, and accuracy were computed from those comparisons.

Formal evaluation metrics will be added here once a full benchmark run is completed.
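
The scoring procedure described in this section can be sketched without external dependencies. The source does not name the library used, so this is an illustration rather than the author's script: "weighted" is interpreted here as averaging per-class scores by the number of true samples in each class, unparseable generations are given a sentinel label (an assumption), and the example outputs are made up.

```python
import re
from collections import Counter


def extract_category(generated: str) -> str:
    """Pull the label from the 'Category:' line; unparseable outputs get a sentinel."""
    match = re.search(r"Category:\s*([^\n]+)", generated)
    return match.group(1).strip() if match else "<unparsed>"


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over the true labels."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        p_label = tp / n_pred if n_pred else 0.0
        r_label = tp / n_true
        f_label = 2 * p_label * r_label / (p_label + r_label) if tp else 0.0
        weight = n_true / total
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up generations and labels, only to exercise the functions.
outputs = [
    "Category: Healthy\nExplanation: Metrics within expected ranges.",
    "Category: Major Drift\nExplanation: Drift score is elevated.",
    "Category: Healthy\nExplanation: Metrics within expected ranges.",
    "no label produced",
]
truth = ["Healthy", "Major Drift", "Major Drift", "Healthy"]
preds = [extract_category(o) for o in outputs]
scores = weighted_scores(truth, preds)
print(scores)
```

With support weighting, weighted recall equals accuracy, which is a quick sanity check on any implementation.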


Limitations & Bias

  • Synthetic training data: The model was trained entirely on synthetically generated audit reports. Real-world clinical model metrics may follow different distributions or contain edge cases not represented in training.
  • Label sensitivity: The model may be sensitive to metric combinations near decision boundaries between categories.
  • No temporal reasoning: The model does not reason about metric trends over time — each inference is based on a single snapshot of metrics.
  • English only: All training data is in English.
  • Not a substitute for expert review: Outputs should be treated as decision-support, not a final audit verdict.


Citation

If you use this model in your work, please cite:

@misc{phi3-auditor-merged,
  author       = {PhantomAjusshi},
  title        = {phi3-auditor-merged: Phi-3-mini fine-tuned for clinical AI model auditing},
  year         = {2025},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/PhantomAjusshi/phi3-auditor-merged}
}

License

This model is released under the MIT License.

The base model (microsoft/Phi-3-mini-4k-instruct) is published by Microsoft under its own terms (the MIT License, per its model card). Review the base model's license before use in commercial or production settings.
