---
base_model:
- unsloth/Llama-3.2-1B-Instruct
tags:
- text-generation
- transformers
- unsloth
- llama-3.2
- clinical
- medical
- text-cleaning
- finetuned
license: other
language:
- eng
datasets:
- custom/clinical-text-cleaning
---

<div align="center">

# 🩺 clinical-info-corrector-1B

**Finetuned Llama 3.2 1B Instruct model for cleaning clinical text (histories, notes & staff remarks)**

<img src="https://huggingface.co/alphaaico/clinical-info-corrector-1B/resolve/main/clinical-info-corrector-1B.png"
     alt="clinical-info-corrector-1B"
     style="width: 500px;
            height: auto;
            object-position: center top;">

</div>

---

## Model Card

- **Model name:** `clinical-info-corrector-1B`
- **Base model:** `unsloth/Llama-3.2-1B-Instruct`
- **Finetuning framework:** [Unsloth](https://github.com/unslothai/unsloth) + TRL `SFTTrainer`
- **Task:** Clinical text *cleaning & rephrasing* (not diagnosis)
- **Status:** Experimental (trained on a relatively small, noisy custom dataset)

This model is designed to take **messy clinical content** (patient histories, discharge notes, staff remarks, “adviced” sections, etc.) and rewrite it into **clear, grammatically correct, medically faithful text** while preserving the underlying facts.

It is *not* a diagnostic or decision-support model. Think of it as a domain-aware “clinical copy-editor” for structured text.

---

## Motivation & Intended Use

Many hospital information systems accumulate:

- telegraphic note fragments
- inconsistent casing (mixed case, ALL CAPS) and abbreviations everywhere
- repeated or contradictory phrases
- encoding artefacts (e.g., `Â`, broken bullets)

The goal of `clinical-info-corrector-1B` is to explore whether even a **small 1B-parameter model**, finetuned on a **crude but realistic dataset**, can:

- clean up this text,
- normalize style and grammar,
- expand common medical abbreviations where helpful, and
- keep the **clinical meaning intact**.

**Primary intended uses:**

- Pre-processing / normalization of:
  - patient histories
  - discharge summaries
  - clinician notes / staff remarks
  - “advice” / follow-up instructions
- Creating cleaner text for:
  - analytics / NLP pipelines
  - downstream LLMs (e.g., for summarization or coding)
  - internal documentation

**Not intended for:**

- Medical diagnosis or triage
- Treatment recommendations
- Patient-facing communication without clinician review

---

## Training Data & Format

The model was finetuned on a **custom clinical dataset** derived from hospital discharge summaries and related fields. Each training sample is a pair:

- `content` – noisy / raw clinical text (e.g. the original `HISTORY`, `treatment`, `adviced`, and `mdescript1` fields)
- `response` – cleaned version of that text (grammar fixed, abbreviations expanded, structure improved, artefacts removed)

Example pair (simplified):

```text
content:
PATIENT ADMITTED WITH ABOVE SAID COMPLAINT FOR FURTHER EVALUATION AND MANAGEMENT. -H/O LAP CHOLECYSTECTOMY(12 YEAR BACK

response:
The patient was admitted with the above-mentioned complaints for further evaluation and management.
The patient has a history of laparoscopic cholecystectomy performed 12 years ago.
```

During finetuning, each row was wrapped as a **chat-style conversation**:

```json
[
  {"role": "system", "content": "<system prompt (see below)>"},
  {"role": "user", "content": "<content>"},
  {"role": "assistant", "content": "<response>"}
]
```

The Llama 3.x chat template (`llama-3.1` in Unsloth) was used to convert this into the final training text.
Training used **`train_on_responses_only`**, so only the assistant segment (`response`) contributed to the loss.

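
The wrapping step described above could be sketched roughly as follows. This is a minimal illustration, not the actual training script: the helper name `to_conversation` and the placeholder prompt string are assumptions, and only the `content` / `response` keys and the three roles come from the description above.

```python
from typing import Dict, List

# Placeholder (assumption): in practice this would be the full training system prompt.
SYSTEM_PROMPT = "<system prompt>"


def to_conversation(row: Dict[str, str], system_prompt: str = SYSTEM_PROMPT) -> List[Dict[str, str]]:
    """Wrap one (content, response) row as a three-turn chat conversation."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": row["content"]},
        {"role": "assistant", "content": row["response"]},
    ]


row = {
    "content": "-H/O LAP CHOLECYSTECTOMY(12 YEAR BACK",
    "response": "The patient has a history of laparoscopic cholecystectomy performed 12 years ago.",
}
conversation = to_conversation(row)
print([turn["role"] for turn in conversation])
```
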
---

## System Prompt (Baked-in Behavior)

During training, a system message of this form was used:

```text
You are a medical AI assistant that rewrites noisy, telegraphic, or poorly formatted clinical text
(eg, discharge summaries, admission notes, advice sections) into clear, grammatically correct,
contextually rich prose.

Given some raw 'content' from a clinician, you must:
- preserve all medically relevant facts and timelines,
- avoid inventing new clinical information,
- expand or clarify abbreviations when appropriate,
- fix spelling, grammar, punctuation, and formatting,
- turn lists or fragments into coherent sentences and paragraphs.

Your reply should be only the cleaned clinical text, nothing else.
```

For best results, you should **re-use a similar system prompt at inference time**.

---

## How to Use

### 1. Basic Chat-style Inference (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "your-username/clinical-info-corrector-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype = torch.float16,
    device_map = "auto",
)

SYSTEM_PROMPT = """
You are a medical AI assistant that rewrites noisy, telegraphic, or poorly formatted clinical text
into clear, grammatically correct, medically faithful prose. Preserve all medical facts and do not
invent new information. Your reply should be only the cleaned clinical text.
""".strip()

raw_content = "PATIENT ADMITTED WITH ABOVE SAID COMPLAINT FOR FURTHER EVALUATION AND MANAGEMENT. -H/O LAP CHOLECYSTECTOMY(12 YEAR BACK"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": raw_content},
]

# Apply the same chat template that was used during finetuning.
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template = "llama-3.1")

# return_dict = True returns input_ids and attention_mask, so **inputs works below.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_dict = True,
    return_tensors = "pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample = True,  # needed for temperature / top_p to take effect
    temperature = 0.7,
    top_p = 0.9,
)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens = True))
```

### 2. Recommended Prompt Pattern

**User message:**

> “Please clean and normalize the following clinical text. Preserve all clinical facts and do not add anything new:
> `<raw content here>`”

**System message:** same as above, or a shorter variant:

> “You are a clinical text cleaning assistant. Rewrite the input into clear, grammatically correct, medically faithful prose, preserving all facts.”

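
The prompt pattern above could be assembled with a small helper along these lines. This is only a sketch: the names `build_messages`, `SHORT_SYSTEM_PROMPT`, and `USER_INSTRUCTION` are hypothetical, and joining the instruction and the raw text with a newline is an assumption about layout. The prompt wording itself is taken from the pattern above.

```python
from typing import Dict, List

SHORT_SYSTEM_PROMPT = (
    "You are a clinical text cleaning assistant. Rewrite the input into clear, "
    "grammatically correct, medically faithful prose, preserving all facts."
)

USER_INSTRUCTION = (
    "Please clean and normalize the following clinical text. "
    "Preserve all clinical facts and do not add anything new:"
)


def build_messages(raw_content: str, system_prompt: str = SHORT_SYSTEM_PROMPT) -> List[Dict[str, str]]:
    """Build the system + user messages for one cleaning request."""
    # Assumption: instruction and raw text are separated by a single newline.
    user_message = f"{USER_INSTRUCTION}\n{raw_content.strip()}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]


messages = build_messages("  DAILY FLUID INTAKE 1.5LTR/DAY  ")
print(messages[1]["content"])
```
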
---

## Example

**Input (user):**

```text
DAILY FLUID INTAKE 1.5LTR/DAY Check CBC, RFT,K+, RBS,CXR, ECG, in follow up Chest physiotherapy, wound hygiene & proper diet. Steam Inhalation three times a day. Syp Alex or TusQDx 2 spoons threee times a day if dry cough is present FBS AND PPBS AFTER 3DAYS IF BLOOD SUGAR IS <80 OR >200mg/dl THEN CONSULT to local physician. WHILE SPLEEPING MUST LIE SUPINE FOR ONE MONTH, CAN REMOVE BELT IN NIGHT ONLY.
```

**Output (assistant):**

> Maintain a daily fluid intake of about 1.5 L per day. At the follow-up visit, get a complete blood count (CBC), renal function tests (RFT), serum potassium (K+), random blood sugar (RBS), chest X-ray (CXR), and electrocardiogram (ECG) done. Continue chest physiotherapy, maintain good wound hygiene, and follow a nutritious diet. Do steam inhalation three times daily. If you have a dry cough, take syrup Alex or TusQDx 2 spoonfuls three times daily. Check fasting and post-meal blood sugar after 3 days; if it is less than 80 mg/dL or more than 200 mg/dL, consult your local physician. While sleeping for the next month, lie on your back (supine), and you may remove the belt only at night.

*(Output is illustrative of the intended style; exact generations may vary.)*

---

## Training Details

- **Base model:** `unsloth/Llama-3.2-1B-Instruct`
- **Method:** Supervised fine-tuning (SFT)
- **Library:** Unsloth + TRL `SFTTrainer`
- **Loss masking:** `train_on_responses_only` (only assistant outputs)
- **Context length:** 2048
- **Objective:** Minimize loss on the cleaned `response`, conditioned on the system prompt and the user `content`

Because the dataset is **relatively small and noisy**, this model should be considered **experimental** and evaluated carefully before integration into production pipelines.

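
To make the loss-masking bullet concrete, here is a conceptual sketch of response-only masking on a toy token sequence. It is not Unsloth's implementation of `train_on_responses_only`. The function name `mask_prompt_labels` and the token ids are made up for illustration. The one borrowed detail is the label value `-100`, which PyTorch's cross-entropy loss ignores by default.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids: List[int], prompt_length: int) -> List[int]:
    """Return labels in which the first `prompt_length` tokens are excluded from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Toy ids (hypothetical): the first four stand for system + user tokens,
# the last three for the assistant response.
input_ids = [11, 12, 13, 14, 21, 22, 23]
labels = mask_prompt_labels(input_ids, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```
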
---

## Suggested Hyperparameters (Inference)

Tune these to your needs; a reasonable starting point is:

- **Temperature:** `0.3–0.7`
- **Top-p:** `0.9`
- **Max new tokens:** `256–512`
- **Repetition penalty:** optional (`1.0–1.1`)

Lower temperatures help keep the model closer to “copy-editor” mode rather than creative rewriting.

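
One way to keep these settings in a single place is a small helper that returns keyword arguments for `model.generate`. The helper name `generation_kwargs` and its defaults (the low end of the temperature range, the high end of the token budget, a mild repetition penalty) are assumptions chosen for illustration, not values validated by the author.

```python
from typing import Any, Dict


def generation_kwargs(
    temperature: float = 0.3,
    max_new_tokens: int = 512,
    repetition_penalty: float = 1.05,
) -> Dict[str, Any]:
    """Collect sampling settings within the ranges suggested above."""
    if not 0.0 < temperature <= 1.0:
        raise ValueError("temperature is expected to be in (0, 1]")
    return {
        "do_sample": True,  # sampling must be enabled for temperature / top_p to apply
        "temperature": temperature,
        "top_p": 0.9,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
    }


settings = generation_kwargs()
print(settings)
# Intended use, assuming `model` and `inputs` from the inference example:
# outputs = model.generate(**inputs, **settings)
```
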
---

## Limitations & Risks

- **Not a medical device**:
  This model does *not* replace clinical judgment, and should not be used for diagnosis, triage, or direct treatment recommendations.

- **Data coverage**:
  Trained on a specific institution’s style and a limited range of note types. May not generalize to all clinical documentation formats.

- **Factual hallucination**:
  While the training objective and system prompt encourage fidelity to the source, the model can still omit details, rephrase them incorrectly, or introduce unsupported ones. **A human clinician must review outputs.**

- **Biases**:
  Any biases or idiosyncrasies present in the source data may be reflected in the cleaned text.

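
As a lightweight aid to the human review mentioned above (not a replacement for it), a pipeline could flag outputs in which a number from the source no longer appears. The sketch below is a crude heuristic under stated assumptions: the function name `missing_numbers` is hypothetical, only digit-based numbers are compared, and spelled-out quantities such as "three times" are not handled.

```python
import re
from typing import List

# Matches integers and simple decimals such as "3", "80", "1.5".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def missing_numbers(source: str, cleaned: str) -> List[str]:
    """Return numbers that occur in `source` but not in `cleaned`, in source order."""
    cleaned_numbers = set(NUMBER_PATTERN.findall(cleaned))
    missing: List[str] = []
    for number in NUMBER_PATTERN.findall(source):
        if number not in cleaned_numbers and number not in missing:
            missing.append(number)
    return missing


source = "DAILY FLUID INTAKE 1.5LTR/DAY. FBS AND PPBS AFTER 3DAYS IF BLOOD SUGAR IS <80 OR >200mg/dl"
cleaned = (
    "Maintain a daily fluid intake of about 1.5 L per day. "
    "Check blood sugar after 3 days; if it is less than 80 mg/dL, consult your physician."
)
print(missing_numbers(source, cleaned))  # ['200']
```
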
---

## License

The `license: other` field is a placeholder.
Before public or commercial use, **please set the correct license** for this model and its training data in accordance with your institutional and legal requirements.

---

## Acknowledgments

- [Unsloth](https://github.com/unslothai/unsloth) for efficient finetuning on small hardware.
- The maintainers of Llama 3.x models for providing strong base models.
- Clinical staff and data teams whose documentation (anonymized and processed) made it possible to experiment with this type of text-cleaning model.

---

## Disclaimer

This model is intended **solely** for research and internal tooling around **text normalization** of clinical documentation. It must **not** be used to:

- make or suggest diagnoses,
- decide treatment options,
- communicate directly with patients, or
- operate without human clinical oversight.

Always have qualified healthcare professionals review and validate outputs before any clinical use.