---
base_model:
- unsloth/Llama-3.2-1B-Instruct
tags:
- text-generation
- transformers
- unsloth
- llama-3.2
- clinical
- medical
- text-cleaning
- finetuned
license: other
language:
- eng
datasets:
- custom/clinical-text-cleaning
---

<div align="center">

# 🩺 clinical-info-corrector-1B

**Finetuned Llama 3.2 1B Instruct model for cleaning clinical text (histories, notes & staff remarks)**

<img src="https://huggingface.co/alphaaico/clinical-info-corrector-1B/resolve/main/clinical-info-corrector-1B.png"
     alt="clinical-info-corrector-1B"
     style="width: 500px;
            height: auto;
            object-position: center top;">

</div>

---

## Model Card

- **Model name:** `clinical-info-corrector-1B`
- **Base model:** `unsloth/Llama-3.2-1B-Instruct`
- **Finetuning framework:** [Unsloth](https://github.com/unslothai/unsloth) + TRL `SFTTrainer`
- **Task:** Clinical text *cleaning & rephrasing* (not diagnosis)
- **Status:** Experimental; trained on a relatively small, noisy custom dataset

This model is designed to take **messy clinical content** (patient histories, discharge notes, staff remarks, “adviced” sections, etc.) and rewrite it into **clear, grammatically correct, medically faithful text** while preserving the underlying facts.

It is *not* a diagnostic or decision-support model. Think of it as a domain-aware “clinical copy-editor” for structured text.

---

## Motivation & Intended Use

Many hospital information systems accumulate:

- telegraphic note fragments
- mixed-case or ALL-CAPS text with abbreviations everywhere
- repeated or contradictory phrases
- encoding artefacts (e.g., `Â`, broken bullets, etc.)

The goal of `clinical-info-corrector-1B` is to explore whether even a **small 1B-parameter model**, fine-tuned on a **crude but realistic dataset**, can:

- clean up this text,
- normalize style and grammar,
- expand common medical abbreviations where helpful, and
- keep the **clinical meaning intact**.

**Primary intended uses:**

- Pre-processing / normalization of:
  - patient histories
  - discharge summaries
  - clinician notes / staff remarks
  - “advice” / follow-up instructions
- Creating cleaner text for:
  - analytics / NLP pipelines
  - downstream LLMs (e.g., for summarization or coding)
  - internal documentation

**Not intended for:**

- Medical diagnosis or triage
- Treatment recommendations
- Patient-facing communication without clinician review

---

## Training Data & Format

The model was finetuned on a **custom clinical dataset** derived from hospital discharge summaries and related fields. Each training sample is a pair:

- `content`: noisy / raw clinical text (e.g., the original `HISTORY`, `treatment`, `adviced`, and `mdescript1` fields)
- `response`: cleaned version of that text (grammar fixed, abbreviations expanded, structure improved, artefacts removed)

Example pair (simplified):

```text
content:
PATIENT ADMITTED WITH ABOVE SAID COMPLAINT FOR FURTHER EVALUATION AND MANAGEMENT. -H/O LAP CHOLECYSTECTOMY(12 YEAR BACK

response:
The patient was admitted with the above-mentioned complaints for further evaluation and management.
The patient has a history of laparoscopic cholecystectomy performed 12 years ago.
```

During finetuning, each row was wrapped as a **chat-style conversation**:

```json
[
  {"role": "system", "content": "<system prompt (see below)>"},
  {"role": "user", "content": "<content>"},
  {"role": "assistant", "content": "<response>"}
]
```

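As a rough illustration of this wrapping step, the sketch below turns one `content`/`response` row into that message list. The helper name `row_to_messages` and the shortened `SYSTEM_PROMPT` are assumptions made for the example; the actual training script is not part of this card.

```python
# Hypothetical helper: the real training code is not shown in this card.
# Shortened stand-in for the full system prompt documented further down.
SYSTEM_PROMPT = "You are a medical AI assistant that rewrites noisy clinical text."


def row_to_messages(row, system_prompt=SYSTEM_PROMPT):
    """Wrap one dataset row ({'content': ..., 'response': ...}) as a chat conversation."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": row["content"]},
        {"role": "assistant", "content": row["response"]},
    ]


row = {
    "content": "-H/O LAP CHOLECYSTECTOMY(12 YEAR BACK",
    "response": "The patient has a history of laparoscopic cholecystectomy performed 12 years ago.",
}
messages = row_to_messages(row)
print([m["role"] for m in messages])
```

A list in this shape is what a tokenizer's chat template would then render into a single training string.
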
The Llama 3.x chat template (`llama-3.1` in Unsloth) was used to convert this into the final training text.
Training used **`train_on_responses_only`**, so only the assistant segment (`response`) contributed to the loss.

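Conceptually, response-only training masks every token outside the assistant segment so that the loss ignores it. The sketch below shows the idea on a toy token list. It is not Unsloth's implementation: `mask_non_response` and the toy token ids are hypothetical, and the only borrowed fact is that `-100` is the label value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_response(token_ids, response_start):
    """Return labels in which only tokens from response_start onward count toward the loss."""
    return [
        tok if i >= response_start else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]


# Toy example: 5 prompt tokens (system + user) followed by 3 assistant tokens.
token_ids = [11, 12, 13, 14, 15, 21, 22, 23]
labels = mask_non_response(token_ids, response_start=5)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```
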
---

## System Prompt (Baked-in Behavior)

During training, a system message of this form was used:

```text
You are a medical AI assistant that rewrites noisy, telegraphic, or poorly formatted clinical text
(eg, discharge summaries, admission notes, advice sections) into clear, grammatically correct,
contextually rich prose.

Given some raw 'content' from a clinician, you must:
- preserve all medically relevant facts and timelines,
- avoid inventing new clinical information,
- expand or clarify abbreviations when appropriate,
- fix spelling, grammar, punctuation, and formatting,
- turn lists or fragments into coherent sentences and paragraphs.

Your reply should be only the cleaned clinical text, nothing else.
```

For best results, you should **re-use a similar system prompt at inference time**.

---

## How to Use

### 1. Basic Chat-style Inference (Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the actual Hugging Face repo id of this model.
model_name = "your-username/clinical-info-corrector-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype = torch.float16,
    device_map = "auto",
)

SYSTEM_PROMPT = """
You are a medical AI assistant that rewrites noisy, telegraphic, or poorly formatted clinical text
into clear, grammatically correct, medically faithful prose. Preserve all medical facts and do not
invent new information. Your reply should be only the cleaned clinical text.
""".strip()

raw_content = "PATIENT ADMITTED WITH ABOVE SAID COMPLAINT FOR FURTHER EVALUATION AND MANAGEMENT. -H/O LAP CHOLECYSTECTOMY(12 YEAR BACK"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": raw_content},
]

# Apply the same chat template that was used during training (requires `unsloth`).
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template = "llama-3.1")

# return_dict = True yields input_ids and attention_mask, so the result can be unpacked into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample = True,  # temperature and top_p only take effect when sampling is enabled
    temperature = 0.7,
    top_p = 0.9,
)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens = True))
```

### 2. Recommended Prompt Pattern

**User message:**

> “Please clean and normalize the following clinical text. Preserve all clinical facts and do not add anything new:
> `<raw content here>`”

**System message:** same as above, or a shorter variant:

> “You are a clinical text cleaning assistant. Rewrite the input into clear, grammatically correct, medically faithful prose, preserving all facts.”

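One way to apply this pattern programmatically is sketched below, assuming the shorter system prompt variant. The names `USER_TEMPLATE`, `SHORT_SYSTEM_PROMPT`, and `build_messages` are illustrative, not part of any released code. The returned list has the shape a tokenizer's `apply_chat_template` expects, as in the inference example in section 1.

```python
# Illustrative names; only the prompt wording comes from this card.
USER_TEMPLATE = (
    "Please clean and normalize the following clinical text. "
    "Preserve all clinical facts and do not add anything new:\n{raw}"
)
SHORT_SYSTEM_PROMPT = (
    "You are a clinical text cleaning assistant. Rewrite the input into clear, "
    "grammatically correct, medically faithful prose, preserving all facts."
)


def build_messages(raw_content, system_prompt=SHORT_SYSTEM_PROMPT):
    """Return system + user messages following the recommended prompt pattern."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": USER_TEMPLATE.format(raw=raw_content.strip())},
    ]


messages = build_messages("  Steam Inhalation three times a day.  ")
print(messages[1]["content"])
```
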
---

## Example

**Input (user):**

```text
DAILY FLUID INTAKE 1.5LTR/DAY Check CBC, RFT,K+, RBS,CXR, ECG, in follow up Chest physiotherapy, wound hygiene & proper diet. Steam Inhalation three times a day. Syp Alex or TusQDx 2 spoons threee times a day if dry cough is present FBS AND PPBS AFTER 3DAYS IF BLOOD SUGAR IS <80 OR >200mg/dl THEN CONSULT to local physician. WHILE SPLEEPING MUST LIE SUPINE FOR ONE MONTH, CAN REMOVE BELT IN NIGHT ONLY.
```

**Output (assistant):**

> Maintain a daily fluid intake of about 1.5 L per day. At the follow-up visit, get a complete blood count (CBC), renal function tests (RFT), serum potassium (K+), random blood sugar (RBS), chest X-ray (CXR), and electrocardiogram (ECG) done. Continue chest physiotherapy, maintain good wound hygiene, and follow a nutritious diet. Do steam inhalation three times daily. If you have a dry cough, take syrup Alex or TusQDx 2 spoonfuls three times daily. Check fasting and post-meal blood sugar after 3 days; if it is less than 80 mg/dL or more than 200 mg/dL, consult your local physician. While sleeping for the next month, lie on your back (supine), and you may remove the belt only at night.

*(Output is illustrative of the intended style; exact generations may vary.)*

---

## Training Details

- **Base model:** `unsloth/Llama-3.2-1B-Instruct`
- **Method:** Supervised fine-tuning (SFT)
- **Library:** Unsloth + TRL `SFTTrainer`
- **Loss masking:** `train_on_responses_only` (loss computed only on assistant outputs)
- **Context length:** 2048 tokens
- **Objective:** Minimize loss on the cleaned `response` given `content` (with the system and user messages as context)

Because the dataset is **relatively small and noisy**, this model should be considered **experimental** and evaluated carefully before integration into production pipelines.

---

## Suggested Hyperparameters (Inference)

Tune these to your needs; a reasonable starting point is:

- **Temperature:** `0.3–0.7`
- **Top-p:** `0.9`
- **Max new tokens:** `256–512`
- **Repetition penalty:** optional (`1.0–1.1`)

Lower temperatures help keep the model closer to “copy-editor” mode rather than creative rewriting.

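These settings can be collected into keyword arguments for `model.generate`, for example `model.generate(**inputs, **generation_kwargs())`. The sketch below is one possible arrangement: the helper `generation_kwargs`, its defaults, and its range checks are assumptions chosen from the ranges above. The dictionary keys are standard Transformers generation parameters, and `do_sample=True` is included because temperature and top-p are otherwise ignored.

```python
def generation_kwargs(temperature=0.3, top_p=0.9, max_new_tokens=512, repetition_penalty=1.05):
    """Build keyword arguments for `model.generate` from the suggested ranges."""
    if not 0.0 < temperature <= 1.0:
        raise ValueError("temperature should be in (0, 1] for copy-editing use")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        "do_sample": True,  # required for temperature / top_p to have any effect
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
    }


kwargs = generation_kwargs()
print(kwargs)
```
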
---

## Limitations & Risks

- **Not a medical device**:
  This model does *not* replace clinical judgment, and should not be used for diagnosis, triage, or direct treatment recommendations.

- **Data coverage**:
  Trained on a specific institution’s style and a limited range of note types. May not generalize to all clinical documentation formats.

- **Factual hallucination**:
  While the training objective and system prompt encourage fidelity to the source, the model can still drop or rephrase details incorrectly. **A human clinician must review outputs.**

- **Biases**:
  Any biases or idiosyncrasies present in the source data may be reflected in the cleaned text.

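As a lightweight aid to that review (not a substitute for it), a pipeline could flag outputs in which a numeric value from the input has disappeared. The sketch below is a hypothetical check, not something shipped with the model. `missing_numbers` only compares digit sequences, so it will miss quantities written as words (such as "ONE MONTH") and cannot tell whether a number was attached to the wrong fact.

```python
import re

# Matches integers and simple decimals such as "3", "200", or "1.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def missing_numbers(raw, cleaned):
    """Return numeric values that appear in `raw` but not in `cleaned`."""
    cleaned_numbers = set(NUMBER.findall(cleaned))
    return sorted(set(NUMBER.findall(raw)) - cleaned_numbers)


raw = "FBS AND PPBS AFTER 3DAYS IF BLOOD SUGAR IS <80 OR >200mg/dl THEN CONSULT"
good = ("Check fasting and post-meal blood sugar after 3 days; if it is less than "
        "80 mg/dL or more than 200 mg/dL, consult your physician.")
bad = "Check blood sugar after 3 days; if it is more than 200 mg/dL, consult your physician."

print(missing_numbers(raw, good))  # []
print(missing_numbers(raw, bad))   # ['80']
```
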
---

## License

The `license: other` field is a placeholder.
Before public or commercial use, **please set the correct license** for this model and its training data in accordance with your institutional and legal requirements.

---

## Acknowledgments

- [Unsloth](https://github.com/unslothai/unsloth) for efficient finetuning on small hardware.
- The maintainers of Llama 3.x models for providing strong base models.
- Clinical staff and data teams whose documentation (anonymized and processed) made it possible to experiment with this type of text-cleaning model.

---

## Disclaimer

This model is intended **solely** for research and internal tooling around **text normalization** of clinical documentation. It must **not** be used to:

- make or suggest diagnoses,
- decide treatment options,
- communicate directly with patients, or
- operate without human clinical oversight.

Always have qualified healthcare professionals review and validate outputs before any clinical use.