---
license: mit
base_model:
- thomas-sounack/BioClinical-ModernBERT-base
tags:
- sentence-transformers
- sentence-similarity
- medical
- clinical
- biomedical
- pubmed
- healthcare
- medical-ai
- clinical-nlp
- bioinformatics
- medical-literature
- clinical-text
---
# Clinical ModernBERT Embedding Model

A specialized medical embedding model fine-tuned from BioClinical ModernBERT (`thomas-sounack/BioClinical-ModernBERT-base`) using InfoNCE contrastive learning on PubMed title-abstract pairs.

## Model Details

- **Base Model**: thomas-sounack/BioClinical-ModernBERT-base
- **Training Method**: InfoNCE contrastive learning
- **Training Data**: PubMed title-abstract pairs
- **Max Sequence Length**: 2048 tokens
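
The InfoNCE objective named above can be sketched in a few lines. This is a minimal illustration, not the training code used for this model. It assumes in-batch negatives (each title is paired with its own abstract, and every other abstract in the batch acts as a negative) and a hypothetical temperature of 0.05; neither detail is stated in this card.

```python
import math

def info_nce_loss(similarities, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    similarities[i][j] is the similarity between title i and abstract j,
    so the diagonal holds the positive pairs. The temperature value is an
    assumption for illustration, not a documented hyperparameter.
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Cross-entropy with the matching abstract as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy similarity matrices for a batch of three title-abstract pairs.
aligned = [
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
confused = [[0.3, 0.3, 0.3] for _ in range(3)]

aligned_loss = info_nce_loss(aligned)
confused_loss = info_nce_loss(confused)
print(aligned_loss < confused_loss)
```

The loss is small when each title is most similar to its own abstract and approaches `log(batch_size)` when the model cannot tell the candidates apart.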

## Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

# Encode medical texts
texts = [
    "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
    "Inflammatory cytokines in RA lead to progressive cartilage and bone destruction."
]
embeddings = model.encode(texts)
```
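
Embeddings like these are typically compared with cosine similarity. The following is a minimal sketch of that computation. It uses short hand-written vectors instead of real model output so that it runs without downloading the model, and the helper name `cosine_similarity` is illustrative.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Stand-in vectors; in practice these would be rows of `embeddings`.
a = [0.2, 0.7, 0.1]
b = [0.25, 0.6, 0.2]   # points in nearly the same direction as a
c = [-0.7, 0.1, 0.9]   # points in a very different direction

print(cosine_similarity(a, b) > cosine_similarity(a, c))
```

With the `embeddings` array from the usage snippet, `cosine_similarity(embeddings[0], embeddings[1])` would score the two rheumatoid arthritis sentences against each other.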

## Applications

- Medical document similarity analysis
- Clinical text retrieval systems
- Biomedical literature search
- Medical concept matching and classification

## Model Comparison

Compared to `NeuML/bioclinical-modernbert-base-embeddings`, this model scores higher on every retrieval metric reported below and assigns a lower similarity to an unrelated medical/programming text pair.

### Comprehensive Evaluation Results

| Metric | Our Model | NeuML Model | Relative Improvement |
|--------|-----------|-------------|----------------------|
| **Accuracy@1** | 91.28% | 85.86% | +6.3% |
| **Accuracy@3** | 98.46% | 95.66% | +2.9% |
| **Accuracy@5** | 99.24% | 97.14% | +2.2% |
| **Accuracy@10** | 99.64% | 98.29% | +1.4% |
| **NDCG@5** | 95.96% | 92.37% | +3.9% |
| **NDCG@10** | 96.10% | 92.75% | +3.6% |
| **MRR@10** | 94.89% | 90.90% | +4.4% |
| **MAP@100** | 94.91% | 90.96% | +4.3% |

*Evaluation performed using `InformationRetrievalEvaluator` from sentence-transformers on the `gamino/wiki_medical_terms` dataset.*
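
For readers unfamiliar with the metrics in the table, the sketch below shows how Accuracy@k and MRR@k are commonly computed from ranked retrieval results. It is a simplified illustration with made-up query and document IDs, not the `InformationRetrievalEvaluator` implementation, and it assumes each query comes with a set of relevant document IDs.

```python
from statistics import mean

def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical results: query -> (ranked document IDs, relevant document IDs).
results = {
    "q1": (["d3", "d1", "d7"], {"d3"}),  # relevant document at rank 1
    "q2": (["d5", "d2", "d9"], {"d2"}),  # relevant document at rank 2
    "q3": (["d4", "d8", "d6"], {"d0"}),  # relevant document not retrieved
}

acc_at_1 = mean(accuracy_at_k(r, rel, 1) for r, rel in results.values())
acc_at_3 = mean(accuracy_at_k(r, rel, 3) for r, rel in results.values())
mrr_at_3 = mean(reciprocal_rank_at_k(r, rel, 3) for r, rel in results.values())

print(f"Accuracy@1={acc_at_1:.3f} Accuracy@3={acc_at_3:.3f} MRR@3={mrr_at_3:.3f}")
# prints: Accuracy@1=0.333 Accuracy@3=0.667 MRR@3=0.500
```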

### Medical Text Similarity

**Example 1: Related Medical Concepts**
```python
text1 = "Hypertension increases the risk of stroke and heart attack."
text2 = "High blood pressure damages arterial walls over time, leading to cardiovascular events."

# Cosine Similarity Results:
# Our Model: 0.5941 (59.4%)
# NeuML Model: 0.5267 (52.7%)
# Relative improvement: +12.8% (0.5941 vs. 0.5267)
```

### Non-Medical Text Discrimination

**Example 2: Medical vs. Programming Terms**
```python
texts = ["diabetes type 2", "asyncio.run()"]

# Cosine Similarity Results:
# Our Model: 0.0804 (8.0%) - Correctly identifies low similarity
# NeuML Model: 0.1926 (19.3%) - Higher similarity for unrelated texts
# Better Discrimination: 58% lower similarity score for the unrelated pair
```
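
The percentages quoted for the two examples are relative differences between the cosine scores. The arithmetic can be reproduced from the rounded scores shown above, which come out near 12.8% and 58%. `relative_change` is an illustrative helper name.

```python
def relative_change(new, reference):
    """Relative change of `new` with respect to `reference`, as a fraction."""
    return (new - reference) / reference

# Example 1: related medical sentences, where a higher score is better.
gain = relative_change(0.5941, 0.5267)

# Example 2: unrelated medical vs. programming terms, where lower is better,
# so the sign is flipped to report a reduction.
reduction = -relative_change(0.0804, 0.1926)

print(f"Example 1: {gain:+.1%}")
print(f"Example 2: {reduction:.0%} lower")
```

Each figure describes a single text pair, so it is best read as an illustration, not as a benchmark result.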

### Key Advantages

- **Enhanced Medical Understanding**: About 13% higher cosine similarity for the related medical sentence pair in Example 1
- **Improved Discrimination**: 58% lower cosine similarity for the unrelated medical/programming pair in Example 2
- **Domain Specialization**: Fine-tuned on PubMed title-abstract pairs for medical text

## Training Details

- **Optimizer**: AdamW (learning rate: 3e-4, weight decay: 0.1)
- **Batch Size**: 72
- **Training Steps**: 7,000
- **Warmup Steps**: 700
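
The warmup setting can be read as follows: the learning rate ramps up over the first 700 of the 7,000 steps until it reaches the peak value of 3e-4. The sketch below assumes a linear ramp. Because this card does not say what happens after warmup, it simply holds the rate constant afterwards; the actual schedule may well have decayed it.

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 700
TOTAL_STEPS = 7_000

def learning_rate(step):
    """Learning rate at a 1-indexed optimizer step, assuming linear warmup."""
    if not 1 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be in 1..{TOTAL_STEPS}")
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: constant after warmup (the post-warmup schedule is not documented).
    return PEAK_LR

for step in (1, 350, 700, 7_000):
    print(step, f"{learning_rate(step):.2e}")
```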

## Citation

If you use this model, please cite the base model paper and acknowledge this fine-tuning work.