Initialize project; model provided by the ModelHub XC community
Model: lokeshch19/ModernPubMedBERT Source: Original Platform
---
license: mit
base_model:
- thomas-sounack/BioClinical-ModernBERT-base
tags:
- sentence-transformers
- sentence-similarity
- medical
- clinical
- biomedical
- pubmed
- healthcare
- medical-ai
- clinical-nlp
- bioinformatics
- medical-literature
- clinical-text
---

# Clinical ModernBERT Embedding Model

A specialized medical embedding model fine-tuned from BioClinical ModernBERT (`thomas-sounack/BioClinical-ModernBERT-base`) using InfoNCE contrastive learning on PubMed title-abstract pairs.

## Model Details

- **Base Model**: thomas-sounack/BioClinical-ModernBERT-base
- **Training Method**: InfoNCE contrastive learning
- **Training Data**: PubMed title-abstract pairs
- **Max Sequence Length**: 2048 tokens
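
To make the training objective concrete, here is a minimal pure-Python sketch of an InfoNCE loss over title-abstract pairs. It is an illustration, not the training code used for this model: the use of in-batch negatives, the temperature of 0.05, and the names `normalize` and `info_nce_loss` are all assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(titles, abstracts, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed setup).

    titles[i] and abstracts[i] form a positive pair; every other
    abstract in the batch acts as a negative for titles[i].
    """
    titles = [normalize(t) for t in titles]
    abstracts = [normalize(a) for a in abstracts]
    losses = []
    for i, title in enumerate(titles):
        # Similarity of this title to every abstract, sharpened by the temperature.
        logits = [sum(x * y for x, y in zip(title, a)) / temperature for a in abstracts]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching abstract as the target.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings (hypothetical values).
titles = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each abstract close to its own title
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each abstract close to the wrong title

print(info_nce_loss(titles, aligned))
print(info_nce_loss(titles, swapped))
```

The loss is small when each title is most similar to its own abstract and large when the pairs are mismatched, which is the signal that pulls matching pairs together during fine-tuning.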

## Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

# Encode medical texts
texts = [
    "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
    "Inflammatory cytokines in RA lead to progressive cartilage and bone destruction."
]
embeddings = model.encode(texts)
```
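
Embeddings are typically compared with cosine similarity. The sketch below shows that comparison and a simple ranking step in pure Python, using small hypothetical vectors in place of real model output; the helper names `cosine_similarity` and `rank_by_similarity` are illustrative. In practice, sentence-transformers provides `util.cos_sim` for the same computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical 2-dimensional stand-ins for rows of an embedding matrix.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

print(rank_by_similarity(query, candidates))
```

With real embeddings, `query` and each candidate would be rows returned by `model.encode`.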

## Applications

- Medical document similarity analysis
- Clinical text retrieval systems
- Biomedical literature search
- Medical concept matching and classification

## Model Comparison

Compared to `NeuML/bioclinical-modernbert-base-embeddings`, this model scores higher on every retrieval metric reported below. In the examples that follow, it also assigns higher similarity to related medical texts and lower similarity to unrelated non-medical text.

### Comprehensive Evaluation Results

| Metric | Our Model | NeuML Model | Relative Improvement |
|--------|-----------|-------------|----------------------|
| **Accuracy@1** | 91.28% | 85.86% | +6.3% |
| **Accuracy@3** | 98.46% | 95.66% | +2.9% |
| **Accuracy@5** | 99.24% | 97.14% | +2.2% |
| **Accuracy@10** | 99.64% | 98.29% | +1.4% |
| **NDCG@5** | 95.96% | 92.37% | +3.9% |
| **NDCG@10** | 96.10% | 92.75% | +3.6% |
| **MRR@10** | 94.89% | 90.90% | +4.4% |
| **MAP@100** | 94.91% | 90.96% | +4.3% |

*Evaluation performed using `InformationRetrievalEvaluator` from sentence-transformers on the `gamino/wiki_medical_terms` dataset. The last column is the relative change over the NeuML score, not a difference in percentage points.*
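
For readers unfamiliar with the metrics, the following sketch shows how Accuracy@k and MRR@k are commonly defined. It is a simplified re-implementation for illustration, not the evaluator's own code, and the document IDs and rankings are made up.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc in relevant for doc in ranked[:k])
    )
    return hits / len(ranked_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids)


# Hypothetical rankings for three queries and their relevant documents.
ranked = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d9", "d8", "d7"]]
relevant = [{"d1"}, {"d4"}, {"d0"}]

print(accuracy_at_k(ranked, relevant, k=1))
print(accuracy_at_k(ranked, relevant, k=3))
print(mrr_at_k(ranked, relevant, k=3))
```

In this toy data the first query hits at rank 1, the second at rank 2, and the third misses, so Accuracy@3 counts two of three queries and MRR@3 averages 1, 1/2, and 0.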

### Medical Text Similarity

**Example 1: Related Medical Concepts**
```python
text1 = "Hypertension increases the risk of stroke and heart attack."
text2 = "High blood pressure damages arterial walls over time, leading to cardiovascular events."

# Cosine Similarity Results:
# Our Model: 0.5941 (59.4%)
# NeuML Model: 0.5267 (52.7%)
# Improvement: +12.7%
```

### Non-Medical Text Discrimination

**Example 2: Medical vs. Programming Terms**
```python
texts = ["diabetes type 2", "asyncio.run()"]

# Cosine Similarity Results:
# Our Model: 0.0804 (8.0%) - Correctly identifies low similarity
# NeuML Model: 0.1926 (19.3%) - Higher spurious similarity
# Better Discrimination: 58% lower similarity score for this unrelated pair
```
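
The percentage figures in the evaluation table and in Example 2 are relative changes against the NeuML score. A short check of that arithmetic, using the reported Accuracy@1 values and the Example 2 similarities (the helper name `relative_change` is illustrative):

```python
def relative_change(baseline, value):
    """Change of `value` relative to `baseline`, as a fraction."""
    return (value - baseline) / baseline


# Accuracy@1: NeuML 85.86% vs. this model 91.28%.
accuracy_gain = relative_change(85.86, 91.28)

# Example 2 similarity: NeuML 0.1926 vs. this model 0.0804.
similarity_drop = relative_change(0.1926, 0.0804)

print(f"{accuracy_gain:+.1%}")    # +6.3%
print(f"{similarity_drop:+.1%}")  # -58.3%
```

Both examples compare a single text pair, so these figures describe those pairs rather than a rate measured over a dataset.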

### Key Advantages

- **Enhanced Medical Understanding**: 12.7% higher similarity score for the related medical sentence pair in Example 1
- **Improved Discrimination**: 58% lower similarity score for the unrelated medical and programming terms in Example 2
- **Domain Specialization**: Fine-tuned specifically on PubMed title-abstract pairs for medical text processing

## Training Details

- **Optimizer**: AdamW (learning rate: 3e-4, weight decay: 0.1)
- **Batch Size**: 72
- **Training Steps**: 7,000
- **Warmup Steps**: 700
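
As a rough illustration of how the warmup setting interacts with the learning rate, the sketch below ramps linearly to the peak rate over the 700 warmup steps. The shape of the schedule is an assumption: the model card states only the peak learning rate, the step count, and the warmup length, so both the linear warmup and the linear decay to zero afterwards are guesses.

```python
PEAK_LR = 3e-4        # from the model card
WARMUP_STEPS = 700    # from the model card
TOTAL_STEPS = 7_000   # from the model card


def learning_rate(step):
    """Learning rate at a given step under an assumed linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Assumed: ramp linearly from 0 to the peak during warmup.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed: decay linearly from the peak to 0 over the remaining steps.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 350, 700, 3_850, 7_000):
    print(step, learning_rate(step))
```

Under these assumptions, warmup covers the first 10% of training.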

## Citation

If you use this model, please cite the base model paper and acknowledge this fine-tuning work.