Initialize project; model provided by the ModelHub XC community

Model: lokeshch19/ModernPubMedBERT Source: Original Platform
2026-05-14 14:35:00 +08:00
commit 3d9a6438bb
10 changed files with 252702 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,116 @@
+---
+license: mit
+base_model:
+- thomas-sounack/BioClinical-ModernBERT-base
+tags:
+- sentence-transformers
+- sentence-similarity
+- medical
+- clinical
+- biomedical
+- pubmed
+- healthcare
+- medical-ai
+- clinical-nlp
+- bioinformatics
+- medical-literature
+- clinical-text
+---
+# Clinical ModernBERT Embedding Model
+
+A specialized medical embedding model fine-tuned from BioClinical ModernBERT (`thomas-sounack/BioClinical-ModernBERT-base`) using InfoNCE contrastive learning on PubMed title-abstract pairs.
+
+## Model Details
+
+- **Base Model**: thomas-sounack/BioClinical-ModernBERT-base
+- **Training Method**: InfoNCE contrastive learning
+- **Training Data**: PubMed title-abstract pairs
+- **Max Sequence Length**: 2048 tokens
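+
+The InfoNCE objective named above can be sketched in plain Python. This is a minimal illustration, not the training code used for this model: the function name `info_nce_loss`, the use of in-batch negatives, and the `temperature` value of 0.05 are assumptions that the list above does not specify. Each title is scored against every abstract in the batch, and the loss is the cross-entropy of picking its own abstract.
+
+```python
+import math
+
+
+def info_nce_loss(anchors, positives, temperature=0.05):
+    """In-batch InfoNCE: anchors[i] and positives[i] form a matching pair.
+
+    `temperature` is an assumed value, not one reported for this model.
+    """
+    def normalize(vec):
+        norm = math.sqrt(sum(x * x for x in vec))
+        return [x / norm for x in vec]
+
+    anchors = [normalize(v) for v in anchors]
+    positives = [normalize(v) for v in positives]
+
+    total = 0.0
+    for i, anchor in enumerate(anchors):
+        # Cosine similarity of anchor i to every positive, scaled by temperature.
+        logits = [
+            sum(x * y for x, y in zip(anchor, pos)) / temperature
+            for pos in positives
+        ]
+        # Numerically stable log-sum-exp over the whole batch.
+        peak = max(logits)
+        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
+        # Negative log-probability of the true pair (index i).
+        total += log_denominator - logits[i]
+    return total / len(anchors)
+```
+
+With perfectly aligned, mutually orthogonal pairs the loss approaches zero, and it grows as matching pairs become harder to tell apart from the other abstracts in the batch.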
+
+## Usage
+
+```python
+from sentence_transformers import SentenceTransformer
+
+# Load the model
+model = SentenceTransformer("lokeshch19/ModernPubMedBERT")
+
+# Encode medical texts
+texts = [
+    "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
+    "Inflammatory cytokines in RA lead to progressive cartilage and bone destruction."
+]
+embeddings = model.encode(texts)
+```
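+
+The embeddings are usually compared with cosine similarity. The sketch below shows that computation in plain Python; `cosine_similarity` and `compare_texts` are hypothetical helper names, and `compare_texts` only assumes that `model` exposes an `encode` method that takes a list of strings, as the `SentenceTransformer` object above does.
+
+```python
+import math
+
+
+def cosine_similarity(u, v):
+    """Cosine of the angle between two embedding vectors."""
+    dot = sum(float(x) * float(y) for x, y in zip(u, v))
+    norm_u = math.sqrt(sum(float(x) ** 2 for x in u))
+    norm_v = math.sqrt(sum(float(y) ** 2 for y in v))
+    return dot / (norm_u * norm_v)
+
+
+def compare_texts(model, text_a, text_b):
+    """Embed two texts with `model.encode` and return their cosine similarity."""
+    emb_a, emb_b = model.encode([text_a, text_b])
+    return cosine_similarity(emb_a, emb_b)
+```
+
+With `model` and `texts` defined as above, `compare_texts(model, texts[0], texts[1])` returns a single score between -1 and 1. The `util.cos_sim` function in sentence-transformers performs the same computation on whole batches.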
+
+## Applications
+
+- Medical document similarity analysis
+- Clinical text retrieval systems
+- Biomedical literature search
+- Medical concept matching and classification
+
+## Model Comparison
+
+Compared to `NeuML/bioclinical-modernbert-base-embeddings`, this model scores higher on every retrieval metric in the table below and, in the examples that follow, assigns higher similarity to related medical concepts and lower similarity to unrelated non-medical text.
+
+### Comprehensive Evaluation Results
+
+| Metric | Our Model | NeuML Model | Relative Improvement |
+|--------|-----------|-------------|-------------|
+| **Accuracy@1** | 91.28% | 85.86% | +6.3% |
+| **Accuracy@3** | 98.46% | 95.66% | +2.9% |
+| **Accuracy@5** | 99.24% | 97.14% | +2.2% |
+| **Accuracy@10** | 99.64% | 98.29% | +1.4% |
+| **NDCG@5** | 95.96% | 92.37% | +3.9% |
+| **NDCG@10** | 96.10% | 92.75% | +3.6% |
+| **MRR@10** | 94.89% | 90.90% | +4.4% |
+| **MAP@100** | 94.91% | 90.96% | +4.3% |
+
+*Evaluation performed using `InformationRetrievalEvaluator` from sentence-transformers on the `gamino/wiki_medical_terms` dataset.*
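+
+For readers unfamiliar with these metrics, the sketch below shows how Accuracy@k and MRR@k are commonly defined, plus the arithmetic that appears to underlie the last column of the table (the listed gains match a relative change, not a difference in percentage points). The function names and the dictionary layout are illustrative assumptions, not the evaluator's internal code.
+
+```python
+def accuracy_at_k(ranked_ids, relevant_ids, k):
+    """Fraction of queries with at least one relevant document in the top k."""
+    hits = sum(
+        1
+        for qid, ranking in ranked_ids.items()
+        if any(doc in relevant_ids[qid] for doc in ranking[:k])
+    )
+    return hits / len(ranked_ids)
+
+
+def mrr_at_k(ranked_ids, relevant_ids, k):
+    """Mean reciprocal rank of the first relevant document within the top k."""
+    total = 0.0
+    for qid, ranking in ranked_ids.items():
+        for rank, doc in enumerate(ranking[:k], start=1):
+            if doc in relevant_ids[qid]:
+                total += 1.0 / rank
+                break
+    return total / len(ranked_ids)
+
+
+def relative_improvement(ours, baseline):
+    """Percentage gain of `ours` over `baseline`."""
+    return (ours - baseline) / baseline * 100.0
+
+
+# Accuracy@1 row of the table: 91.28% versus 85.86%.
+print(round(relative_improvement(91.28, 85.86), 1))  # 6.3
+```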
+
+### Medical Text Similarity
+
+**Example 1: Related Medical Concepts**
+```python
+text1 = "Hypertension increases the risk of stroke and heart attack."
+text2 = "High blood pressure damages arterial walls over time, leading to cardiovascular events."
+
+# Cosine Similarity Results:
+# Our Model: 0.5941 (59.4%)
+# NeuML Model: 0.5267 (52.7%)
+# Improvement: +12.7%
+```
+
+### Non-Medical Text Discrimination
+
+**Example 2: Medical vs. Programming Terms**
+```python
+texts = ["diabetes type 2", "asyncio.run()"]
+
+# Cosine Similarity Results:
+# Our Model: 0.0804 (8.0%) - Correctly identifies low similarity
+# NeuML Model: 0.1926 (19.3%) - Higher false similarity
+# Better discrimination: 58% lower similarity score than NeuML on this unrelated pair
+```
+
+### Key Advantages
+
+- **Enhanced Medical Understanding**: 12.7% higher cosine similarity on the related medical concept pair in Example 1
+- **Improved Discrimination**: 58% lower cosine similarity on the unrelated medical/programming pair in Example 2
+- **Domain Specialization**: Fine-tuned specifically on PubMed literature for optimal medical text processing
+
+## Training Details
+
+- **Optimizer**: AdamW (learning rate: 3e-4, weight decay: 0.1)
+- **Batch Size**: 72
+- **Training Steps**: 7,000
+- **Warmup Steps**: 700
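+
+As a rough illustration of how the warmup setting interacts with the learning rate, the sketch below ramps linearly to the peak rate over the first 700 steps. The linear ramp and the constant rate afterwards are assumptions: the list above gives the step counts and peak rate but not the schedule shape, and `learning_rate` is a hypothetical helper.
+
+```python
+def learning_rate(step, peak_lr=3e-4, warmup_steps=700):
+    """Linear warmup to `peak_lr`, then constant.
+
+    Both the linear ramp and the constant tail are assumed shapes.
+    """
+    if step < warmup_steps:
+        return peak_lr * (step + 1) / warmup_steps
+    return peak_lr
+```
+
+Under these assumptions, warmup covers the first 10% of the 7,000 training steps.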
+
+## Citation
+
+If you use this model, please cite the base model paper and acknowledge this fine-tuning work.