Initialize project; model provided by the ModelHub XC community

Model: malteos/scincl Source: Original Platform
2026-05-13 18:57:29 +08:00
commit 91a894cfc9
14 changed files with 31144 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,28 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text 
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text
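As a rough illustration of what these patterns do, the sketch below checks which files from this commit would be routed through Git LFS. It approximates gitattributes matching with Python's `fnmatch` on the file's base name, which is an assumption: real matching follows gitignore-style rules (relevant for entries such as `saved_model/**/*`), so only a subset of the patterns is used. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns above that can be matched on the base name alone.
LFS_PATTERNS = ["*.bin", "*.gz", "*.h5", "*.onnx", "*.pt", "*tfevents*", "model.safetensors"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the base name match any of the LFS patterns?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


files = ["model.safetensors", "pytorch_model.bin", "train_triples.csv.gz", "config.json", "vocab.txt"]
tracked = {f: is_lfs_tracked(f) for f in files}
for name, flag in tracked.items():
    print(f"{name}: {'LFS pointer' if flag else 'regular git blob'}")
```

Under this approximation the weight files and the gzipped training data are stored as LFS pointers, while the small JSON and text files are committed directly.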
--- a/1_Pooling/config.json
+++ b/1_Pooling/config.json
@@ -0,0 +1,10 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": true,
+  "pooling_mode_mean_tokens": false,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
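The pooling config above enables only `pooling_mode_cls_token`, so the sentence embedding is the hidden state of the first token. A minimal pure-Python sketch of that choice follows. It is a simplified reading of the flags, not the sentence-transformers implementation: it ignores padding masks and handles only the CLS and mean modes, and the function name `pool` is hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def pool(token_embeddings: List[Vector], config: Dict[str, bool]) -> Vector:
    """Reduce per-token vectors to a single vector according to the pooling flags."""
    if config.get("pooling_mode_cls_token"):
        # CLS pooling: keep only the vector of the first token.
        return list(token_embeddings[0])
    if config.get("pooling_mode_mean_tokens"):
        # Mean pooling: average every dimension over all tokens.
        n = len(token_embeddings)
        dim = len(token_embeddings[0])
        return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]
    raise ValueError("no supported pooling mode enabled")


# Three toy token vectors standing in for 768-dimensional hidden states.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cls_vec = pool(tokens, {"pooling_mode_cls_token": True, "pooling_mode_mean_tokens": False})
mean_vec = pool(tokens, {"pooling_mode_cls_token": False, "pooling_mode_mean_tokens": True})
print(cls_vec)   # [1.0, 2.0]
print(mean_vec)  # [3.0, 4.0]
```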
--- a/README.md
+++ b/README.md
@@ -0,0 +1,121 @@
+---
+tags:
+- feature-extraction
+- sentence-transformers
+- transformers
+library_name: sentence-transformers
+language: en
+datasets:
+- SciDocs
+- s2orc
+metrics:
+- F1
+- accuracy
+- map
+- ndcg
+license: mit
+---
+
+## SciNCL
+
+SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers.
+It uses the citation graph neighborhood to generate samples for contrastive learning.
+Prior to the contrastive training, the model is initialized with weights from [scibert-scivocab-uncased](https://huggingface.co/allenai/scibert_scivocab_uncased).
+The underlying citation embeddings are trained on the [S2ORC citation graph](https://github.com/allenai/s2orc).
+
+Paper: [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)](https://arxiv.org/abs/2202.06671).
+
+Code: https://github.com/malteos/scincl
+
+PubMedNCL: Working with biomedical papers? Try [PubMedNCL](https://huggingface.co/malteos/PubMedNCL).
+
+## How to use the pretrained model
+
+### Sentence Transformers
+
+```python
+from sentence_transformers import SentenceTransformer
+
+# Load the model
+model = SentenceTransformer("malteos/scincl")
+
+# Concatenate the title and abstract with the [SEP] token
+papers = [
+    "BERT [SEP] We introduce a new language representation model called BERT",
+    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
+]
+# Inference
+embeddings = model.encode(papers)
+
+# Compute the (cosine) similarity between embeddings
+similarity = model.similarity(embeddings[0], embeddings[1])
+print(similarity.item())
+# => 0.8440517783164978
+```
+
+### Transformers
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+# load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
+model = AutoModel.from_pretrained('malteos/scincl')
+
+papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
+          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
+
+# concatenate title and abstract with [SEP] token
+title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
+
+# preprocess the input
+inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
+
+# inference
+result = model(**inputs)
+
+# take the first token ([CLS] token) in the batch as the embedding
+embeddings = result.last_hidden_state[:, 0, :]
+
+# calculate the similarity
+embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
+similarity = (embeddings[0] @ embeddings[1].T)
+print(similarity.item())
+# => 0.8440518379211426
+```
+
+## Triplet Mining Parameters
+
+| **Setting**             | **Value**          |
+|-------------------------|--------------------|
+| seed                    | 4                  |
+| triples_per_query       | 5                  |
+| easy_positives_count    | 5                  |
+| easy_positives_strategy | 5                  |
+| easy_positives_k        | 20-25              |
+| easy_negatives_count    | 3                  |
+| easy_negatives_strategy | random_without_knn |
+| hard_negatives_count    | 2                  |
+| hard_negatives_strategy | knn                |
+| hard_negatives_k        | 3998-4000          |
+
+## SciDocs Results
+
+These model weights are the ones that yielded the best results on SciDocs (`seed=4`).
+In the paper we report the SciDocs results as mean over ten seeds.
+
+| **model**         | **mag-f1** | **mesh-f1** | **co-view-map** | **co-view-ndcg** | **co-read-map** | **co-read-ndcg** | **cite-map** | **cite-ndcg** | **cocite-map** | **cocite-ndcg** | **recomm-ndcg** | **recomm-P@1** | **Avg** |
+|-------------------|-----------:|------------:|----------------:|-----------------:|----------------:|-----------------:|-------------:|--------------:|---------------:|----------------:|----------------:|---------------:|--------:|
+| Doc2Vec           |       66.2 |        69.2 |            67.8 |             82.9 |            64.9 |             81.6 |         65.3 |          82.2 |           67.1 |            83.4 |            51.7 |           16.9 |    66.6 |
+| fasttext-sum      |       78.1 |        84.1 |            76.5 |             87.9 |            75.3 |             87.4 |         74.6 |          88.1 |           77.8 |            89.6 |            52.5 |             18 |    74.1 |
+| SGC               |       76.8 |        82.7 |            77.2 |               88 |            75.7 |             87.5 |         91.6 |          96.2 |           84.1 |            92.5 |            52.7 |           18.2 |    76.9 |
+| SciBERT           |       79.7 |        80.7 |            50.7 |             73.1 |            47.7 |             71.1 |         48.3 |          71.7 |           49.7 |            72.6 |            52.1 |           17.9 |    59.6 |
+| SPECTER           |         82 |        86.4 |            83.6 |             91.5 |            84.5 |             92.4 |         88.3 |          94.9 |           88.1 |            94.8 |            53.9 |             20 |      80 |
+| SciNCL (10 seeds) |       81.4 |        88.7 |            85.3 |             92.3 |            87.5 |             93.9 |         93.6 |          97.3 |           91.6 |            96.4 |            53.9 |           19.3 |    81.8 |
+| **SciNCL (seed=4)**   |       81.2 |        89.0 |            85.3 |             92.2 |            87.7 |             94.0 |         93.6 |          97.4 |           91.7 |            96.5 |            54.3 |           19.6 |    81.9 |
+
+Additional evaluations are available in the paper.
+
+## License
+
+MIT
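The `Avg` column in the SciDocs table appears to be the unweighted mean of the twelve metric columns. The quick arithmetic check below uses the two SciNCL rows, with values copied from the table; the variable names are illustrative only.

```python
# Metric values in table order: mag-f1 ... recomm-P@1 (12 columns, Avg excluded).
rows = {
    "SciNCL (10 seeds)": [81.4, 88.7, 85.3, 92.3, 87.5, 93.9, 93.6, 97.3, 91.6, 96.4, 53.9, 19.3],
    "SciNCL (seed=4)": [81.2, 89.0, 85.3, 92.2, 87.7, 94.0, 93.6, 97.4, 91.7, 96.5, 54.3, 19.6],
}

# Unweighted mean per row, rounded to one decimal like the table.
averages = {name: round(sum(values) / len(values), 1) for name, values in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

Both results (81.8 and 81.9) agree with the reported `Avg` values.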
--- a/config.json
+++ b/config.json
@@ -0,0 +1,24 @@
+{
+  "_name_or_path": "malteos/scincl",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.5.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 31090
+}
--- a/config_sentence_transformers.json
+++ b/config_sentence_transformers.json
@@ -0,0 +1,10 @@
+{
+  "__version__": {
+    "sentence_transformers": "3.0.0",
+    "transformers": "4.41.2",
+    "pytorch": "2.3.0+cu121"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d65da219f67f051fdd065bfec666285c136e23002787603b2565e467bfed3c68
+size 439700404
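As a sanity check on the pointer size, the parameter count can be estimated from `config.json`. The sketch below assumes the standard BERT layout (embeddings with LayerNorm, 12 encoder layers, and a pooler) and float32 storage; neither assumption is stated in this commit.

```python
# Values copied from config.json in this commit.
vocab_size, hidden, layers = 31090, 768, 12
intermediate, max_positions, type_vocab = 3072, 512, 2


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
pooler = linear(hidden, hidden)

total_params = embeddings + layers * (attention + feed_forward) + pooler
estimated_bytes = total_params * 4  # assumption: 4 bytes per float32 parameter
lfs_size = 439700404                # size field of the model.safetensors pointer

print(total_params)                 # 109918464
print(estimated_bytes)              # 439673856
print(lfs_size - estimated_bytes)   # 26548
```

The estimate lands within about 26 KB of the recorded size. That gap is plausibly the safetensors header, though this is not verified here.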
--- a/modules.json
+++ b/modules.json
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:34bf9b9761e253927a6533218fbf41b7ebe06d4100e61da83f877af56e113299
+size 439758015
--- a/sentence_bert_config.json
+++ b/sentence_bert_config.json
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1 @@
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": "special_tokens_map.json", "name_or_path": "malteos/scincl", "do_basic_tokenize": true, "never_split": null}
--- a/train_metadata.jsonl.gz
+++ b/train_metadata.jsonl.gz
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f7622ef8de1acc2492b4f1107146142c9274bbb59aa082d6baf8354d403ffde1
+size 98323460
--- a/train_triples.csv.gz
+++ b/train_triples.csv.gz
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:706e660fbece1daac40d60dbd097dd45889cc6546d4a2b229f1376e99a36f103
+size 6716319
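The large files in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, the content hash, and the size in bytes. A small sketch of reading one, using the `train_triples.csv.gz` pointer above as input (the function name `parse_lfs_pointer` is hypothetical):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:706e660fbece1daac40d60dbd097dd45889cc6546d4a2b229f1376e99a36f103
size 6716319
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])  # sha256 6716319
```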
--- a/vocab.txt
+++ b/vocab.txt