Initialize project; model provided by the ModelHub XC community

Model: cross-encoder/msmarco-MiniLM-L12-en-de-v1
Source: Original Platform
ModelHub XC
2026-05-13 16:24:06 +08:00
commit 99df2b4f39
21 changed files with 12725 additions and 0 deletions

18
.gitattributes vendored Normal file

@@ -0,0 +1,18 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text

81
README.md Normal file

@@ -0,0 +1,81 @@
---
license: apache-2.0
datasets:
- sentence-transformers/msmarco
language:
- en
- de
base_model:
- microsoft/Multilingual-MiniLM-L12-H384
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- transformers
---
# Cross-Encoder for MS MARCO - EN-DE
This is a cross-lingual Cross-Encoder model for EN-DE that can be used for passage re-ranking. It was trained on the [MS MARCO Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) task.
The model can be used for Information Retrieval: See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
The training code is available in this repository, see `train_script.py`.
## Usage with SentenceTransformers
When you have [SentenceTransformers](https://www.sbert.net/) installed, you can use the model like this:
```python
from sentence_transformers import CrossEncoder
# Model ID of this repository
model = CrossEncoder('cross-encoder/msmarco-MiniLM-L12-en-de-v1', max_length=512)
query = 'How many people live in Berlin?'
docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)
```
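`model.predict` returns one score per (query, document) pair, and a higher score means the document is judged more relevant. A minimal sketch of turning such scores into a ranking follows; the score values are made-up stand-ins for a real `predict` call, not actual model output:

```python
docs = [
    "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "New York City is famous for the Metropolitan Museum of Art.",
]
# Assumed values standing in for `model.predict(pairs)`; not real model output.
scores = [8.1, -4.3]

# Sort documents from most to least relevant.
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:6.2f}  {doc}")
```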
## Usage with Transformers
With the transformers library, you can use the model like this:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Model ID of this repository
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/msmarco-MiniLM-L12-en-de-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/msmarco-MiniLM-L12-en-de-v1')
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
```
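The tokenizer call above takes two parallel lists, so the query is repeated once per document. Because the model config defines a single label, `logits` has shape `(num_pairs, 1)`. The sketch below shows how the inputs line up and how the output maps back to the documents; the nested list is a hypothetical stand-in for `scores.tolist()`, with made-up values:

```python
query = "How many people live in Berlin?"
docs = [
    "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "New York City is famous for the Metropolitan Museum of Art.",
]

# First tokenizer argument: the query, repeated once per document.
queries = [query] * len(docs)

# Assumed stand-in for `scores.tolist()`; one row with a single logit per pair.
logits = [[8.1], [-4.3]]

# Drop the singleton dimension and pick the index of the highest score.
flat = [row[0] for row in logits]
best = max(range(len(docs)), key=flat.__getitem__)
print(docs[best])
```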
## Performance
The performance was evaluated on three datasets:
- **TREC-DL19 EN-EN**: The original [TREC 2019 Deep Learning Track](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html): Given an English query and 1000 documents (retrieved by BM25 lexical search), rank the documents according to their relevance. We compute NDCG@10. BM25 achieves a score of 45.46; a perfect re-ranker can achieve a score of 95.47.
- **TREC-DL19 DE-EN**: The English queries of TREC-DL19 have been translated into German by a native German speaker. We rank the German queries against the English passages from the original TREC-DL19 setup. We compute NDCG@10.
- **GermanDPR DE-DE**: The [GermanDPR](https://www.deepset.ai/germanquad) dataset provides German queries and German passages from Wikipedia. We indexed the 2.8 million paragraphs from German Wikipedia and, for each query, retrieved the 100 most relevant passages using BM25 lexical search with Elasticsearch. We compute MRR@10. BM25 achieves a score of 35.85; a perfect re-ranker can achieve a score of 76.27.
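The two metrics named above can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation code behind the reported numbers: it assumes linear gain with a log2 rank discount for NDCG, and the helper names and toy relevance grades are hypothetical. The reported scores are these values multiplied by 100.

```python
import math

def dcg_at_k(gains, k=10):
    # Discounted cumulative gain: linear gain, log2(rank + 1) discount (ranks start at 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    # ranked_gains: relevance grades in the order the system ranked the passages.
    # all_gains: grades of every judged passage for the query (defines the ideal order).
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_is_relevant, k=10):
    # Reciprocal rank of the first relevant passage within the top k, else 0.
    for rank, hit in enumerate(ranked_is_relevant[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0

# Toy inputs with made-up grades (not taken from TREC-DL19 or GermanDPR).
ndcg = ndcg_at_k([3, 0, 2], [3, 2, 0])
mrr = mrr_at_k([False, True, False])
print(f"NDCG@10: {ndcg:.4f}  MRR@10: {mrr:.4f}")
```

Both functions score a single query; a dataset-level score is the mean over all queries.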
We also evaluate bi-encoders with the same setup: the documents retrieved by BM25 lexical search are re-ranked by the cosine similarity between query and passage embeddings. Bi-encoders can also be used for end-to-end semantic search.
| Model-Name | TREC-DL19 EN-EN | TREC-DL19 DE-EN | GermanDPR DE-DE | Docs / Sec |
| ------------- |:-------------:| :-----: | :---: | :----: |
| BM25 | 45.46 | - | 35.85 | - |
| **Cross-Encoder Re-Rankers** | | | | |
| [cross-encoder/msmarco-MiniLM-L6-en-de-v1](https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1) | 72.43 | 65.53 | 46.77 | 1600 |
| [cross-encoder/msmarco-MiniLM-L12-en-de-v1](https://huggingface.co/cross-encoder/msmarco-MiniLM-L12-en-de-v1) | 72.94 | 66.07 | 49.91 | 900 |
| [svalabs/cross-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/cross-electra-ms-marco-german-uncased) (DE only) | - | - | 53.67 | 260 |
| [deepset/gbert-base-germandpr-reranking](https://huggingface.co/deepset/gbert-base-germandpr-reranking) (DE only) | - | - | 53.59 | 260 |
| **Bi-Encoders (re-ranking)** | | | | |
| [sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned](https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned) | 63.38 | 58.28 | 37.88 | 940 |
| [sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch](https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch) | 65.51 | 58.69 | 38.32 | 940 |
| [svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased) (DE only) | - | - | 34.31 | 450 |
| [deepset/gbert-base-germandpr-question_encoder](https://huggingface.co/deepset/gbert-base-germandpr-question_encoder) (DE only) | - | - | 42.55 | 450 |
Note: Docs / Sec gives the number of (query, document) pairs we can re-rank within a second on a V100 GPU.
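
The bi-encoder re-ranking described above compares one query embedding with each candidate passage embedding by cosine similarity. A minimal sketch, assuming toy 3-dimensional vectors in place of real embeddings (the `cosine` helper, the passage IDs, and all values are hypothetical):

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Assumed toy vectors standing in for bi-encoder embeddings.
query_emb = [1.0, 0.0, 1.0]
passage_embs = {
    "p1": [0.9, 0.1, 0.8],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.5, 0.5, 0.5],
}

# Re-rank the candidate passage IDs by similarity to the query, best first.
reranked = sorted(
    passage_embs,
    key=lambda pid: cosine(query_emb, passage_embs[pid]),
    reverse=True,
)
print(reranked)
```

Unlike a cross-encoder, the passage embeddings here can be computed once and reused for every query.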

32
config.json Executable file

@@ -0,0 +1,32 @@
{
"_name_or_path": "microsoft/Multilingual-MiniLM-L12-H384",
"architectures": [
"BertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"intermediate_size": 1536,
"label2id": {
"LABEL_0": 0
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"tokenizer_class": "XLMRobertaTokenizer",
"transformers_version": "4.6.1",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 250037,
"sbert_ce_default_activation_function": "torch.nn.modules.linear.Identity"
}
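
A few derived quantities can be read off this config. The sketch below copies a subset of the fields and computes the per-head dimension and the size of the token embedding matrix; the variable names are illustrative only:

```python
import json

# Subset of the fields from config.json, copied verbatim.
config = json.loads("""
{
  "hidden_size": 384,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "max_position_embeddings": 512,
  "vocab_size": 250037
}
""")

# Each attention head works on an equal slice of the hidden state.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# One hidden_size-dimensional vector per vocabulary entry.
token_embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim)                # 32
print(token_embedding_params)  # 96014208
```

Assuming 4-byte float32 weights, the token embeddings alone would take about 384 MB, which would account for most of the roughly 470 MB model file.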

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb2408b72dc487ed4496dff7cd5fe96a8847327aa755ee528f2789575e7ee478
size 470644310

3
onnx/model.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6c954f569ae86c63fc2e1d74a93191196b20c1395f2b23df1d5a0a969b0a433f
size 470918906

3
onnx/model_O1.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:84d6aef4a58465a98ea10c554858f04d9e166615dd7ae8d917ecb4414444b0aa
size 470822331

3
onnx/model_O2.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b822742a5bea0f790461ed8b5972ef0542ddb36bb67f6890d21244075a343ef8
size 470654650

3
onnx/model_O3.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:edaa98ec22d796d5ee4ffc96d3c5bfac1c49f5a2c1ec8ceb4789b97bb9746169
size 470654505

3
onnx/model_O4.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b69fd9449036f367b9a2466eb9d691992a960f73251d5f70118c13d7b41a477b
size 235476592


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f80cd82b7093386bd38ea9e732f2fd332d89aac7d362b09eaff4ee036b9bcb67
size 118605660


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f80cd82b7093386bd38ea9e732f2fd332d89aac7d362b09eaff4ee036b9bcb67
size 118605660


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f80cd82b7093386bd38ea9e732f2fd332d89aac7d362b09eaff4ee036b9bcb67
size 118605660


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ca182b68ae82c0b3fd0a3a9bf8c4fe5d52ef89946c3e2175251079e7198fefce
size 118605658


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ac476480052e3480392c544745e99b96e26087dfabd2a217cc1a12d47f372141
size 470620832

12299
openvino/openvino_model.xml Normal file

File diff suppressed because it is too large

3
pytorch_model.bin Executable file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fd9a04351955aa9f56f6b22ca61b14d1a91588ca72ef2769a63e847928b8309f
size 470705929

3
sentencepiece.bpe.model Executable file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051

1
special_tokens_map.json Executable file

@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}

1
tokenizer.json Executable file

File diff suppressed because one or more lines are too long

1
tokenizer_config.json Executable file

@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "special_tokens_map_file": "/root/.cache/huggingface/transformers/8ed73a1ab9ef4e90a9451497bf96cfc38d34354352838a143f2dda1c81aed5ca.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8", "name_or_path": "microsoft/Multilingual-MiniLM-L12-H384", "sp_model_kwargs": {}}

253
train_script.py Executable file

@@ -0,0 +1,253 @@
import gzip
import logging
import os
import random
import sys
from collections import defaultdict
from datetime import datetime
from shutil import copyfile

import numpy as np
import pytrec_eval
import torch
import tqdm
import transformers
from sentence_transformers import SentenceTransformer, util, CrossEncoder
from torch.cuda.amp import autocast
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig, AdamW
model_name = sys.argv[1]
max_length = 350
######### Evaluation
queries_filepath = 'msmarco-data/trec2019/msmarco-test2019-queries.tsv.gz'
queries_eval = {}
with gzip.open(queries_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        qid, query = line.strip().split("\t")[0:2]
        queries_eval[qid] = query
rel = defaultdict(lambda: defaultdict(int))
with open('msmarco-data/trec2019/2019qrels-pass.txt') as fIn:
    for line in fIn:
        qid, _, pid, score = line.strip().split()
        score = int(score)
        if score > 0:
            rel[qid][pid] = score
relevant_qid = []
for qid in queries_eval:
    if len(rel[qid]) > 0:
        relevant_qid.append(qid)
# Read top 1k
passage_cand = {}
with gzip.open('msmarco-data/trec2019/msmarco-passagetest2019-top1000.tsv.gz', 'rt', encoding='utf8') as fIn:
    for line in fIn:
        qid, pid, query, passage = line.strip().split("\t")
        if qid not in passage_cand:
            passage_cand[qid] = []
        passage_cand[qid].append([pid, passage])
def eval_modal(model_path):
    run = {}
    model = CrossEncoder(model_path, max_length=512)
    for qid in relevant_qid:
        query = queries_eval[qid]
        cand = passage_cand[qid]
        pids = [c[0] for c in cand]
        corpus_sentences = [c[1] for c in cand]
        ## CrossEncoder
        cross_inp = [[query, sent] for sent in corpus_sentences]
        if model.config.num_labels > 1:
            cross_scores = model.predict(cross_inp, apply_softmax=True)[:, 1].tolist()
        else:
            cross_scores = model.predict(cross_inp, activation_fct=torch.nn.Identity()).tolist()
        cross_scores_sparse = {}
        for idx, pid in enumerate(pids):
            cross_scores_sparse[pid] = cross_scores[idx]
        sparse_scores = cross_scores_sparse
        run[qid] = {}
        for pid in sparse_scores:
            run[qid][pid] = float(sparse_scores[pid])
    evaluator = pytrec_eval.RelevanceEvaluator(rel, {'ndcg_cut.10'})
    scores = evaluator.evaluate(run)
    scores_mean = np.mean([ele["ndcg_cut_10"] for ele in scores.values()])
    print("NDCG@10: {:.2f}".format(scores_mean * 100))
    return scores_mean
################################
device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = AutoConfig.from_pretrained(model_name)
config.num_labels = 1
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
#######################
queries = {}
corpus = {}
output_save_path = 'output/train_cross-encoder_mse-{}-{}'.format(model_name.replace("/", "_"), datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
output_save_path_latest = output_save_path+"-latest"
tokenizer.save_pretrained(output_save_path)
tokenizer.save_pretrained(output_save_path_latest)
# Write self to path
train_script_path = os.path.join(output_save_path, 'train_script.py')
copyfile(__file__, train_script_path)
with open(train_script_path, 'a') as fOut:
    fOut.write("\n\n# Script was called via:\n#python " + " ".join(sys.argv))
####
train_script_path = os.path.join(output_save_path_latest, 'train_script.py')
copyfile(__file__, train_script_path)
with open(train_script_path, 'a') as fOut:
    fOut.write("\n\n# Script was called via:\n#python " + " ".join(sys.argv))
#### Read train files
class MultilingualDataset(Dataset):
    def __init__(self):
        self.examples = defaultdict(lambda: defaultdict(list))  # [id][lang] => [samples...]

    def add(self, lang, filepath):
        open_method = gzip.open if filepath.endswith('.gz') else open
        with open_method(filepath, 'rt') as fIn:
            for line in fIn:
                pid, passage = line.strip().split("\t")
                self.examples[pid][lang].append(passage)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        all_examples = self.examples[item]  # All examples in all languages
        lang_examples = random.choice(list(all_examples.values()))  # Examples in one specific language
        return random.choice(lang_examples)  # One random example
train_corpus = MultilingualDataset()
train_corpus.add('en', 'msmarco-data/collection.tsv')
train_corpus.add('de', 'msmarco-data/de/collection.de.opus-mt.tsv.gz')
train_corpus.add('de', 'msmarco-data/de/collection.de.wmt19.tsv.gz')
train_queries = MultilingualDataset()
train_queries.add('en', 'msmarco-data/queries.train.tsv')
train_queries.add('de', 'msmarco-data/de/queries.train.de.opus-mt.tsv.gz')
train_queries.add('de', 'msmarco-data/de/queries.train.de.wmt19.tsv.gz')
############## MSE Dataset
class MSEDataset(Dataset):
    def __init__(self, filepath):
        super().__init__()
        self.examples = []
        with open(filepath) as fIn:
            for line in fIn:
                pos_score, neg_score, qid, pid1, pid2 = line.strip().split("\t")
                self.examples.append([qid, pid1, pid2, float(pos_score)-float(neg_score)])

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return self.examples[item]
train_batch_size = 16
train_dataset = MSEDataset('msmarco-data/bert_cat_ensemble_msmarcopassage_train_scores_ids.tsv')
train_dataloader = DataLoader(train_dataset, drop_last=True, shuffle=True, batch_size=train_batch_size)
############## Optimizer
weight_decay = 0.01
max_grad_norm = 1
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
scheduler = transformers.get_linear_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=len(train_dataloader))
scaler = torch.cuda.amp.GradScaler()
loss_fct = torch.nn.MSELoss()
### Start training
model.to(device)
auto_save = 10000
best_ndcg_score = 0
for step_idx, batch in tqdm.tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
    batch_queries = [train_queries[qid] for qid in batch[0]]
    batch_pos = [train_corpus[cid] for cid in batch[1]]
    batch_neg = [train_corpus[cid] for cid in batch[2]]
    scores = batch[3].float().to(device) #torch.tensor(batch[3], dtype=torch.float, device=device)
    with autocast():
        inp_pos = tokenizer(batch_queries, batch_pos, max_length=max_length, padding=True, truncation='longest_first', return_tensors='pt').to(device)
        pred_pos = model(**inp_pos).logits.squeeze()
        inp_neg = tokenizer(batch_queries, batch_neg, max_length=max_length, padding=True, truncation='longest_first', return_tensors='pt').to(device)
        pred_neg = model(**inp_neg).logits.squeeze()
        pred_diff = pred_pos - pred_neg
        loss_value = loss_fct(pred_diff, scores)
    scaler.scale(loss_value).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    scheduler.step()
    if (step_idx+1) % auto_save == 0:
        print("Step:", step_idx+1)
        model.save_pretrained(output_save_path_latest)
        ndcg_score = eval_modal(output_save_path_latest)
        if ndcg_score >= best_ndcg_score:
            best_ndcg_score = ndcg_score
            print("Save to:", output_save_path)
            model.save_pretrained(output_save_path)
model.save_pretrained(output_save_path)
# Script was called via:
#python train_cross-encoder_mse_multilingual.py microsoft/Multilingual-MiniLM-L12-H384
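
The training loop in `train_script.py` fits the difference between the model's scores for the positive and the negative passage to the difference of the two precomputed scores from the training TSV, using `torch.nn.MSELoss`. The objective can be sketched in plain Python as below. The function name `margin_mse`, the term "teacher" for the precomputed scores, and all numbers are illustrative assumptions, not part of the script:

```python
def margin_mse(pred_pos, pred_neg, teacher_margin):
    # Mean squared error between the model's score margin (pos - neg)
    # and the precomputed target margin, averaged over the batch.
    diffs = [(p - n) - t for p, n, t in zip(pred_pos, pred_neg, teacher_margin)]
    return sum(d * d for d in diffs) / len(diffs)

# Made-up values for a batch of two (query, positive, negative) triples.
pred_pos = [2.0, 0.5]         # model scores for the positive passages
pred_neg = [0.5, 1.0]         # model scores for the negative passages
teacher_margin = [2.0, -0.5]  # precomputed pos_score - neg_score per triple

loss = margin_mse(pred_pos, pred_neg, teacher_margin)
print(loss)  # 0.125
```

Only the margin between the two scores is supervised, so the absolute scale of the model's outputs is not constrained by this loss.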