Initialize project; model provided by the ModelHub XC community

Model: eduardofv/stsb-m-mt-es-distilbert-base-uncased Source: Original Platform
2026-05-13 17:08:20 +08:00
commit e59df12d9f
15 changed files with 30757 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,16 @@
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
--- a/0_Transformer/config.json
+++ b/0_Transformer/config.json
@@ -0,0 +1,23 @@
+{
+  "_name_or_path": "distilbert-base-uncased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertModel"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "initializer_range": 0.02,
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "transformers_version": "4.6.1",
+  "vocab_size": 30522
+}
--- a/0_Transformer/pytorch_model.bin
+++ b/0_Transformer/pytorch_model.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4247d2f9cba7380e84be9a69bb9bff9608d505966ec86f078a1ff4d0285ebaaf
+size 265490176
--- a/0_Transformer/sentence_bert_config.json
+++ b/0_Transformer/sentence_bert_config.json
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": null,
+  "do_lower_case": false
+}
--- a/0_Transformer/special_tokens_map.json
+++ b/0_Transformer/special_tokens_map.json
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
--- a/0_Transformer/tokenizer.json
+++ b/0_Transformer/tokenizer.json
--- a/0_Transformer/tokenizer_config.json
+++ b/0_Transformer/tokenizer_config.json
@@ -0,0 +1 @@
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "distilbert-base-uncased"}
--- a/0_Transformer/vocab.txt
+++ b/0_Transformer/vocab.txt
--- a/1_Pooling/config.json
+++ b/1_Pooling/config.json
@@ -0,0 +1,7 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false
+}
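The pooling config above enables only `pooling_mode_mean_tokens`, so each sentence embedding is the average of its token vectors. A minimal sketch of that idea in plain Python follows; the function name `mean_pool` and the list-based inputs are illustrative assumptions, not the library's actual implementation, which operates on batched tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding positions.

    Hypothetical helper for illustration only; it handles a single sentence.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        # No real tokens: return the zero vector instead of dividing by zero.
        return total
    return [component / count for component in total]


if __name__ == "__main__":
    # Two real tokens and one padding token (mask 0) that must be ignored.
    embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(mean_pool(embeddings, [1, 1, 0]))
```

With `word_embedding_dimension` set to 768, the real vectors would have 768 components instead of the two used here.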
--- a/README.md
+++ b/README.md
@@ -0,0 +1,51 @@
+---
+language: es
+datasets:
+- stsb_multi_mt
+tags:
+- sentence-similarity
+- sentence-transformers
+---
+# distilbert-base-uncased trained for Semantic Textual Similarity in Spanish
+
+This is a test model that was fine-tuned using the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt) in order to understand and benchmark STS models.
+
+## Model and training data description
+
+This model was built by taking `distilbert-base-uncased` and training it on a Semantic Textual Similarity task using a modified version of the STS training script from Sentence Transformers (the modified script is included in the repo). It was trained on the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt), which are the STSBenchmark datasets automatically translated into other languages using deepl.com. Refer to the dataset repository for more details.
+
+## Intended uses & limitations
+
+This model was built only as a proof of concept for STS fine-tuning on Spanish data; it has no specific intended use other than getting a sense of how this training works.
+
+## How to use
+
+You can use it like any other STS-trained model to extract sentence embeddings. See the Sentence Transformers documentation.
+
+## Training procedure
+
+Use the included script to train the base model in Spanish. You can also train a different model by passing its reference as the first argument, or train in another of the languages included in the training dataset.
+
+## Evaluation results
+
+Evaluating `distilbert-base-uncased` on the Spanish test dataset before training results in:
+
+```
+Cosine-Similarity :	Pearson: 0.2980	Spearman: 0.4008
+```
+
+While the fine-tuned version with the defaults of the training script and the Spanish training dataset results in:
+
+```
+Cosine-Similarity :	Pearson: 0.7451	Spearman: 0.7364
+```
+
+In our [STS Evaluation repository](https://github.com/eduardofv/sts_eval) we compare the performance of this model with other models from Sentence Transformers and Tensorflow Hub using the standard STSBenchmark and the 2017 STSBenchmark Task 3 for Spanish.
+
+
+## Resources
+
+- Training dataset [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt)
+- Sentence Transformers [Semantic Textual Similarity](https://www.sbert.net/examples/training/sts/README.html)
+- Check [sts_eval](https://github.com/eduardofv/sts_eval) for a comparison with Tensorflow and Sentence-Transformers models
+- Check the [development environment to run the scripts and evaluation](https://github.com/eduardofv/ai-denv)
--- a/config.json
+++ b/config.json
@@ -0,0 +1,3 @@
+{
+  "__version__": "1.2.0"
+}
--- a/modules.json
+++ b/modules.json
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "0_Transformer",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
--- a/similarity_evaluation_sts-dev_results.csv
+++ b/similarity_evaluation_sts-dev_results.csv
@@ -0,0 +1,5 @@
+epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+0,-1,0.7674348102247248,0.7667286405457256,0.7502040339902296,0.7576824147761646,0.7492861452801035,0.7561106972845474,0.7285506159415656,0.7343265392689423
+1,-1,0.7782789678703184,0.7761241007579364,0.763784529139891,0.7691237269220588,0.7626542374963032,0.7680705009011701,0.7410908722043604,0.7446238826558748
+2,-1,0.7766684878569015,0.7754911419362798,0.7610854551118094,0.764845075190592,0.7603155185939217,0.7640645508417966,0.7416572024271656,0.7459471184421463
+3,-1,0.7784200404666838,0.7767716670405521,0.7601348642559405,0.7632601234978199,0.7594430674974024,0.7626042533614712,0.7442829550102651,0.7497069964750338
--- a/similarity_evaluation_stsb-multi-mt-test_results.csv
+++ b/similarity_evaluation_stsb-multi-mt-test_results.csv
@@ -0,0 +1,2 @@
+epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+-1,-1,0.7450619922414319,0.7363506275013219,0.7336101237784383,0.7320799242026941,0.7332076483340091,0.7317512428692636,0.7022330212363639,0.6964875585742952
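The `cosine_pearson` and `cosine_spearman` columns in the result files above correlate the cosine similarity of each sentence pair with its gold score. The sketch below shows one way such scores could be computed in plain Python; the helper names are hypothetical, and the library's evaluator may differ in detail.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

In this framing, `x` would hold the predicted cosine similarities and `y` the gold similarity scores for the same sentence pairs.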
--- a/training_stsb_m_mt.py
+++ b/training_stsb_m_mt.py
@@ -0,0 +1,104 @@
+"""
+MODIFIED: (efv) Use STSb-multi-mt Spanish
+source: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py
+
+---
+
+This example fine-tunes BERT (or any other transformer model such as RoBERTa, DistilBERT, etc.) on the Spanish STSb-multi-mt data, starting from a pretrained checkpoint. It generates sentence embeddings
+that can be compared using cosine similarity to measure semantic similarity.
+
+Usage:
+python training_stsb_m_mt.py
+
+OR
+python training_stsb_m_mt.py pretrained_transformer_model_name
+"""
+from torch.utils.data import DataLoader
+import math
+from sentence_transformers import SentenceTransformer,  LoggingHandler, losses, models, util
+from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+from sentence_transformers.readers import InputExample
+import logging
+from datetime import datetime
+import sys
+import os
+import gzip
+import csv
+
+from datasets import load_dataset
+
+#### Just some code to print debug information to stdout
+logging.basicConfig(format='%(asctime)s - %(message)s',
+                    datefmt='%Y-%m-%d %H:%M:%S',
+                    level=logging.INFO,
+                    handlers=[LoggingHandler()])
+#### /print debug information to stdout
+
+
+
+#You can specify any huggingface/transformers pre-trained model here, for example, bert-base-uncased, roberta-base, xlm-roberta-base
+model_name = sys.argv[1] if len(sys.argv) > 1 else 'distilbert-base-uncased'
+
+# Training hyperparameters and output path
+train_batch_size = 16
+num_epochs = 4
+model_save_path = 'output/training_stsbenchmark_'+model_name.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+
+# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
+word_embedding_model = models.Transformer(model_name)
+
+# Apply mean pooling to get one fixed sized sentence vector
+pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
+                               pooling_mode_mean_tokens=True,
+                               pooling_mode_cls_token=False,
+                               pooling_mode_max_tokens=False)
+
+model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
+
+# Convert the dataset to a DataLoader ready for training
+logging.info("Read stsb-multi-mt train dataset")
+
+train_samples = []
+dev_samples = []
+test_samples = []
+
+def samples_from_dataset(dataset):
+    samples = [InputExample(texts=[e['sentence1'], e['sentence2']], label=e['similarity_score'] / 5) \
+        for e in dataset] 
+    return samples
+
+train_samples = samples_from_dataset(load_dataset("stsb_multi_mt", name="es", split="train"))
+dev_samples = samples_from_dataset(load_dataset("stsb_multi_mt", name="es", split="dev"))
+test_samples = samples_from_dataset(load_dataset("stsb_multi_mt", name="es", split="test"))
+
+train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
+train_loss = losses.CosineSimilarityLoss(model=model)
+
+
+logging.info("Read stsb-multi-mt dev dataset")
+evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
+
+
+# Configure the training. The dev-set evaluator runs during training (see evaluation_steps below)
+warmup_steps = math.ceil(len(train_dataloader) * num_epochs  * 0.1) #10% of the total training steps for warm-up
+logging.info("Warmup-steps: {}".format(warmup_steps))
+
+
+## Train the model
+model.fit(train_objectives=[(train_dataloader, train_loss)],
+          evaluator=evaluator,
+          epochs=num_epochs,
+          evaluation_steps=1000,
+          warmup_steps=warmup_steps,
+          output_path=model_save_path)
+
+
+##############################################################################
+#
+# Load the stored model and evaluate its performance on STS benchmark dataset
+#
+##############################################################################
+
+#model = SentenceTransformer(model_save_path)
+test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='stsb-multi-mt-test')
+test_evaluator(model, output_path=model_save_path)
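Two small computations in the script above are easy to check by hand: the gold scores are divided by 5 to map them into the [0, 1] range used by `CosineSimilarityLoss`, and the warm-up length is 10% of the total number of training steps. The sketch below reproduces both with hypothetical helper names; the sample count in the usage line is an illustrative assumption, not the size of the real dataset.

```python
import math


def normalize_score(similarity_score: float, max_score: float = 5.0) -> float:
    """Map a 0..5 STS score into 0..1, as the script does with `/ 5`."""
    return similarity_score / max_score


def warmup_steps(num_samples: int, batch_size: int, num_epochs: int,
                 fraction: float = 0.1) -> int:
    """Warm-up steps as a fraction of all training steps.

    The number of batches per epoch is rounded up, which matches the length
    of a DataLoader that keeps the final partial batch.
    """
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(batches_per_epoch * num_epochs * fraction)


if __name__ == "__main__":
    # Illustrative numbers only: 1000 samples with the script's batch size and epochs.
    print(normalize_score(2.5))
    print(warmup_steps(num_samples=1000, batch_size=16, num_epochs=4))
```

Rounding up with `math.ceil` means even very small datasets get at least one warm-up step.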