Initialize project; model provided by the ModelHub XC community

Model: eduardofv/stsb-m-mt-es-distiluse-base-multilingual-cased-v1 Source: Original Platform
2026-05-13 17:07:32 +08:00
commit cace405113
14 changed files with 119690 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,16 @@
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
--- a/1_Pooling/config.json
+++ b/1_Pooling/config.json
@@ -0,0 +1,7 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false
+}
--- a/README.md
+++ b/README.md
@@ -0,0 +1,59 @@
+---
+language: es
+datasets:
+- stsb_multi_mt
+tags:
+- sentence-similarity
+- sentence-transformers
+---
+
+
+This is a test model that was fine-tuned using the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt) in order to understand and benchmark STS models.
+
+## Model and training data description
+
+This model was built by taking `distiluse-base-multilingual-cased-v1` and training it on a Semantic Textual Similarity task using a modified version of the STS training script from Sentence Transformers (the modified script is included in the repo). It was trained on the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt), which are the STSBenchmark datasets automatically translated into other languages using deepl.com. Refer to the dataset repository for more details.
+
+## Intended uses & limitations
+
+This model was built only as a proof of concept for STS fine-tuning on Spanish data. It has no specific intended use other than getting a sense of how this training works.
+
+## How to use
+
+You may use it like any other STS-trained model to extract sentence embeddings. Check the Sentence Transformers documentation.
+
+## Training procedure
+
+This model was trained using this [Colab Notebook](https://colab.research.google.com/drive/1ZNjDMFdy_lKhnD9BtbqzSbQ4LNz638ZA?usp=sharing).
+
+## Evaluation results
+
+Evaluating `distiluse-base-multilingual-cased-v1` on the Spanish test dataset before training results in:
+
+```
+2021-07-06 17:44:46 - EmbeddingSimilarityEvaluator: Evaluating the model on  dataset:
+2021-07-06 17:45:00 - Cosine-Similarity :	Pearson: 0.7662	Spearman: 0.7583
+2021-07-06 17:45:00 - Manhattan-Distance:	Pearson: 0.7805	Spearman: 0.7772
+2021-07-06 17:45:00 - Euclidean-Distance:	Pearson: 0.7816	Spearman: 0.7778
+2021-07-06 17:45:00 - Dot-Product-Similarity:	Pearson: 0.6610	Spearman: 0.6536
+```
+
+The fine-tuned version, trained with the defaults of the training script on the Spanish training dataset, results in:
+
+```
+2021-07-06 17:49:22 - EmbeddingSimilarityEvaluator: Evaluating the model on stsb-multi-mt-test dataset:
+2021-07-06 17:49:24 - Cosine-Similarity :	Pearson: 0.8265	Spearman: 0.8207
+2021-07-06 17:49:24 - Manhattan-Distance:	Pearson: 0.8131	Spearman: 0.8190
+2021-07-06 17:49:24 - Euclidean-Distance:	Pearson: 0.8129	Spearman: 0.8190
+2021-07-06 17:49:24 - Dot-Product-Similarity:	Pearson: 0.7773	Spearman: 0.7692
+```
+
+In our [STS Evaluation repository](https://github.com/eduardofv/sts_eval), we compare the performance of this model with other models from Sentence Transformers and TensorFlow Hub using the standard STSBenchmark and the 2017 STSBenchmark Task 3 for Spanish.
+
+
+## Resources
+
+- Training dataset [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt)
+- Sentence Transformers [Semantic Textual Similarity](https://www.sbert.net/examples/training/sts/README.html)
+- Check [sts_eval](https://github.com/eduardofv/sts_eval) for a comparison with TensorFlow and Sentence Transformers models
+- Check the [development environment to run the scripts and evaluation](https://github.com/eduardofv/ai-denv)
--- a/config.json
+++ b/config.json
@@ -0,0 +1,23 @@
+{
+  "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v1",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertModel"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "initializer_range": 0.02,
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "transformers_version": "4.8.2",
+  "vocab_size": 119547
+}
--- a/config_sentence_transformers.json
+++ b/config_sentence_transformers.json
@@ -0,0 +1,7 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.0.0",
+    "transformers": "4.8.2",
+    "pytorch": "1.9.0+cu102"
+  }
+}
--- a/eval/similarity_evaluation_sts-dev_results.csv
+++ b/eval/similarity_evaluation_sts-dev_results.csv
@@ -0,0 +1,5 @@
+epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+0,-1,0.8652665968345811,0.8603021160706319,0.8497111617616812,0.8557700577868964,0.8491194556237853,0.8549563009982794,0.8385542095072176,0.8388913913494812
+1,-1,0.8645114658595832,0.860518712305324,0.8489194350105598,0.8550638501344121,0.8481446637438415,0.8541834331582855,0.8351583718649971,0.836709131838303
+2,-1,0.867030445948165,0.8634980335095029,0.8506727168486387,0.8564475574925573,0.8501374118495724,0.8558676389439143,0.8365838992705756,0.8392035046976417
+3,-1,0.8661164372932091,0.8624662517807953,0.8493156802662722,0.8550037227391485,0.848672351002434,0.854242569948961,0.8308664780710621,0.8349043949768572
--- a/modules.json
+++ b/modules.json
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0f7593234a63463168275846bce5bfe5989641084ab092f4867a8d48a9fc1337
+size 538972985
--- a/sentence_bert_config.json
+++ b/sentence_bert_config.json
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}
--- a/similarity_evaluation_stsb-multi-mt-test_results.csv
+++ b/similarity_evaluation_stsb-multi-mt-test_results.csv
@@ -0,0 +1,2 @@
+epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+-1,-1,0.8273478826973636,0.8215128959042091,0.8137184308367987,0.8197566352546802,0.8138222620547743,0.8195473589233008,0.7755088639539296,0.7692474708707059
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1 @@
+{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "max_len": 512, "special_tokens_map_file": "old_models/distiluse-base-multilingual-cased-v1/0_Transformer/special_tokens_map.json", "name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v1", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "DistilBertTokenizer"}
--- a/vocab.txt
+++ b/vocab.txt