Initialize project; model provided by the ModelHub XC community
Model: eduardofv/stsb-m-mt-es-distiluse-base-multilingual-cased-v1 Source: Original Platform
16
.gitattributes
vendored
Normal file
@@ -0,0 +1,16 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
7
1_Pooling/config.json
Normal file
@@ -0,0 +1,7 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
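The pooling configuration enables only `pooling_mode_mean_tokens`, so the sentence embedding is the average of the token embeddings. The sketch below illustrates that idea in plain Python, assuming that padding tokens are excluded through an attention mask. The function name `mean_pool` and the toy 2-dimensional vectors are illustrative only (the real model uses 768 dimensions), and this is not the library's actual implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy input (hypothetical): three tokens of dimension 2, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result because its mask entry is 0.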
59
README.md
Normal file
@@ -0,0 +1,59 @@
---
language: es
datasets:
- stsb_multi_mt
tags:
- sentence-similarity
- sentence-transformers
---


This is a test model fine-tuned on the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt) in order to understand and benchmark STS models.

## Model and training data description

This model was built by taking `distiluse-base-multilingual-cased-v1` and training it on a Semantic Textual Similarity (STS) task using a modified version of the STS training script from Sentence Transformers (the modified script is included in the repo). It was trained on the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt), which are the STSBenchmark datasets automatically translated into other languages using deepl.com. Refer to the dataset repository for more details.

## Intended uses & limitations

This model was built only as a proof of concept for STS fine-tuning on Spanish data. It has no specific use other than giving a sense of how this training works.

## How to use

You may use it like any other STS-trained model to extract sentence embeddings. See the Sentence Transformers documentation.
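
A minimal sketch of that usage, assuming the `sentence-transformers` package is installed and the model can be downloaded under the name shown in this repository. The helper names (`cosine_similarity`, `embed`, `demo`) and the two example sentences are illustrative and not part of the original model card.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(sentences,
          model_name="eduardofv/stsb-m-mt-es-distiluse-base-multilingual-cased-v1"):
    """Encode sentences into embeddings (library imported lazily)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences)  # one embedding per input sentence


def demo():
    # Hypothetical example sentences, not taken from the training data.
    sentences = [
        "Un hombre toca la guitarra.",
        "Una persona está tocando un instrumento.",
    ]
    first, second = embed(sentences)
    print(cosine_similarity(first.tolist(), second.tolist()))
```

Calling `demo()` downloads the model weights on first use, so it is left uncalled here.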

## Training procedure

This model was trained using this [Colab Notebook](https://colab.research.google.com/drive/1ZNjDMFdy_lKhnD9BtbqzSbQ4LNz638ZA?usp=sharing).

## Evaluation results

Evaluating `distiluse-base-multilingual-cased-v1` on the Spanish test dataset before training gives:

```
2021-07-06 17:44:46 - EmbeddingSimilarityEvaluator: Evaluating the model on dataset:
2021-07-06 17:45:00 - Cosine-Similarity : Pearson: 0.7662 Spearman: 0.7583
2021-07-06 17:45:00 - Manhattan-Distance: Pearson: 0.7805 Spearman: 0.7772
2021-07-06 17:45:00 - Euclidean-Distance: Pearson: 0.7816 Spearman: 0.7778
2021-07-06 17:45:00 - Dot-Product-Similarity: Pearson: 0.6610 Spearman: 0.6536
```

Fine-tuning with the training script defaults on the Spanish training dataset gives:

```
2021-07-06 17:49:22 - EmbeddingSimilarityEvaluator: Evaluating the model on stsb-multi-mt-test dataset:
2021-07-06 17:49:24 - Cosine-Similarity : Pearson: 0.8265 Spearman: 0.8207
2021-07-06 17:49:24 - Manhattan-Distance: Pearson: 0.8131 Spearman: 0.8190
2021-07-06 17:49:24 - Euclidean-Distance: Pearson: 0.8129 Spearman: 0.8190
2021-07-06 17:49:24 - Dot-Product-Similarity: Pearson: 0.7773 Spearman: 0.7692
```
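
The Pearson and Spearman figures in these logs are correlations between the gold similarity scores and the scores the model assigns to each sentence pair. The sketch below shows how such correlations can be computed in plain Python. The function names and the four gold/predicted values are hypothetical, and this is not the evaluator's own code.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold scores (0 to 5) and model similarities for four pairs.
gold = [0.0, 1.5, 3.0, 5.0]
predicted = [0.10, 0.35, 0.30, 0.90]
print(spearman(gold, predicted))
```

Spearman depends only on the ordering of the scores, which is why it is commonly reported alongside Pearson for STS.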

In our [STS Evaluation repository](https://github.com/eduardofv/sts_eval) we compare this model with other models from Sentence Transformers and TensorFlow Hub, using the standard STSBenchmark and the 2017 STSBenchmark Task 3 for Spanish.


## Resources

- Training dataset: [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt)
- Sentence Transformers: [Semantic Textual Similarity](https://www.sbert.net/examples/training/sts/README.html)
- See [sts_eval](https://github.com/eduardofv/sts_eval) for a comparison with TensorFlow and Sentence-Transformers models
- See the [development environment to run the scripts and evaluation](https://github.com/eduardofv/ai-denv)
23
config.json
Normal file
@@ -0,0 +1,23 @@
{
  "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v1",
  "activation": "gelu",
  "architectures": [
    "DistilBertModel"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.8.2",
  "vocab_size": 119547
}
7
config_sentence_transformers.json
Normal file
@@ -0,0 +1,7 @@
{
  "__version__": {
    "sentence_transformers": "2.0.0",
    "transformers": "4.8.2",
    "pytorch": "1.9.0+cu102"
  }
}
5
eval/similarity_evaluation_sts-dev_results.csv
Normal file
@@ -0,0 +1,5 @@
epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
0,-1,0.8652665968345811,0.8603021160706319,0.8497111617616812,0.8557700577868964,0.8491194556237853,0.8549563009982794,0.8385542095072176,0.8388913913494812
1,-1,0.8645114658595832,0.860518712305324,0.8489194350105598,0.8550638501344121,0.8481446637438415,0.8541834331582855,0.8351583718649971,0.836709131838303
2,-1,0.867030445948165,0.8634980335095029,0.8506727168486387,0.8564475574925573,0.8501374118495724,0.8558676389439143,0.8365838992705756,0.8392035046976417
3,-1,0.8661164372932091,0.8624662517807953,0.8493156802662722,0.8550037227391485,0.848672351002434,0.854242569948961,0.8308664780710621,0.8349043949768572
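The dev-set results file has one row per epoch, which makes it easy to check which epoch scored best. The sketch below reads the first four columns of `eval/similarity_evaluation_sts-dev_results.csv` (values copied from the rows in this diff) and picks the epoch with the highest `cosine_spearman`. The helper name `best_epoch` is illustrative, and choosing `cosine_spearman` as the selection metric is an assumption.

```python
import csv
import io

# First four columns of the dev results, copied from the rows in this diff.
DEV_RESULTS = """\
epoch,steps,cosine_pearson,cosine_spearman
0,-1,0.8652665968345811,0.8603021160706319
1,-1,0.8645114658595832,0.860518712305324
2,-1,0.867030445948165,0.8634980335095029
3,-1,0.8661164372932091,0.8624662517807953
"""


def best_epoch(csv_text, metric="cosine_spearman"):
    """Return (epoch, score) for the row with the highest value of metric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda row: float(row[metric]))
    return int(best["epoch"]), float(best[metric])


epoch, score = best_epoch(DEV_RESULTS)
print(epoch, round(score, 4))  # 2 0.8635
```

The same function works on the full file, since `csv.DictReader` ignores columns that are not requested.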
14
modules.json
Normal file
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
3
pytorch_model.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0f7593234a63463168275846bce5bfe5989641084ab092f4867a8d48a9fc1337
size 538972985
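The three lines committed for `pytorch_model.bin` are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. The sketch below parses that pointer text in plain Python. The function name `parse_lfs_pointer` and the returned dictionary keys are illustrative choices.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the pytorch_model.bin diff.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0f7593234a63463168275846bce5bfe5989641084ab092f4867a8d48a9fc1337
size 538972985
"""

info = parse_lfs_pointer(POINTER)
print(f"{info['size_bytes'] / 2**20:.1f} MiB")  # 514.0 MiB
```

The actual weights are fetched by Git LFS when the repository is checked out with LFS enabled.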
4
sentence_bert_config.json
Normal file
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
2
similarity_evaluation_stsb-multi-mt-test_results.csv
Normal file
@@ -0,0 +1,2 @@
epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
-1,-1,0.8273478826973636,0.8215128959042091,0.8137184308367987,0.8197566352546802,0.8138222620547743,0.8195473589233008,0.7755088639539296,0.7692474708707059
1
special_tokens_map.json
Normal file
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
1
tokenizer.json
Normal file
File diff suppressed because one or more lines are too long
1
tokenizer_config.json
Normal file
@@ -0,0 +1 @@
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "max_len": 512, "special_tokens_map_file": "old_models/distiluse-base-multilingual-cased-v1/0_Transformer/special_tokens_map.json", "name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v1", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "DistilBertTokenizer"}