Initialize project, with the model provided by the ModelHub XC community

Model: eduardofv/stsb-m-mt-es-distilbert-base-uncased
Source: Original Platform
ModelHub XC
2026-05-13 17:08:20 +08:00
commit e59df12d9f
15 changed files with 30757 additions and 0 deletions

16
.gitattributes vendored Normal file

@@ -0,0 +1,16 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text

23
0_Transformer/config.json Normal file

@@ -0,0 +1,23 @@
{
"_name_or_path": "distilbert-base-uncased",
"activation": "gelu",
"architectures": [
"DistilBertModel"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.6.1",
"vocab_size": 30522
}


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4247d2f9cba7380e84be9a69bb9bff9608d505966ec86f078a1ff4d0285ebaaf
size 265490176
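As a rough sanity check, the `size` in this LFS pointer can be compared with the parameter count implied by `0_Transformer/config.json`. The sketch below assumes the standard DistilBERT layout (word and position embeddings with a LayerNorm, then in each layer four attention projections, two feed-forward projections, and two LayerNorms, all with biases) and 4-byte float32 weights. The pointer does not name its file, so treating it as the weights file is also an assumption.

```python
# Estimate the DistilBERT parameter count from the config values above.
# Assumes the standard DistilBERT layout and 4-byte float32 weights.
config = {"dim": 768, "hidden_dim": 3072, "n_layers": 6,
          "max_position_embeddings": 512, "vocab_size": 30522}


def distilbert_param_count(cfg):
    d, h = cfg["dim"], cfg["hidden_dim"]
    layer_norm = 2 * d                       # scale + bias
    embeddings = (cfg["vocab_size"] * d
                  + cfg["max_position_embeddings"] * d
                  + layer_norm)
    attention = 4 * (d * d + d)              # q, k, v and output projections
    feed_forward = (d * h + h) + (h * d + d)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + cfg["n_layers"] * per_layer


n_params = distilbert_param_count(config)
approx_bytes = n_params * 4
lfs_size = 265490176  # "size" field of the LFS pointer
print(n_params, approx_bytes, lfs_size - approx_bytes)
```

Under these assumptions the count comes to 66,362,880 parameters, or 265,451,520 bytes, which is within about 39 kB of the pointer size. The remainder would plausibly be serialization overhead and non-parameter buffers.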


@@ -0,0 +1,4 @@
{
"max_seq_length": null,
"do_lower_case": false
}


@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

File diff suppressed because one or more lines are too long


@@ -0,0 +1 @@
{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "distilbert-base-uncased"}

30522
0_Transformer/vocab.txt Normal file

File diff suppressed because it is too large Load Diff

7
1_Pooling/config.json Normal file

@@ -0,0 +1,7 @@
{
"word_embedding_dimension": 768,
"pooling_mode_cls_token": false,
"pooling_mode_mean_tokens": true,
"pooling_mode_max_tokens": false,
"pooling_mode_mean_sqrt_len_tokens": false
}
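With `pooling_mode_mean_tokens` as the only enabled mode, the sentence vector is the average of the token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padded positions are excluded through an attention mask; it is an illustration, not the library's implementation, and the toy vectors have dimension 2 rather than 768.

```python
# Mean pooling over token embeddings, ignoring padded positions.
def mean_pool(token_embeddings, attention_mask):
    """token_embeddings: list of per-token vectors; attention_mask: 1 = real token, 0 = padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]


# Toy input: 3 tokens of dimension 2, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padded token does not affect the result, which is why the mask matters when sentences in a batch have different lengths.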

51
README.md Normal file

@@ -0,0 +1,51 @@
---
language: es
datasets:
- stsb_multi_mt
tags:
- sentence-similarity
- sentence-transformers
---
# distilbert-base-uncased trained for Semantic Textual Similarity in Spanish
This is a test model that was fine-tuned using the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt) in order to understand and benchmark STS models.
## Model and training data description
This model was built by taking `distilbert-base-uncased` and training it on a Semantic Textual Similarity task with a modified version of the STS training script from Sentence Transformers (the modified script is included in the repo). It was trained on the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt), which are the STSBenchmark datasets machine-translated into other languages with deepl.com. Refer to the dataset repository for more details.
## Intended uses & limitations
This model was built only as a proof of concept for STS fine-tuning on Spanish data. It has no specific intended use beyond getting a sense of how this training works.
## How to use
Use it like any other STS-trained model to extract sentence embeddings. See the Sentence Transformers documentation.
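A minimal sketch of that usage, assuming the `sentence-transformers` package is installed and the model can be downloaded under the ID shown in this commit. The helper names `cosine` and `sentence_similarity` are hypothetical, introduced only for this illustration.

```python
import math

MODEL_ID = "eduardofv/stsb-m-mt-es-distilbert-base-uncased"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sentence_similarity(sentence_a, sentence_b, model_id=MODEL_ID):
    """Embed both sentences and return their cosine similarity."""
    # Imported lazily so the helper above works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)  # downloads the model on first use
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return float(cosine(emb_a, emb_b))
```

Calling `sentence_similarity("Un hombre toca la guitarra.", "Una persona está tocando una guitarra.")` should score higher than an unrelated pair would; the sentences are made up for illustration and no specific values are claimed here.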
## Training procedure
Use the included script to train the base model on the Spanish data. You can also train another model by passing its reference as the first argument, or train on one of the other languages included in the training dataset.
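The included script divides each gold score by 5 and optimizes `losses.CosineSimilarityLoss`. As a rough sketch of that objective, assuming the loss is the mean squared error between the cosine similarity of the two sentence embeddings and the scaled label, the computation looks like this on toy vectors (the function names are hypothetical):

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def scaled_label(similarity_score):
    """Map a gold STS score in [0, 5] to [0, 1], as the script does with `/ 5`."""
    return similarity_score / 5


def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and scaled gold score."""
    errors = [(cosine(u, v) - scaled_label(score)) ** 2
              for (u, v), score in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)


# Toy embeddings: an identical pair rated 5.0 and an orthogonal pair rated 0.0.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss = cosine_similarity_loss(pairs, [5.0, 0.0])
print(loss)  # 0.0
```

Both toy pairs already match their labels, so the loss is zero; during training the gradient of this error is what moves the embeddings.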
## Evaluation results
Evaluating `distilbert-base-uncased` on the Spanish test dataset before training results in:
```
Cosine-Similarity : Pearson: 0.2980 Spearman: 0.4008
```
While the fine-tuned version with the defaults of the training script and the Spanish training dataset results in:
```
Cosine-Similarity : Pearson: 0.7451 Spearman: 0.7364
```
In our [STS Evaluation repository](https://github.com/eduardofv/sts_eval) we compare the performance of this model with other models from Sentence Transformers and TensorFlow Hub using the standard STSBenchmark and the 2017 STSBenchmark Task 3 for Spanish.
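The two figures reported above are correlations between the model's cosine similarities and the gold scores. A standalone sketch of both metrics follows; it does not reproduce the library's evaluator, and the `predicted` and `gold` values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predicted cosine similarities and gold scores (0-5) for four pairs.
predicted = [0.10, 0.40, 0.35, 0.90]
gold = [0.5, 2.0, 3.0, 5.0]
print(round(pearson(predicted, gold), 4), round(spearman(predicted, gold), 4))
```

Spearman only looks at ordering, so it is unaffected by the scale of the similarity scores, while Pearson also rewards a linear relationship with the gold values.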
## Resources
- Training dataset [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt)
- Sentence Transformers [Semantic Textual Similarity](https://www.sbert.net/examples/training/sts/README.html)
- Check [sts_eval](https://github.com/eduardofv/sts_eval) for a comparison with TensorFlow and Sentence-Transformers models
- Check the [development environment to run the scripts and evaluation](https://github.com/eduardofv/ai-denv)

3
config.json Normal file

@@ -0,0 +1,3 @@
{
"__version__": "1.2.0"
}

14
modules.json Normal file

@@ -0,0 +1,14 @@
[
{
"idx": 0,
"name": "0",
"path": "0_Transformer",
"type": "sentence_transformers.models.Transformer"
},
{
"idx": 1,
"name": "1",
"path": "1_Pooling",
"type": "sentence_transformers.models.Pooling"
}
]


@@ -0,0 +1,5 @@
epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
0,-1,0.7674348102247248,0.7667286405457256,0.7502040339902296,0.7576824147761646,0.7492861452801035,0.7561106972845474,0.7285506159415656,0.7343265392689423
1,-1,0.7782789678703184,0.7761241007579364,0.763784529139891,0.7691237269220588,0.7626542374963032,0.7680705009011701,0.7410908722043604,0.7446238826558748
2,-1,0.7766684878569015,0.7754911419362798,0.7610854551118094,0.764845075190592,0.7603155185939217,0.7640645508417966,0.7416572024271656,0.7459471184421463
3,-1,0.7784200404666838,0.7767716670405521,0.7601348642559405,0.7632601234978199,0.7594430674974024,0.7626042533614712,0.7442829550102651,0.7497069964750338
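One way to read these per-epoch results is to pick the epoch with the highest cosine Spearman score. The sketch below does this with the standard `csv` module on the cosine columns copied from the rows above; using cosine Spearman as the selection metric is an assumption, not something the commit states.

```python
import csv
import io

# Cosine columns copied from the per-epoch results above.
DEV_RESULTS = """epoch,steps,cosine_pearson,cosine_spearman
0,-1,0.7674348102247248,0.7667286405457256
1,-1,0.7782789678703184,0.7761241007579364
2,-1,0.7766684878569015,0.7754911419362798
3,-1,0.7784200404666838,0.7767716670405521
"""

rows = list(csv.DictReader(io.StringIO(DEV_RESULTS)))
best = max(rows, key=lambda row: float(row["cosine_spearman"]))
gain = float(best["cosine_spearman"]) - float(rows[0]["cosine_spearman"])
print(best["epoch"], best["cosine_spearman"], round(gain, 4))
```

By this measure the last epoch (3) is the best, about 0.01 above epoch 0, so most of the improvement was already present after the first epoch.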


@@ -0,0 +1,2 @@
epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
-1,-1,0.7450619922414319,0.7363506275013219,0.7336101237784383,0.7320799242026941,0.7332076483340091,0.7317512428692636,0.7022330212363639,0.6964875585742952

104
training_stsb_m_mt.py Normal file

@@ -0,0 +1,104 @@
"""
MODIFIED: (efv) Use STSb-multi-mt Spanish
source: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py
---
This example trains BERT (or any other transformer model like RoBERTa, DistilBERT etc.) for STS on the Spanish split of stsb_multi_mt, starting from a pretrained transformer. It generates sentence embeddings
that can be compared using cosine-similarity to measure the similarity.
Usage:
python training_stsb_m_mt.py
OR
python training_stsb_m_mt.py pretrained_transformer_model_name
"""
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
import logging
from datetime import datetime
import sys
import os
import gzip
import csv
from datasets import load_dataset
#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout
#You can specify any huggingface/transformers pre-trained model here, for example, bert-base-uncased, roberta-base, xlm-roberta-base
model_name = sys.argv[1] if len(sys.argv) > 1 else 'distilbert-base-uncased'
# Training hyperparameters and output path
train_batch_size = 16
num_epochs = 4
model_save_path = 'output/training_stsbenchmark_'+model_name.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name)
# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Convert the dataset to a DataLoader ready for training
logging.info("Read stsb-multi-mt train dataset")
train_samples = []
dev_samples = []
test_samples = []
def samples_from_dataset(dataset):
    """Convert dataset rows to InputExample objects, scaling scores from [0, 5] to [0, 1]."""
    samples = [InputExample(texts=[e['sentence1'], e['sentence2']], label=e['similarity_score'] / 5)
               for e in dataset]
    return samples
train_samples = samples_from_dataset(load_dataset("stsb_multi_mt", name="es", split="train"))
dev_samples = samples_from_dataset(load_dataset("stsb_multi_mt", name="es", split="dev"))
test_samples = samples_from_dataset(load_dataset("stsb_multi_mt", name="es", split="test"))
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
logging.info("Read stsb-multi-mt dev dataset")
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
# Configure the training. The evaluator scores the dev set during training
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) #10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))
## Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
##############################################################################
#
# Load the stored model and evaluate its performance on STS benchmark dataset
#
##############################################################################
#model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='stsb-multi-mt-test')
test_evaluator(model, output_path=model_save_path)
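The schedule implied by the script's settings can be worked out by hand. The sketch below assumes the Spanish train split has 5,749 pairs (the size of the original STSbenchmark train set; this number is not stated in the script) and that the `DataLoader` keeps the final partial batch, which is its default.

```python
import math

train_examples = 5749       # assumed size of the Spanish train split
train_batch_size = 16       # from the script
num_epochs = 4              # from the script
evaluation_steps = 1000     # from the script

# Equivalent to len(train_dataloader) when the last partial batch is kept.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * 0.1)  # same formula as the script
mid_epoch_evaluations = steps_per_epoch // evaluation_steps

print(steps_per_epoch, total_steps, warmup_steps, mid_epoch_evaluations)
```

Under these assumptions there are 360 steps per epoch, 1,440 in total, and 144 warm-up steps. Because an epoch is shorter than `evaluation_steps`, no mid-epoch evaluation would be triggered if the step counter restarts each epoch, which would be consistent with every row of the dev results CSV showing `steps` as `-1`.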