Initialize project; model provided by the ModelHub XC community
Model: cointegrated/rubert-tiny2 Source: Original Platform
62
README.md
Normal file
@@ -0,0 +1,62 @@
---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- fill-mask
- pretraining
- embeddings
- masked-lm
- tiny
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: mit
widget:
- text: Миниатюрная модель для [MASK] разных задач.
---
This is an updated version of [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny): a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.

The differences from the previous version include:
- a larger vocabulary: 83828 tokens instead of 29564;
- longer supported sequences: 2048 tokens instead of 512;
- sentence embeddings that approximate LaBSE more closely than before;
- meaningful segment embeddings (tuned on the NLI task);
- a focus on Russian only.

The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.

Sentence embeddings can be produced as follows:

```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
# model.cuda()  # uncomment it if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input and run the encoder without tracking gradients.
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Take the hidden state of the first ([CLS]) token as the sentence embedding.
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities.
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```
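The KNN classification of short texts mentioned earlier could look roughly like the sketch below. It assumes the embeddings are already L2-normalized (as `embed_bert_cls` returns them), so a dot product equals the cosine similarity. The function name `knn_predict`, the choice of `k`, and the toy 2-dimensional vectors are illustrative assumptions, not part of the model or its libraries; in practice the vectors would be the 312-dimensional outputs of `embed_bert_cls`.

```python
from collections import Counter


def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors most similar to query.

    All vectors are assumed to be L2-normalized, so the dot product
    is the cosine similarity.
    """
    scored = [
        (sum(float(q) * float(v) for q, v in zip(query, vector)), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Sort by similarity only, highest first, and keep the k best neighbours.
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]


# Toy normalized vectors standing in for real sentence embeddings (hypothetical data).
train_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
train_labels = ["greeting", "greeting", "question", "question"]
print(knn_predict([0.96, 0.28], train_vectors, train_labels, k=3))
# greeting
```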
Alternatively, you can use the model with `sentence_transformers`:
```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('cointegrated/rubert-tiny2')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings)
```
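To compare encoded sentences, one option is to compute pairwise cosine similarities between the rows of an embedding array. The helper below is a minimal pure-Python sketch: `cosine_similarity_matrix` is a hypothetical name, and the toy vectors stand in for real model output (the rows returned by `model.encode` could be passed in the same way).

```python
import math


def cosine_similarity_matrix(vectors):
    """Return a nested list with the cosine similarity of every pair of rows.

    Assumes every row is a non-zero vector of the same length.
    """
    rows = [[float(x) for x in vector] for vector in vectors]
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    return [
        [
            sum(a * b for a, b in zip(row_i, row_j)) / (norm_i * norm_j)
            for row_j, norm_j in zip(rows, norms)
        ]
        for row_i, norm_i in zip(rows, norms)
    ]


# Toy vectors standing in for sentence embeddings (hypothetical data).
toy_embeddings = [[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]]
similarities = cosine_similarity_matrix(toy_embeddings)
for row in similarities:
    print([round(value, 3) for value in row])
```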

For those who want to run inference with [vLLM](https://docs.vllm.ai/en/latest/), there is a vLLM-optimized version of this model: [WpythonW/rubert-tiny2-vllm](https://huggingface.co/WpythonW/rubert-tiny2-vllm).