Initialize project; model provided by the ModelHub XC community

Model: cointegrated/rubert-tiny2 Source: Original Platform
2026-05-14 17:22:28 +08:00
commit 13f4367ae2
14 changed files with 83989 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,62 @@
+---
+language:
+- ru
+pipeline_tag: sentence-similarity
+tags:
+- russian
+- fill-mask
+- pretraining
+- embeddings
+- masked-lm
+- tiny
+- feature-extraction
+- sentence-similarity
+- sentence-transformers
+- transformers
+license: mit
+widget:
+- text: Миниатюрная модель для [MASK] разных задач.
+---
+This is an updated version of [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny): a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.
+
+The differences from the previous version include:
+- a larger vocabulary: 83828 tokens instead of 29564;
+- longer supported sequences: 2048 tokens instead of 512;
+- sentence embeddings that approximate LaBSE more closely than before;
+- meaningful segment embeddings (tuned on the NLI task);
+- a focus on Russian only.
+
+The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.
+
+Sentence embeddings can be produced as follows:
+
+```python
+# pip install transformers sentencepiece
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
+model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
+# model.cuda()  # uncomment it if you have a GPU
+
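+def embed_bert_cls(text, model, tokenizer):
+    # Tokenize the input text
+    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+    # Run the model without gradients, on whichever device the model lives on
+    with torch.no_grad():
+        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+    # Use the hidden state of the first ([CLS]) token as the sentence embedding
+    embeddings = model_output.last_hidden_state[:, 0, :]
+    # L2-normalize so that dot products equal cosine similarities
+    embeddings = torch.nn.functional.normalize(embeddings)
+    # Return the embedding of the first (or only) input text
+    return embeddings[0].cpu().numpy()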
+
+print(embed_bert_cls('привет мир', model, tokenizer).shape)
+# (312,)
+```
+
+Alternatively, you can use the model with `sentence_transformers`:
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer('cointegrated/rubert-tiny2')
+sentences = ["привет мир", "hello world", "здравствуй вселенная"]
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+
+For those who want to run inference with [vLLM](https://docs.vllm.ai/en/latest/), there is a vLLM-optimized version of this model: [WpythonW/rubert-tiny2-vllm](https://huggingface.co/WpythonW/rubert-tiny2-vllm).
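
The README mentions KNN classification of short texts as one use of the sentence embeddings. A minimal sketch of that idea follows. It is not part of the model card: `cosine` and `knn_classify` are hypothetical helper names, and the 2-D vectors and labels are made-up stand-ins for the 312-dimensional embeddings the model would produce.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy 2-D stand-ins for real sentence embeddings (made-up values and labels).
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["greeting", "greeting", "farewell", "farewell"]

print(knn_classify([0.8, 0.3], train_vectors, train_labels, k=3))
# greeting
```

In practice the toy vectors would be replaced by the outputs of `embed_bert_cls` or `model.encode` for a set of labeled texts. Because those embeddings are already L2-normalized in the `embed_bert_cls` example, a plain dot product would give the same ranking.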