Initialize project; model provided by the ModelHub XC community

Model: sbintuitions/sarashina-embedding-v2-1b Source: Original Platform
2026-05-14 14:03:32 +08:00
commit d38a87a535
14 changed files with 410532 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,122 @@
+---
+language: 
+- ja
+license_name: sarahina-non-commercial-license
+license_link: LICENSE
+base_model:
+- sbintuitions/sarashina2.2-1b
+tags:
+- transformers
+- sentence-similarity
+- feature-extraction
+- sentence-transformers
+inference: false
+---
+
+# Sarashina-Embedding-v2-1B
+
+**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/README_JA.md)**
+
+"Sarashina-Embedding-v2-1B" is a Japanese text embedding model, based on the Japanese LLM "[Sarashina2.2-1B](https://huggingface.co/sbintuitions/sarashina2.2-1b)".
+We trained this model with multi-stage contrastive learning. It achieved the state-of-the-art average score across 28 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark), as benchmarked on July 28, 2025.
+
+This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
+
+## Model Details
+
+### Model Description
+
+- **Model Type:** Sentence Transformer
+- **Base model:** [Sarashina2.2-1B](https://huggingface.co/sbintuitions/sarashina2.2-1b)
+- **Maximum Sequence Length:** 8,192 tokens
+- **Output Dimensionality:** 1,792 dimensions
+- **Similarity Function:** Cosine Similarity
+- **Language:** Japanese
+- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/LICENSE)
+
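As a rough illustration of the similarity function listed above, the following sketch computes pairwise cosine similarity with NumPy. The helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for this example; real embeddings from this model have 1,792 dimensions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (m, d) and rows of b (n, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy vectors (hypothetical) standing in for real 1,792-dimensional embeddings.
query = np.array([[1.0, 0.0, 0.0]])
docs = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
scores = cosine_similarity(query, docs)  # shape (1, 2)
```

Because each vector is normalized first, the score depends only on direction, not on vector length.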
+### Full Model Architecture
+
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel 
+  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
+)
+```
+
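The pooling module above has `pooling_mode_lasttoken` enabled, meaning the sentence embedding is taken from the final token's hidden state. A minimal NumPy sketch of that idea follows; it assumes right-padded batches, and the function name `last_token_pool` is hypothetical. The library's own implementation may differ in details such as padding side.

```python
import numpy as np


def last_token_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token of each sequence.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    Assumes right padding (real tokens first, padding after).
    """
    last_index = attention_mask.sum(axis=1) - 1  # index of the last real token
    batch_index = np.arange(token_embeddings.shape[0])
    return token_embeddings[batch_index, last_index]


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sequence has one padding token
pooled = last_token_pool(hidden, mask)   # shape (2, 2)
```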
+## Usage
+
+First install the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library:
+
+```bash
+pip install sentence-transformers==4.0.2
+```
+
+Then you can load this model and run inference.
+
+```python
+from sentence_transformers import SentenceTransformer
+
+# Download from the 🤗 Hub
+model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")
+# Run inference
+query = [
+    'task: クエリを与えるので、与えられたWeb検索クエリに答える関連文章を検索してください。\nquery: Sarashinaのテキスト埋め込みモデルはありますか?'
+]
+texts = [
+    'text: 更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
+    'text: Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
+    'text: サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
+]
+query_embedding = model.encode(query)
+text_embeddings = model.encode(texts)
+# Get the similarity scores between the embeddings
+similarities = model.similarity(query_embedding, text_embeddings)
+print(similarities)
+# tensor([[0.7403, 0.8651, 0.8775]])
+```
+
+### How to add instructions and prefixes
+
+Queries and documents use different prefix formats. On the query side, add the prefix `task:` followed by an instruction, then `query:` followed by the query text. On the document side, add the prefix `text:`. (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.)
+
+  - Query Side: ```task: {Instruction}\nquery: {Query}```
+  - Document Side: ```text: {Document}```
+
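The two formats above can be wrapped in small helpers so that prefixes are applied consistently before calling `model.encode`. This is only a sketch: `format_query` and `format_document` are hypothetical names, not part of the model or of Sentence Transformers. The instruction and texts are taken from the usage example.

```python
def format_query(instruction: str, query: str) -> str:
    """Build a query-side input: 'task: {Instruction}\\nquery: {Query}'."""
    return f"task: {instruction}\nquery: {query}"


def format_document(document: str) -> str:
    """Build a document-side input: 'text: {Document}'."""
    return f"text: {document}"


retrieval_instruction = "クエリを与えるので、与えられたWeb検索クエリに答える関連文章を検索してください。"
query_text = format_query(retrieval_instruction, "Sarashinaのテキスト埋め込みモデルはありますか?")
doc_text = format_document("サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。")
```

Note that the `\n` between the instruction and `query:` is a real newline character, as in the usage example.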
+### Templates for instructions and prefixes
+
+The table below provides instruction and prefix templates for five main tasks. A `-` in the Document Side column means the task has no separate document side.
+
+|Task|Query Side|Document Side|
+|:-:|:-|:-|
+|Retrieval<br>Reranking|task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: |text: |
+|Clustering|task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - |
+|Classification|task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - |
+|STS|task: クエリを与えるので，もっともクエリに意味が似ている一節を探してください。\nquery: |task: クエリを与えるので，もっともクエリに意味が似ている一節を探してください。\nquery: |
+
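One possible way to use these templates in code is a lookup keyed by task name. The sketch below copies the instruction strings from the table; the names `INSTRUCTIONS`, `DOCUMENT_SIDE`, and `add_prefix` are assumptions made for this example.

```python
from typing import Dict, Optional

# Instruction strings copied from the table above.
_RETRIEVAL = "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。"
INSTRUCTIONS: Dict[str, str] = {
    "retrieval": _RETRIEVAL,
    "reranking": _RETRIEVAL,
    "clustering": "与えられたドキュメントのトピックまたはテーマを特定してください。",
    "classification": "与えられたレビューを適切な評価カテゴリに分類してください。",
    "sts": "クエリを与えるので，もっともクエリに意味が似ている一節を探してください。",
}

# Document-side handling: "text" prefix, same as the query side, or None (no document side).
DOCUMENT_SIDE: Dict[str, Optional[str]] = {
    "retrieval": "text",
    "reranking": "text",
    "clustering": None,
    "classification": None,
    "sts": "query",
}


def add_prefix(task: str, text: str, side: str = "query") -> str:
    """Return `text` with the prefix for the given task and side."""
    query_form = f"task: {INSTRUCTIONS[task]}\nquery: {text}"
    if side == "query":
        return query_form
    if side != "document":
        raise ValueError(f"unknown side: {side!r}")
    mode = DOCUMENT_SIDE[task]
    if mode is None:
        raise ValueError(f"task {task!r} has no document side")
    return query_form if mode == "query" else f"text: {text}"
```

For example, `add_prefix("retrieval", doc, side="document")` yields `text: ...`, while for `"sts"` both sides receive the same `task: ...\nquery: ...` form.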
+## Training
+
+Sarashina-Embedding-v2-1B was created through the following three-stage training process:
+
+### Stage 1: Weakly-supervised Learning
+To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets.
+
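The source does not state which contrastive objective was used. Purely as a generic illustration of contrastive learning on query-document pairs, the sketch below computes an InfoNCE-style loss with in-batch negatives in NumPy. The function name `info_nce_loss`, the temperature of 0.05, and the toy inputs are all assumptions, not details of the actual training setup.

```python
import numpy as np


def info_nce_loss(query_emb: np.ndarray, doc_emb: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: row i of doc_emb is the positive for query i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal


# Toy batch (hypothetical): matched pairs should give a much lower loss than shuffled pairs.
aligned = np.eye(3)
loss_matched = info_nce_loss(aligned, aligned)
loss_shuffled = info_nce_loss(aligned, np.roll(aligned, 1, axis=0))
```

The loss is low when each query is closest to its own document and high when the pairs are misaligned.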
+### Stage 2: Supervised Fine-tuning
+To further train the model to better understand the similarity between queries and documents, we performed fine-tuning using higher-quality data than that used in Stage 1. We also trained multiple models by varying parts of this data.
+
+### Stage 3: Model Merging
+To enhance performance, we linearly merged the weights of the two Stage 2 models that yielded the highest JMTEB scores.
+
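Linear merging can be sketched as a weighted average of matching parameters from two checkpoints. The example below uses small NumPy arrays in place of real model weights. The function name `linear_merge` and the equal weighting of 0.5 are assumptions; the source does not state the merge coefficients.

```python
from typing import Dict

import numpy as np


def linear_merge(
    state_a: Dict[str, np.ndarray],
    state_b: Dict[str, np.ndarray],
    weight_a: float = 0.5,  # assumed coefficient; not given in the source
) -> Dict[str, np.ndarray]:
    """Return weight_a * A + (1 - weight_a) * B for every shared parameter."""
    if state_a.keys() != state_b.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: weight_a * state_a[name] + (1.0 - weight_a) * state_b[name]
        for name in state_a
    }


# Toy parameter dictionaries (hypothetical) standing in for two fine-tuned models.
model_a = {"layer.weight": np.array([[1.0, 2.0], [3.0, 4.0]]), "layer.bias": np.array([0.0, 1.0])}
model_b = {"layer.weight": np.array([[3.0, 2.0], [1.0, 0.0]]), "layer.bias": np.array([2.0, 1.0])}
merged = linear_merge(model_a, model_b)
```

The same pattern applies to real checkpoints as long as both models share an architecture and parameter names.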
+## Evaluation Results (*) with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
+
+|Model|Avg.|Retrieval|STS|Classification|Reranking|Clustering|
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+|Sarashina-Embedding-v2-1B (This model)|**76.38**|**76.48**|**84.22**|77.14|**86.28**|52.56|
+|[cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|75.85|76.03|81.59|**77.65**|85.84|50.52|
+|[sbintuitions/sarashina-embedding-v1-1b](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)|74.87|74.53|81.71|77.20|84.36|50.30|
+|[OpenAI/text-embedding-3-large](https://openai.com/ja-JP/index/new-embedding-models-and-api-updates/)|73.86|71.95|82.52|77.27|83.06|51.82|
+
+(*) Evaluated on July 28, 2025.
+
+## License
+
+This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/LICENSE).
+
+**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/contact/).**