初始化项目,由ModelHub XC社区提供模型

Model: knowledgator/Qwen-encoder-0.5B
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-17 07:42:49 +08:00
commit 17c31d1dd5
11 changed files with 454827 additions and 0 deletions

35
.gitattributes vendored Normal file
View File

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

167
README.md Normal file
View File

@@ -0,0 +1,167 @@
---
license: apache-2.0
datasets:
- wikimedia/wikipedia
language:
- en
library_name: transformers
tags:
- LLM2Vec
- encoder
- LLM
- classification
- NER
- question-answering
---
# LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
> LLM2Vec is a simple recipe to convert decoder-only LLMs into text encoders. It consists of 3 simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance.
- **Repository:** https://github.com/McGill-NLP/llm2vec
- **Paper:** https://arxiv.org/abs/2404.05961
## Overview:
This is a bi-directional version of Qwen2-0.5B trained with masked token prediction on the Wikipedia dataset. Modern decoder models offer several advantages over classical encoders like BERT:
They are pre-trained on more recent textual corpora
They are trained on larger and more diverse datasets
Modern decoders have better support for long-context windows
Flash-attention support is available for these models
Considering these benefits, we are excited to release a series of decoder models tuned to work in a bi-directional setting. This approach combines the strengths of modern decoder architectures with the versatility of bi-directional context understanding, potentially opening up new possibilities for various natural language processing tasks, such as NER.
In comparison to original LLM2Vec we trained all weights of LLama model, it potentially improve bi-directional abilities of the model.
## Installation
```bash
pip install llm2vec
```
## Usage
```python
from llm2vec.models import Qwen2BiModel
import torch
from transformers import AutoTokenizer
# Loading base Mistral model, along with custom code that enables bidirectional connections in decoder-only LLMs. MNTP LoRA weights are merged into the base model.
tokenizer = AutoTokenizer.from_pretrained(
"knowledgator/Qwen-encoder-0.5B"
)
model = Qwen2BiModel.from_pretrained("knowledgator/Qwen-encoder-0.5B")
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
Here's an improved and expanded version of the README snippet:
## Adapting for Different Discriminative Tasks
Our bi-directional LLaMA model can be easily adapted for various discriminative tasks such as text classification, question answering, and token classification.
To use these specialized versions, we provide a [fork of LLM2Vec](https://github.com/Knowledgator/llm2vec) with additional functionality.
### Installation
To get started, clone our fork of LLM2Vec and install it:
```bash
git clone https://github.com/Knowledgator/llm2vec.git
cd llm2vec
pip install -e .
```
Using `-e` flag installs the package in editable mode, which is useful for development.
### Usage
Here's how to import and use the models for different tasks:
```python
from llm2vec import (
AutoLLMEncoderForSequenceClassification,
AutoLLMEncoderForQuestionAnswering,
AutoLLMEncoderForTokenClassification
)
# Load models for different tasks
classification_model = AutoLLMEncoderForSequenceClassification.from_pretrained('knowledgator/Qwen-encoder-0.5B')
question_answering_model = AutoLLMEncoderForQuestionAnswering.from_pretrained('knowledgator/Qwen-encoder-0.5B')
token_classification_model = AutoLLMEncoderForTokenClassification.from_pretrained('knowledgator/Qwen-encoder-0.5B')
```
### Example: Text Classification
Here's a basic example of how to use the model for text classification:
```python
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('knowledgator/Qwen-encoder-0.5B')
# Prepare input
text = "This movie is great!"
inputs = tokenizer(text, return_tensors="pt")
# Get classification logits
outputs = classification_model(**inputs)
logits = outputs.logits
# The logits can be used with a softmax function to get probabilities
# or you can use torch.argmax(logits, dim=1) to get the predicted class
```
### Fine-tuning
To fine-tune these models on your specific task:
1. Prepare your dataset in a format compatible with HuggingFace's `datasets` library.
2. Use the `Trainer` class from HuggingFace's `transformers` library to fine-tune the model.
Here's a basic example:
```python
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("your_dataset")
# Define training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
warmup_steps=500,
weight_decay=0.01,
logging_dir="./logs",
)
# Initialize Trainer
trainer = Trainer(
model=classification_model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
)
# Fine-tune the model
trainer.train()
```
### Contributing
We welcome contributions! If you have suggestions for improvements or encounter any issues, please open an issue or submit a pull request on our [GitHub repository](https://github.com/Knowledgator/llm2vec).

5
added_tokens.json Normal file
View File

@@ -0,0 +1,5 @@
{
"<|endoftext|>": 151643,
"<|im_end|>": 151645,
"<|im_start|>": 151644
}

28
config.json Normal file
View File

@@ -0,0 +1,28 @@
{
"_name_or_path": "/home/BiomikeeNew/werent4/llm2vec/output/mntp/qwen_0.5/checkpoint-120000",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 896,
"initializer_range": 0.02,
"intermediate_size": 4864,
"max_position_embeddings": 131072,
"max_window_layers": 24,
"model_type": "qwen2",
"num_attention_heads": 14,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": true,
"torch_dtype": "float32",
"transformers_version": "4.40.2",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

6
generation_config.json Normal file
View File

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 151643,
"eos_token_id": 151643,
"transformers_version": "4.40.2"
}

151388
merges.txt Normal file

File diff suppressed because it is too large Load Diff

3
model.safetensors Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d2e3b9c982e78ca3bd1f10078f138228e835e0fddcc3a84650c8f0cb23af5511
size 1976163472

21
special_tokens_map.json Normal file
View File

@@ -0,0 +1,21 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"mask_token": "_",
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

303121
tokenizer.json Normal file

File diff suppressed because it is too large Load Diff

52
tokenizer_config.json Normal file
View File

@@ -0,0 +1,52 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"62": {
"content": "_",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"bos_token": null,
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"errors": "replace",
"mask_token": "_",
"model_max_length": 32768,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long