Initialize project; model provided by the ModelHub XC community

Model: Qwen/Qwen3-VL-Embedding-2B Source: Original Platform
2026-06-11 23:33:12 +08:00
commit bee4c01e77
15 changed files with 1410 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,52 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*.tfevents* filter=lfs diff=lfs merge=lfs -text
+*.db* filter=lfs diff=lfs merge=lfs -text
+*.ark* filter=lfs diff=lfs merge=lfs -text
+**/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
+**/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
+**/*ckpt*.index filter=lfs diff=lfs merge=lfs -text
+ 
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.gguf* filter=lfs diff=lfs merge=lfs -text
+*.ggml filter=lfs diff=lfs merge=lfs -text
+*.llamafile* filter=lfs diff=lfs merge=lfs -text
+*.pt2 filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+
+model.safetensors filter=lfs diff=lfs merge=lfs -text
+merges.txt filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+vocab.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,439 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: feature-extraction
+
+base_model:
+- Qwen/Qwen3-VL-2B-Instruct
+tags:
+- transformers
+- multimodal embedding
+- qwen
+- embedding
+---
+# Qwen3-VL-Embedding-2B
+
+<p align="center">
+    <img src="https://model-demo.oss-cn-hangzhou.aliyuncs.com/Qwen3-VL-Embedding.png" width="400"/>
+</p>
+
+## Highlights
+
+The **Qwen3-VL-Embedding** and **Qwen3-VL-Reranker** model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.
+
+While the Embedding model generates high-dimensional vectors for broad applications like retrieval and clustering, the Reranker model is engineered to refine these results, establishing a comprehensive pipeline for state-of-the-art multimodal search.
+
+- **Multimodal Versatility**: Both models seamlessly handle a wide range of inputs—including text, images, screenshots, and video—within a unified framework. They deliver state-of-the-art performance across diverse multimodal tasks such as image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
+
+- **Unified Representation Learning (Embedding)**: By leveraging the Qwen3-VL architecture, the Embedding model generates semantically rich vectors that capture both visual and textual information in a shared space. This facilitates efficient similarity computation and retrieval across different modalities.
+
+- **High-Precision Reranking (Reranker)**: We also introduce the Qwen3-VL-Reranker series to complement the embedding model. The reranker takes a (query, document) pair as input—where both query and document may contain arbitrary single or mixed modalities—and outputs a precise relevance score. In retrieval pipelines, the two models are typically used in tandem: the embedding model performs efficient initial recall, while the reranker refines results in a subsequent re-ranking stage. This two-stage approach significantly boosts retrieval accuracy.
+
+- **Exceptional Practicality**: Inheriting Qwen3-VL’s multilingual capabilities, the series supports over 30 languages, making it ideal for global applications. It is highly practical for real-world scenarios, offering flexible vector dimensions, customizable instructions for specific use cases, and strong performance even with quantized embeddings. These capabilities enable developers to seamlessly integrate both models into existing pipelines, unlocking powerful cross-lingual and cross-modal understanding.
+
+## Model Overview
+
+**Qwen3-VL-Embedding-2B** has the following features:
+
+- Model Type: Multimodal Embedding
+- Supported Languages: 30+ Languages
+- Supported Input Modalities: Text, images, screenshots, videos, and arbitrary multimodal combinations (e.g., text + image, text + video)
+- Number of Parameters: 2B
+- Context Length: 32k
+- Embedding Dimension: Up to 2048, supports user-defined output dimensions ranging from 64 to 2048
+
+For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [technical report](https://arxiv.org/abs/2601.04720), [blog](https://qwen.ai/blog?id=qwen3-vl-embedding), and [GitHub repository](https://github.com/QwenLM/Qwen3-VL-Embedding).
+
+## Qwen3-VL-Embedding and Qwen3-VL-Reranker Model List
+
+| Model | Size | Model Layers | Sequence Length | Embedding Dimension | Quantization Support | MRL Support | Instruction Aware |
+|---|---|---|---|---|----------------------|---|---|
+| [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) | 2B | 28 | 32K | 2048 | Yes | Yes | Yes |
+| [Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) | 8B | 36 | 32K | 4096 | Yes | Yes | Yes |
+| [Qwen3-VL-Reranker-2B](https://huggingface.co/Qwen/Qwen3-VL-Reranker-2B) | 2B | 28 | 32K | - | -  | - | Yes |
+| [Qwen3-VL-Reranker-8B](https://huggingface.co/Qwen/Qwen3-VL-Reranker-8B) | 8B | 36 | 32K | - | -  | - | Yes |
+
+> **Note**:
+> - `Quantization Support` indicates whether the output embedding supports quantization as a post-processing step.
+> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
+> - `Instruction Aware` indicates whether the embedding or reranking model supports customizing the input instruction for different tasks.
+>
+> Our evaluation indicates that, for most downstream tasks, using an instruction typically yields an improvement of 1% to 5% over not using one. We therefore recommend that developers write instructions tailored to their tasks and scenarios. In multilingual contexts, we also advise writing instructions in English, since most instructions used during training were written in English.
+
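+As an illustration of the `MRL Support` and `Quantization Support` columns, the sketch below shows one common way to post-process embeddings. It assumes the usual Matryoshka convention that a shorter embedding is obtained by keeping the leading components and re-normalizing, and it uses sign-based binarization as one possible quantization scheme. The helper names `truncate_and_normalize` and `binarize` are hypothetical and not part of this repository, and the random vectors merely stand in for real model outputs.
+
+```python
+import numpy as np
+
+
+def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
+    """Keep the first `dim` components of each row, then L2-normalize the rows."""
+    full_dim = embeddings.shape[1]
+    if not 1 <= dim <= full_dim:
+        raise ValueError(f"dim must be in [1, {full_dim}], got {dim}")
+    truncated = embeddings[:, :dim]
+    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
+    # Guard against division by zero for all-zero rows.
+    return truncated / np.maximum(norms, 1e-12)
+
+
+def binarize(embeddings: np.ndarray) -> np.ndarray:
+    """Sign-based binary quantization: one bit per dimension, packed into uint8."""
+    return np.packbits(embeddings > 0, axis=1)
+
+
+# Random stand-ins for real model outputs (2048 dimensions, as for the 2B model).
+rng = np.random.default_rng(0)
+full = rng.standard_normal((4, 2048)).astype(np.float32)
+
+small = truncate_and_normalize(full, 256)  # shape (4, 256), unit-norm rows
+packed = binarize(small)                   # shape (4, 32), dtype uint8
+print(small.shape, packed.shape)
+```
+
+Because the truncated rows are re-normalized, a plain dot product between them can still be read as cosine similarity. Whether a given dimension or quantization scheme preserves enough accuracy should be checked on your own data.
+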
+## Model Performance
+
+### Evaluation Results on [MMEB-V2](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)
+
+Results on the MMEB-V2 benchmark. All models except IFM-TTE have been re-evaluated on the updated VisDoc OOD split. CLS: classification, QA: question answering, RET: retrieval, GD: grounding, MRET: moment retrieval, VDR: ViDoRe, VR: VisRAG, OOD: out-of-distribution.
+
+| Model                      | Model Size | Image CLS | Image QA | Image RET | Image GD | Image Overall | Video CLS | Video QA | Video RET | Video MRET | Video Overall | VisDoc VDRv1 | VisDoc VDRv2 | VisDoc VR | VisDoc OOD | VisDoc Overall | All    |
+|----------------------------|---------|-------|------|------|------|-----------|------|------|------|------|------|-------|------|--------|------|------|--------|
+| **# of Datasets →**        |         | 10    | 10   | 12   | 4    | 36        | 5    | 5    | 5    | 3    | 18   | 10    | 4    | 6      | 4    | 24   | 78     |
+| VLM2Vec                    | 2B      | 58.7 | 49.3 | 65.0 | 72.9 | 59.7 | 33.4 | 30.5 | 20.6 | 30.7 | 28.6 | 49.8 | 13.5 | 51.8 | 48.2 | 44.0 | 47.7 |
+| VLM2Vec-V2                 | 2B      | 62.9 | 56.3 | 69.5 | 77.3 | 64.9 | 39.3 | 34.3 | 28.8 | 36.8 | 34.6 | 75.5 | 44.9 | 79.4 | 62.2 | 69.2 | 59.2 |
+| GME-2B                     | 2B      | 54.4 | 29.9 | 66.9 | 55.5 | 51.9 | 34.9 | 42.0 | 25.6 | 31.1 | 33.6 | 86.1 | 54.0 | 82.5 | 67.5 | 76.8 | 55.3 |
+| GME-7B                     | 7B      | 57.7 | 34.7 | 71.2 | 59.3 | 56.0 | 37.4 | 50.4 | 28.4 | 37.0 | 38.4 | 89.4 | 55.6 | 85.0 | 68.3 | 79.3 | 59.1 |
+| Ops-MM-embedding-v1        | 8B      | 69.7 | 69.6 | 73.1 | 87.2 | 72.7 | 59.7 | 62.2 | 45.7 | 43.2 | 53.8 | 80.1 | 59.6 | 79.3 | 67.8 | 74.4 | 68.9 |
+| IFM-TTE                    | 8B      | 76.7 | 78.5 | 74.6 | 89.3 | 77.9 | 60.5 | 67.9 | 51.7 | 54.9 | 59.2 | 85.2 | 71.5 | 92.7 | 53.3 | 79.5 | 74.1 |
+| RzenEmbed                  | 8B      | 70.6 | 71.7 | 78.5 | 92.1 | 75.9 | 58.8 | 63.5 | 51.0 | 45.5 | 55.7 | 89.7 | 60.7 | 88.7 | 69.9 | 81.3 | 72.9 |
+| Seed-1.6-embedding-1215    | unknown | 75.0 | 74.9 | 79.3 | 89.0 | 78.0 | 85.2 | 66.7 | 59.1 | 54.8 | 67.7 | 90.0 | 60.3 | 90.0 | 70.7 | 82.2 | 76.9 | 
+| **Qwen3-VL-Embedding-2B**  | 2B      | 70.3 | 74.3 | 74.8 | 88.5 | 75.0 | 71.9 | 64.9 | 53.9 | 53.3 | 61.9 | 84.4 | 65.3 | 86.4 | 69.4 | 79.2 | 73.2 |
+| **Qwen3-VL-Embedding-8B**  | 8B      | 74.2 | 81.1 | 80.2 | 92.3 | 80.1 | 78.4 | 71.0 | 58.7 | 56.1 | 67.1 | 87.2 | 69.9 | 88.7 | 73.3 | 82.4 | **77.8** |
+
+### Evaluation Results on [MMTEB](https://huggingface.co/spaces/mteb/leaderboard)
+
+Results on the MMTEB benchmark. 
+
+| Model                            |  Size   |  Mean (Task)  | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS  |
+|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
+| NV-Embed-v2                      |   7B    |     56.29     | 49.58       | 57.84        | 57.29  | 40.80  | 1.04         | 18.63         | 78.94        | 63.82  | 56.72  | 71.10|
+| GritLM-7B                        |   7B    |     60.92     | 53.74       | 70.53        | 61.83  | 49.75  | 3.45         | 22.77         | 79.94        | 63.78  | 58.31  | 73.33|
+| BGE-M3                           |  0.6B   |     59.56     | 52.18       | 79.11        | 60.35  | 40.88  | -3.11        | 20.1          | 80.76        | 62.79  | 54.60  | 74.12|
+| multilingual-e5-large-instruct   |  0.6B   |     63.22     | 55.08       | 80.13        | 64.94  | 50.75  | -0.40        | 22.91         | 80.86        | 62.61  | 57.12  | 76.81|
+| gte-Qwen2-1.5B-instruct          |  1.5B   |     59.45     | 52.69       | 62.51        | 58.32  | 52.05  | 0.74         | 24.02         | 81.58        | 62.58  | 60.78  | 71.61|
+| gte-Qwen2-7b-Instruct            |   7B    |     62.51     | 55.93       | 73.92        | 61.55  | 52.77  | 4.94         | 25.48         | 85.13        | 65.55  | 60.08  | 73.98|
+| text-embedding-3-large           |    -    |     58.93     | 51.41       | 62.17        | 60.27  | 46.89  | -2.68        | 22.03         | 79.17        | 63.89  | 59.27  | 71.68|
+| Cohere-embed-multilingual-v3.0   |    -    |     61.12     | 53.23       | 70.50        | 62.95  | 46.89  | -1.89        | 22.74         | 79.88        | 64.07  | 59.16  | 74.80|
+| Gemini Embedding                 |    -    |     68.37     | 59.59       | 79.28        | 71.82  | 54.59  | 5.18         | **29.16**     | 83.63        | 65.58  | 67.71  | 79.40|
+| Qwen3-Embedding-0.6B        |  0.6B   |     64.33     | 56.00       | 72.22        | 66.83  | 52.33  | 5.09         | 24.59         | 80.83        | 61.41  | 64.64  | 76.17|
+| Qwen3-Embedding-4B           |   4B    |     69.45     | 60.86       | 79.36        | 72.33  | 57.15  | **11.56**    | 26.77         | 85.05        | 65.08  | 69.60  | 80.86|
+| Qwen3-Embedding-8B          |   8B    |   **70.58**   | **61.69**   | **80.89**    | **74.00** | **57.65** | 10.06      | 28.66         | **86.40**    | **65.63** | **70.88** | **81.08** |
+| Qwen3-VL-Embedding-2B | 2B | 63.87 | 55.84 | 69.51 | 65.86 | 52.50 | 3.87 | 26.08 | 78.50 | 64.80 | 67.12 | 74.29 |
+| Qwen3-VL-Embedding-8B | 8B | 67.88 | 58.88 | 77.48 | 71.95 | 55.82 | 4.46 | 28.59 | 81.08 | 65.72 | 69.41 | 75.41 |
+
+
+## Usage
+
+- **requirements**
+```text
+transformers>=4.57.0
+qwen-vl-utils>=0.0.14
+torch==2.8.0
+```
+
+### Basic Usage Example
+
+```python
+from scripts.qwen3_vl_embedding import Qwen3VLEmbedder
+import numpy as np
+import torch
+
+# Define a list of query texts
+queries = [
+    {"text": "A woman playing with her dog on a beach at sunset."},
+    {"text": "Pet owner training dog outdoors near water."},
+    {"text": "Woman surfing on waves during a sunny day."},
+    {"text": "City skyline view from a high-rise building at night."}
+]
+
+# Define a list of document texts and images
+documents = [
+    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
+    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
+    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
+]
+
+# Specify the model path
+model_name_or_path = "Qwen/Qwen3-VL-Embedding-2B"
+
+# Initialize the Qwen3VLEmbedder model
+model = Qwen3VLEmbedder(model_name_or_path=model_name_or_path)
+# We recommend enabling flash_attention_2 for faster inference and lower memory use:
+# model = Qwen3VLEmbedder(model_name_or_path=model_name_or_path, torch_dtype=torch.float16, attn_implementation="flash_attention_2")
+
+# Combine queries and documents into a single input list
+inputs = queries + documents
+
+# Process the inputs to get embeddings
+embeddings = model.process(inputs)
+
+# Compute similarity scores between query embeddings and document embeddings
+similarity_scores = (embeddings[:4] @ embeddings[4:].T)
+
+# Print out the similarity scores in a list format
+print(similarity_scores.tolist())
+
+# [[0.8157786130905151, 0.7178360223770142, 0.7173429131507874], [0.5195091962814331, 0.3302568793296814, 0.4391537308692932], [0.3884059488773346, 0.285782128572464, 0.33141762018203735], [0.1092604324221611, 0.03871120512485504, 0.06952016055583954]]
+```
+
+For more usage examples, please visit our [GitHub repository](https://github.com/QwenLM/Qwen3-VL-Embedding).
+
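+To turn a score matrix like the one above into a per-query ranking, a minimal sketch is shown below. It copies the sample scores printed by the basic example (rows are queries, columns are documents) instead of loading the model, so it only illustrates the ranking step.
+
+```python
+import numpy as np
+
+# Scores copied from the sample output above: rows are queries, columns are documents.
+similarity_scores = np.array([
+    [0.8157786130905151, 0.7178360223770142, 0.7173429131507874],
+    [0.5195091962814331, 0.3302568793296814, 0.4391537308692932],
+    [0.3884059488773346, 0.285782128572464, 0.33141762018203735],
+    [0.1092604324221611, 0.03871120512485504, 0.06952016055583954],
+])
+
+# Sort document indices by descending score for each query.
+ranking = np.argsort(-similarity_scores, axis=1)
+
+for query_idx, doc_order in enumerate(ranking):
+    best = doc_order[0]
+    print(
+        f"query {query_idx}: ranking={doc_order.tolist()}, "
+        f"top score={similarity_scores[query_idx, best]:.3f}"
+    )
+```
+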
+### vLLM Basic Usage Example
+```python
+import argparse
+import numpy as np
+import os
+from typing import List, Dict, Any
+from vllm import LLM, EngineArgs
+from vllm.multimodal.utils import fetch_image
+
+
+# Define a list of query texts
+queries = [
+    {"text": "A woman playing with her dog on a beach at sunset."},
+    {"text": "Pet owner training dog outdoors near water."},
+    {"text": "Woman surfing on waves during a sunny day."},
+    {"text": "City skyline view from a high-rise building at night."}
+]
+
+# Define a list of document texts and images
+documents = [
+    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
+    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
+    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
+]
+
+def format_input_to_conversation(input_dict: Dict[str, Any], instruction: str = "Represent the user's input.") -> List[Dict]:
+    content = []
+    
+    text = input_dict.get('text')
+    image = input_dict.get('image')
+    
+    if image:
+        image_content = None
+        if isinstance(image, str):
+            if image.startswith(('http', 'https', 'oss')):
+                image_content = image
+            else:
+                abs_image_path = os.path.abspath(image)
+                image_content = 'file://' + abs_image_path
+        else:
+            image_content = image
+        
+        if image_content:
+            content.append({
+                'type': 'image', 
+                'image': image_content,
+            })
+    
+    if text:
+        content.append({'type': 'text', 'text': text})
+    
+    if not content:
+        content.append({'type': 'text', 'text': ""})
+    
+    conversation = [
+        {"role": "system", "content": [{"type": "text", "text": instruction}]},
+        {"role": "user", "content": content}
+    ]
+    
+    return conversation
+
+def prepare_vllm_inputs(input_dict: Dict[str, Any], llm, instruction: str = "Represent the user's input.") -> Dict[str, Any]:
+    text = input_dict.get('text')
+    image = input_dict.get('image')
+    
+    conversation = format_input_to_conversation(input_dict, instruction)
+    
+    prompt_text = llm.llm_engine.tokenizer.apply_chat_template(
+        conversation, 
+        tokenize=False, 
+        add_generation_prompt=True
+    )
+    
+    multi_modal_data = None
+    if image:
+        if isinstance(image, str):
+            if image.startswith(('http', 'https', 'oss')):
+                try:
+                    image_obj = fetch_image(image)
+                    multi_modal_data = {"image": image_obj}
+                except Exception as e:
+                    print(f"Warning: Failed to fetch image {image}: {e}")
+            else:
+                abs_image_path = os.path.abspath(image)
+                if os.path.exists(abs_image_path):
+                    from PIL import Image
+                    image_obj = Image.open(abs_image_path)
+                    multi_modal_data = {"image": image_obj}
+                else:
+                    print(f"Warning: Image file not found: {abs_image_path}")
+        else:
+            multi_modal_data = {"image": image}
+    
+    result = {
+        "prompt": prompt_text,
+        "multi_modal_data": multi_modal_data
+    }
+    return result
+
+def main():
+    parser = argparse.ArgumentParser(description="Offline Similarity Check with vLLM")
+    parser.add_argument("--model-path", type=str, default="models/Qwen3-VL-Embedding-2B", help="Path to the model")
+    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., bfloat16)")
+    args = parser.parse_args()
+
+    print(f"Loading model from {args.model_path}...")
+    
+    engine_args = EngineArgs(
+        model=args.model_path,
+        runner="pooling",
+        dtype=args.dtype,
+        trust_remote_code=True,
+    )
+    
+    llm = LLM(**vars(engine_args))
+    
+    all_inputs = queries + documents
+    vllm_inputs = [prepare_vllm_inputs(inp, llm) for inp in all_inputs]
+    
+    
+    outputs = llm.embed(vllm_inputs)
+    
+    embeddings_list = []
+    for i, output in enumerate(outputs):
+        emb = output.outputs.embedding
+        embeddings_list.append(emb)
+        print(f"Input {i} embedding shape: {len(emb)}")
+    
+    embeddings = np.array(embeddings_list)
+    print(f"\nEmbeddings shape: {embeddings.shape}")
+    
+    num_queries = len(queries)
+    query_embeddings = embeddings[:num_queries]
+    doc_embeddings = embeddings[num_queries:]
+    
+    similarity_scores = query_embeddings @ doc_embeddings.T
+    
+    print("\nSimilarity Scores:")
+    print(similarity_scores.tolist())
+    
+
+if __name__ == "__main__":
+    main()
+```
+
+### SGLang Basic Usage Example
+```python
+import argparse
+import numpy as np
+import torch
+import os
+from typing import List, Dict, Any
+from sglang.srt.entrypoints.engine import Engine
+
+# Define a list of query texts
+queries = [
+    {"text": "A woman playing with her dog on a beach at sunset."},
+    {"text": "Pet owner training dog outdoors near water."},
+    {"text": "Woman surfing on waves during a sunny day."},
+    {"text": "City skyline view from a high-rise building at night."}
+]
+
+# Define a list of document texts and images
+documents = [
+    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
+    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
+    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
+]
+
+def format_input_to_conversation(input_dict: Dict[str, Any], instruction: str = "Represent the user's input.") -> List[Dict]:
+    content = []
+    
+    text = input_dict.get('text')
+    image = input_dict.get('image')
+
+    if image:
+        image_content = None
+        if isinstance(image, str):
+            if image.startswith(('http', 'oss')):
+                image_content = image
+            else:
+                abs_image_path = os.path.abspath(image)
+                image_content = 'file://' + abs_image_path
+        else:
+            image_content = image
+        if image_content:
+            content.append({
+                'type': 'image', 'image': image_content,
+            })
+
+    if text:
+        content.append({'type': 'text', 'text': text})
+
+    if not content:
+        content.append({'type': 'text', 'text': ""})
+
+    conversation = [
+        {"role": "system", "content": [{"type": "text", "text": instruction}]},
+        {"role": "user", "content": content}
+    ]
+
+    return conversation
+
+def convert_to_sglang_format(input_dict: Dict[str, Any], engine: Engine, instruction: str = "Represent the user's input.") -> Dict[str, Any]:
+    conversation = format_input_to_conversation(input_dict, instruction)
+    
+    text_for_api = engine.tokenizer_manager.tokenizer.apply_chat_template(
+        conversation, 
+        tokenize=False, 
+        add_generation_prompt=True
+    )
+
+    result = {"text": text_for_api}
+    
+    image = input_dict.get('image')
+    if image and isinstance(image, str):
+        result["image"] = image
+        
+        
+    return result
+
+def main():
+    parser = argparse.ArgumentParser(description="Offline Similarity Check with SGLang")
+    parser.add_argument("--model-path", type=str, default="models/Qwen3-VL-Embedding-2B", help="Path to the model")
+    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., bfloat16)")
+    args = parser.parse_args()
+
+    print(f"Loading model from {args.model_path}...")
+    
+    engine = Engine(
+        model_path=args.model_path,
+        is_embedding=True,
+        dtype=args.dtype,
+        trust_remote_code=True,
+    )
+
+    inputs = queries + documents
+    sglang_inputs = [convert_to_sglang_format(inp, engine) for inp in inputs]
+    print(f"sglang_inputs: {sglang_inputs}")
+    print(f"Processing {len(sglang_inputs)} inputs...")
+
+    prompts = [inp['text'] for inp in sglang_inputs]
+    images = [inp.get('image') for inp in sglang_inputs]
+
+
+    results = engine.encode(prompts, image_data=images)
+    
+    embeddings_list = []
+    for res in results:
+        embeddings_list.append(res['embedding'])
+            
+    embeddings = np.array(embeddings_list)
+    print(f"Embeddings shape: {embeddings.shape}")
+
+    num_queries = len(queries)
+    query_embeddings = embeddings[:num_queries]
+    doc_embeddings = embeddings[num_queries:]
+    
+    similarity_scores = (query_embeddings @ doc_embeddings.T)
+
+    print("\nSimilarity Scores:")
+    print(similarity_scores.tolist())
+
+if __name__ == "__main__":
+    main()
+```
+
+## Citation
+
+If you find our work helpful, please consider citing it.
+
+```
+@article{qwen3vlembedding,
+  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
+  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
+  journal={arXiv preprint arXiv:2601.04720},
+  year={2026}
+}
+```
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1,28 @@
+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,125 @@
+{%- set default_system_message = 'Represent the user\'s input.' -%}
+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n\n' }}
+    {%- else %}
+        {{- default_system_message + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' }}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\n' + default_system_message + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- for message in messages %}
+    {%- if message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content_item in message.content %}
+                {%- if 'text' in content_item %}
+                    {{- content_item.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and message.content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,66 @@
+{
+  "architectures": [
+    "Qwen3VLForConditionalGeneration"
+  ],
+  "dtype": "bfloat16",
+  "image_token_id": 151655,
+  "model_type": "qwen3_vl",
+  "text_config": {
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "dtype": "bfloat16",
+    "eos_token_id": 151645,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 2048,
+    "initializer_range": 0.02,
+    "intermediate_size": 6144,
+    "max_position_embeddings": 262144,
+    "model_type": "qwen3_vl_text",
+    "num_attention_heads": 16,
+    "num_hidden_layers": 28,
+    "num_key_value_heads": 8,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "mrope_interleaved": true,
+      "mrope_section": [
+        24,
+        20,
+        20
+      ],
+      "rope_type": "default"
+    },
+    "rope_theta": 5000000,
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "vocab_size": 151936
+  },
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.1",
+  "use_cache": false,
+  "video_token_id": 151656,
+  "vision_config": {
+    "deepstack_visual_indexes": [
+      5,
+      11,
+      17
+    ],
+    "depth": 24,
+    "dtype": "bfloat16",
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1024,
+    "in_channels": 3,
+    "initializer_range": 0.02,
+    "intermediate_size": 4096,
+    "model_type": "qwen3_vl",
+    "num_heads": 16,
+    "num_position_embeddings": 2304,
+    "out_hidden_size": 2048,
+    "patch_size": 16,
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "vision_end_token_id": 151653,
+  "vision_start_token_id": 151652
+}
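Review note: a few quantities can be derived from the `text_config` values above as a quick sanity check. The sketch copies the relevant fields by hand (it does not load `config.json`), and the variable names are illustrative. The three `mrope_section` entries add up to half of `head_dim`, which is consistent with them partitioning the rotary frequency pairs across the temporal, height, and width axes; treat that reading as an assumption rather than a statement about the implementation.

```python
# Values copied from the text_config section of config.json
text_config = {
    "head_dim": 128,
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "mrope_section": [24, 20, 20],
}

# Width of the query projection output: one head_dim slice per query head
q_width = text_config["num_attention_heads"] * text_config["head_dim"]

# Width of the key (or value) projection output: fewer heads than queries
kv_width = text_config["num_key_value_heads"] * text_config["head_dim"]

# Number of query heads that share each key/value head (grouped-query attention)
queries_per_kv = text_config["num_attention_heads"] // text_config["num_key_value_heads"]

# Rotary frequency pairs covered by the three mrope sections
rotary_pairs = sum(text_config["mrope_section"])

print(q_width, kv_width, queries_per_kv, rotary_pairs)
```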
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
+{"framework":"Pytorch","task":"multi-modal-embedding"}
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c73fa9caeddeb3ff831d46c085a7a5708343248ca777e90f2d486964464509c1
+size 4255140312
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,39 @@
+{
+  "crop_size": null,
+  "data_format": "channels_first",
+  "default_to_square": true,
+  "device": null,
+  "disable_grouping": null,
+  "do_center_crop": null,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_pad": null,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_processor_type": "Qwen2VLImageProcessorFast",
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "input_data_format": null,
+  "max_pixels": 1310720,
+  "merge_size": 2,
+  "min_pixels": 4096,
+  "pad_size": null,
+  "patch_size": 16,
+  "processor_class": "Qwen3VLProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "return_tensors": null,
+  "size": {
+    "longest_edge": 1310720,
+    "shortest_edge": 4096
+  },
+  "temporal_patch_size": 2
+}
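Review note: with `patch_size` 16 and `merge_size` 2, each merged visual token covers a 32 x 32 pixel area, so the `min_pixels` and `max_pixels` values above translate directly into a token budget per image. The sketch below works through that arithmetic. It assumes the image has already been resized so both sides are multiples of 32 and ignores the surrounding `<|vision_start|>` / `<|vision_end|>` tokens; `visual_tokens` is a hypothetical helper, not a function from this repository.

```python
PATCH_SIZE = 16        # from preprocessor_config.json
MERGE_SIZE = 2         # from preprocessor_config.json
MIN_PIXELS = 4096      # from preprocessor_config.json
MAX_PIXELS = 1310720   # from preprocessor_config.json

# Side length, in pixels, of the area covered by one merged visual token
FACTOR = PATCH_SIZE * MERGE_SIZE


def visual_tokens(height: int, width: int) -> int:
    """Count merged visual tokens for an image whose sides are multiples of FACTOR."""
    if height % FACTOR or width % FACTOR:
        raise ValueError("height and width must be multiples of %d" % FACTOR)
    return (height // FACTOR) * (width // FACTOR)


min_tokens = MIN_PIXELS // (FACTOR * FACTOR)
max_tokens = MAX_PIXELS // (FACTOR * FACTOR)
print(min_tokens, max_tokens, visual_tokens(64, 64))
```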
--- a/scripts/qwen3_vl_embedding.py
+++ b/scripts/qwen3_vl_embedding.py
@@ -0,0 +1,337 @@
+import torch
+import torch.nn.functional as F
+import unicodedata
+import numpy as np
+import logging
+
+from PIL import Image
+from dataclasses import dataclass
+from typing import Optional, List, Union, Dict, Any
+from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLPreTrainedModel, Qwen3VLModel, Qwen3VLConfig
+from transformers.models.qwen3_vl.processing_qwen3_vl import Qwen3VLProcessor
+from transformers.modeling_outputs import ModelOutput
+from transformers.processing_utils import Unpack
+from transformers.utils import TransformersKwargs
+from transformers.cache_utils import Cache
+from transformers.utils.generic import check_model_inputs
+from qwen_vl_utils.vision_process import process_vision_info
+
+logger = logging.getLogger(__name__)
+
+# Constants for configuration
+MAX_LENGTH = 8192
+IMAGE_BASE_FACTOR = 16
+IMAGE_FACTOR = IMAGE_BASE_FACTOR * 2
+MIN_PIXELS = 4 * IMAGE_FACTOR * IMAGE_FACTOR
+MAX_PIXELS = 1800 * IMAGE_FACTOR * IMAGE_FACTOR
+FPS = 1
+MAX_FRAMES = 64
+FRAME_MAX_PIXELS = 768 * IMAGE_FACTOR * IMAGE_FACTOR
+MAX_TOTAL_PIXELS = 10 * FRAME_MAX_PIXELS
+PAD_TOKEN = "<|endoftext|>"
+
+# Define output structure for embeddings
+@dataclass
+class Qwen3VLForEmbeddingOutput(ModelOutput):
+    last_hidden_state: Optional[torch.FloatTensor] = None
+    attention_mask: Optional[torch.Tensor] = None
+
+# Define model class to compute embeddings
+class Qwen3VLForEmbedding(Qwen3VLPreTrainedModel):
+    _checkpoint_conversion_mapping = {}
+    accepts_loss_kwargs = False
+    config: Qwen3VLConfig
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = Qwen3VLModel(config)
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.get_input_embeddings()
+
+    def set_input_embeddings(self, value):
+        self.model.set_input_embeddings(value)
+
+    def set_decoder(self, decoder):
+        self.model.set_decoder(decoder)
+
+    def get_decoder(self):
+        return self.model.get_decoder()
+
+    # Extract video features from model
+    def get_video_features(self, pixel_values_videos: torch.FloatTensor,
+                           video_grid_thw: Optional[torch.LongTensor] = None):
+        return self.model.get_video_features(pixel_values_videos, video_grid_thw)
+
+    # Extract image features from model
+    def get_image_features(self, pixel_values: torch.FloatTensor,
+                           image_grid_thw: Optional[torch.LongTensor] = None):
+        return self.model.get_image_features(pixel_values, image_grid_thw)
+
+    # Make modules accessible through properties
+    @property
+    def language_model(self):
+        return self.model.language_model
+
+    @property
+    def visual(self):
+        return self.model.visual
+
+    # Forward pass through model with input parameters
+    # @check_model_inputs
+    def forward(self,
+                input_ids: Optional[torch.LongTensor] = None,
+                attention_mask: Optional[torch.Tensor] = None,
+                position_ids: Optional[torch.LongTensor] = None,
+                past_key_values: Optional[Cache] = None,
+                inputs_embeds: Optional[torch.FloatTensor] = None,
+                pixel_values: Optional[torch.Tensor] = None,
+                pixel_values_videos: Optional[torch.FloatTensor] = None,
+                image_grid_thw: Optional[torch.LongTensor] = None,
+                video_grid_thw: Optional[torch.LongTensor] = None,
+                cache_position: Optional[torch.LongTensor] = None,
+                logits_to_keep: Union[int, torch.Tensor] = 0,
+                **kwargs: Unpack[TransformersKwargs],
+    ) -> Union[tuple, Qwen3VLForEmbeddingOutput]:
+        # Pass inputs through the model
+        outputs = self.model(
+            input_ids=input_ids,
+            pixel_values=pixel_values,
+            pixel_values_videos=pixel_values_videos,
+            image_grid_thw=image_grid_thw,
+            video_grid_thw=video_grid_thw,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        # Return the model output
+        return Qwen3VLForEmbeddingOutput(
+            last_hidden_state=outputs.last_hidden_state,
+            attention_mask=attention_mask,
+        )
+
+def sample_frames(frames: List[Union[str, Image.Image]], num_segments: int, max_segments: int) -> List[Union[str, Image.Image]]:
+    duration = len(frames)
+    frame_id_array = np.linspace(0, duration - 1, num_segments, dtype=int)
+    frame_id_list = frame_id_array.tolist()
+    last_frame_id = frame_id_list[-1]
+
+    # Create a list of sampled frames
+    sampled_frames = []
+    for frame_idx in frame_id_list:
+        try:
+            sampled_frames.append(frames[frame_idx])
+        except IndexError:
+            break
+    # Ensure the sampled list meets the required segment count
+    while len(sampled_frames) < num_segments:
+        sampled_frames.append(frames[last_frame_id])
+    return sampled_frames[:max_segments]
+
+# Define embedder class for processing inputs and generating embeddings
+class Qwen3VLEmbedder():
+    def __init__(
+        self, 
+        model_name_or_path: str, 
+        max_length: int = MAX_LENGTH,
+        min_pixels: int = MIN_PIXELS,
+        max_pixels: int = MAX_PIXELS,
+        total_pixels: int = MAX_TOTAL_PIXELS,
+        fps: float = FPS,
+        num_frames: int = MAX_FRAMES,
+        max_frames: int = MAX_FRAMES,
+        default_instruction: str = "Represent the user's input.",
+        **kwargs
+    ):
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+        self.max_length = max_length
+        self.min_pixels = min_pixels
+        self.max_pixels = max_pixels
+        self.total_pixels = total_pixels
+        self.fps = fps
+        self.num_frames = num_frames
+        self.max_frames = max_frames
+
+        self.default_instruction = default_instruction
+
+        self.model = Qwen3VLForEmbedding.from_pretrained(
+            model_name_or_path, trust_remote_code=True, **kwargs
+        ).to(device)
+        self.processor = Qwen3VLProcessor.from_pretrained(
+            model_name_or_path, padding_side='right'
+        )
+        self.model.eval()
+
+    @torch.no_grad()
+    def forward(self, inputs: Dict[str, Any]) -> Dict[str, torch.Tensor]:
+        outputs = self.model(**inputs)
+        return {
+            'last_hidden_state': outputs.last_hidden_state,
+            'attention_mask': inputs.get('attention_mask')
+        }
+
+    # Truncate token sequence to a specified max length
+    def _truncate_tokens(self, token_ids: List[int], max_length: int) -> List[int]:
+        if len(token_ids) <= max_length:
+            return token_ids
+
+        special_token_ids = set(self.processor.tokenizer.all_special_ids)
+        num_special = sum(1 for token_idx in token_ids if token_idx in special_token_ids)
+        num_non_special_to_keep = max_length - num_special
+
+        final_token_ids = []
+        non_special_kept_count = 0
+        # Ensure retention of special tokens while truncating the rest
+        for token_idx in token_ids:
+            if token_idx in special_token_ids:
+                final_token_ids.append(token_idx)
+            elif non_special_kept_count < num_non_special_to_keep:
+                final_token_ids.append(token_idx)
+                non_special_kept_count += 1
+        return final_token_ids
+
+    # Format input based on provided text, image, video, and instruction
+    def format_model_input(
+        self, text: Optional[str] = None,
+        image: Optional[Union[str, Image.Image]] = None,
+        video: Optional[Union[str, List[Union[str, Image.Image]]]] = None,
+        instruction: Optional[str] = None,
+        fps: Optional[float] = None,
+        max_frames: Optional[int] = None
+    ) -> List[Dict]:
+
+        # Ensure instruction ends with punctuation
+        if instruction:
+            instruction = instruction.strip()
+            if instruction and not unicodedata.category(instruction[-1]).startswith('P'):
+                instruction = instruction + '.'
+
+        # Initialize the conversation with a system prompt (the instruction) and an empty user turn
+        content = []
+        conversation = [
+            {"role": "system", "content": [{"type": "text", "text": instruction or self.default_instruction}]},
+            {"role": "user", "content": content}
+        ]
+
+        # Add text, image, or video content to conversation
+        if not text and not image and not video:
+            content.append({'type': 'text', 'text': "NULL"})
+            return conversation
+
+        if video:
+            video_content = None
+            video_kwargs = { 'total_pixels': self.total_pixels }
+            if isinstance(video, list):
+                video_content = video
+                if self.num_frames is not None:
+                    video_content = sample_frames(video_content, self.num_frames, self.max_frames)
+                video_content = [
+                    ('file://' + ele if isinstance(ele, str) else ele) 
+                    for ele in video_content
+                ]
+            elif isinstance(video, str):
+                video_content = video if video.startswith(('http://', 'https://')) else 'file://' + video
+                video_kwargs = {'fps': fps or self.fps, 'max_frames': max_frames or self.max_frames,}
+            else:
+                raise TypeError(f"Unrecognized video type: {type(video)}")
+
+            # Add video input details to content
+            if video_content:
+                content.append({
+                    'type': 'video', 'video': video_content,
+                    **video_kwargs
+                })
+
+        if image:
+            image_content = None
+            if isinstance(image, Image.Image):
+                image_content = image
+            elif isinstance(image, str):
+                image_content = image if image.startswith(('http', 'oss')) else 'file://' + image
+            else:
+                raise TypeError(f"Unrecognized image type: {type(image)}")
+
+            # Add image input details to content
+            if image_content:
+                content.append({
+                    'type': 'image', 'image': image_content,
+                    "min_pixels": self.min_pixels,
+                    "max_pixels": self.max_pixels
+                })
+
+        if text:
+            content.append({'type': 'text', 'text': text})
+
+        return conversation
+
+    # Preprocess input conversations for model consumption
+    def _preprocess_inputs(self, conversations: List[List[Dict]]) -> Dict[str, torch.Tensor]:
+        text = self.processor.apply_chat_template(
+            conversations, add_generation_prompt=True, tokenize=False
+        )
+
+        try:
+            images, video_inputs, video_kwargs = process_vision_info(
+                conversations, image_patch_size=16,
+                return_video_metadata=True, return_video_kwargs=True
+            )
+        except Exception as e:
+            logger.error(f"Error in processing vision info: {e}")
+            images = None
+            video_inputs = None
+            video_kwargs = {'do_sample_frames': False}
+            text = self.processor.apply_chat_template(
+                [{'role': 'user', 'content': [{'type': 'text', 'text': 'NULL'}]}], 
+                add_generation_prompt=True, tokenize=False
+            )
+
+        if video_inputs is not None:
+            videos, video_metadata = zip(*video_inputs)
+            videos = list(videos)
+            video_metadata = list(video_metadata)
+        else:
+            videos, video_metadata = None, None
+
+        inputs = self.processor(
+            text=text, images=images, videos=videos, video_metadata=video_metadata, truncation=True, 
+            max_length=self.max_length, padding=True, do_resize=False, return_tensors='pt',
+            **video_kwargs
+        )
+        return inputs
+
+    # Last-token pooling: pick the hidden state at the last attended position of each sequence
+    @staticmethod
+    def _pooling_last(hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+        flipped_tensor = attention_mask.flip(dims=[1])
+        last_one_positions = flipped_tensor.argmax(dim=1)
+        col = attention_mask.shape[1] - last_one_positions - 1
+        row = torch.arange(hidden_state.shape[0], device=hidden_state.device)
+        return hidden_state[row, col]
+
+    # Embed a batch of inputs; embeddings are L2-normalized when `normalize` is True
+    def process(self, inputs: List[Dict[str, Any]], normalize: bool = True) -> torch.Tensor:
+        conversations = [self.format_model_input(
+            text=ele.get('text'),
+            image=ele.get('image'),
+            video=ele.get('video'),
+            instruction=ele.get('instruction'),
+            fps=ele.get('fps'),
+            max_frames=ele.get('max_frames')
+        ) for ele in inputs]
+
+        processed_inputs = self._preprocess_inputs(conversations)
+        processed_inputs = {k: v.to(self.model.device) for k, v in processed_inputs.items()}
+
+        outputs = self.forward(processed_inputs)
+        embeddings = self._pooling_last(outputs['last_hidden_state'], outputs['attention_mask'])
+
+        # Normalize the embeddings if specified
+        if normalize:
+            embeddings = F.normalize(embeddings, p=2, dim=-1)
+
+        return embeddings
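Review note: `_pooling_last` above finds, for each row, the last position whose attention mask is 1 by flipping the mask, taking the argmax, and converting back to an index from the left. The sketch below reproduces that index arithmetic with plain Python lists so it can be checked by hand. It is an illustration only, not the tensor code from the script; `last_attended_index` and `pool_last` are hypothetical names.

```python
from typing import List, Sequence


def last_attended_index(mask_row: Sequence[int]) -> int:
    """Index of the last 1 in a 0/1 attention-mask row."""
    flipped = list(mask_row)[::-1]
    # Equivalent of argmax on the flipped row: first position holding the maximum
    first_one_from_right = flipped.index(max(flipped))
    return len(flipped) - first_one_from_right - 1


def pool_last(hidden: List[List[str]], mask: List[List[int]]) -> List[str]:
    """Pick, for every row, the entry at the last attended position."""
    return [row[last_attended_index(m)] for row, m in zip(hidden, mask)]


# Two right-padded sequences; strings stand in for hidden-state vectors
hidden = [["a0", "a1", "a2", "pad"], ["b0", "b1", "b2", "b3"]]
mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
pooled = pool_last(hidden, mask)
print(pooled)
```

With right padding, as configured for the processor in the script, the selected position is the final real token of each sequence.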
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,31 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:def76fb086971c7867b829c23a26261e38d9d74e02139253b38aeb9df8b4b50a
+size 11423705
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,239 @@
+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
--- a/video_preprocessor_config.json
+++ b/video_preprocessor_config.json
@@ -0,0 +1,41 @@
+{
+  "crop_size": null,
+  "data_format": "channels_first",
+  "default_to_square": true,
+  "device": null,
+  "do_center_crop": null,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "do_sample_frames": true,
+  "fps": 2,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "input_data_format": null,
+  "max_frames": 768,
+  "merge_size": 2,
+  "min_frames": 4,
+  "num_frames": null,
+  "pad_size": null,
+  "patch_size": 16,
+  "processor_class": "Qwen3VLProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "return_metadata": false,
+  "size": {
+    "longest_edge": 25165824,
+    "shortest_edge": 4096
+  },
+  "temporal_patch_size": 2,
+  "video_metadata": null,
+  "video_processor_type": "Qwen3VLVideoProcessor"
+}
--- a/vocab.json
+++ b/vocab.json