[Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (#6034)
### What this PR does / why we need it?
Add docs for Qwen3-VL-Embedding & Qwen3-VL-Reranker.
- vLLM version: v0.13.0
- vLLM main: 2c24bc6996
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
127
docs/source/tutorials/Qwen3-VL-Embedding.md
Normal file
@@ -0,0 +1,127 @@
# Qwen3-VL-Embedding

## Introduction

The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built on the open-source Qwen3-VL foundation model. Designed for multimodal information retrieval and cross-modal understanding, they accept text, images, screenshots, and videos, as well as inputs that mix these modalities. This guide describes how to run the model with vLLM Ascend.

## Supported Features

Refer to [supported features](../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.

## Environment Preparation

### Model Weight

- `Qwen3-VL-Embedding-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-8B)
- `Qwen3-VL-Embedding-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-2B)

It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.

### Installation

You can use our official docker image to run the `Qwen3-VL-Embedding` series models.

- Start the docker image on your node. Refer to [using docker](../installation.md#set-up-using-docker).

If you don't want to use the docker image, you can also build everything from source:

- Install `vllm-ascend` from source. Refer to [installation](../installation.md).

## Deployment

The following uses the Qwen3-VL-Embedding-8B model as an example. Start the docker container as described in the Installation section, then run the commands below inside it.

### Online Inference

```bash
vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
```

Once the server is running, you can query the model with input prompts.

```bash
curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "input": [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
  ]
}'
```
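
The same request can also be sent from Python. The sketch below is a minimal example, assuming the server started above is listening on `127.0.0.1:8000` and returns the OpenAI-compatible layout in which each entry of `data` carries an `index` and an `embedding`. The helper names `embed` and `cosine_similarity` are illustrative and not part of vLLM.

```python
import json
import math
import urllib.request

# Assumed address of the server started with `vllm serve` above.
EMBEDDINGS_URL = "http://127.0.0.1:8000/v1/embeddings"


def embed(texts, url=EMBEDDINGS_URL):
    """POST texts to the embeddings endpoint and return one vector per text."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Sort by index so the vectors line up with the input order.
    ordered = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the server running, `vectors = embed(["text a", "text b"])` followed by `cosine_similarity(vectors[0], vectors[1])` gives a similarity score for the two inputs.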

### Offline Inference

```python
import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
                runner="pooling",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())
```

If the script runs successfully, you will see output similar to the following:

```bash
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 192.47it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=2425173) (Worker pid=2425180) INFO 01-09 00:44:40 [acl_graph.py:194] Replaying aclgraph
(EngineCore_DP0 pid=2425173) (Worker pid=2425180) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████| 4/4 [00:00<00:00, 21.34it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.9279120564460754, 0.32747742533683777], [0.4124627113342285, 0.7425257563591003]]
```
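
Each row of the printed matrix corresponds to a query and each column to a document. As a small illustration, the sketch below takes the scores from the sample output and picks the best-scoring document for each query. The variable names are illustrative. Reading the dot products as cosine similarities assumes the returned embeddings are L2-normalized.

```python
# Scores copied from the sample output: rows are queries, columns are documents.
scores = [
    [0.9279120564460754, 0.32747742533683777],
    [0.4124627113342285, 0.7425257563591003],
]

queries = ["What is the capital of China?", "Explain gravity"]

# For each query, take the index of the highest-scoring document.
best_documents = [max(range(len(row)), key=row.__getitem__) for row in scores]

for query, doc_index in zip(queries, best_documents):
    print(f"{query!r} -> document {doc_index}")
```

Each query scores highest against its matching document, which is the expected retrieval result for this toy example.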

For more examples, refer to the official vLLM examples:

- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py)
- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py)

## Performance

The following uses `Qwen3-VL-Embedding-8B` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.

Taking the `serve` subcommand as an example, run the benchmark as follows:

```bash
vllm bench serve --model Qwen/Qwen3-VL-Embedding-8B --backend openai-embeddings --dataset-name random --endpoint /v1/embeddings --random-input-len 200 --save-result --result-dir ./
```

After several minutes, the benchmark reports the performance results. With the setup in this tutorial, the result is:

```bash
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  19.53
Total input tokens:                      200000
Request throughput (req/s):              51.20
Total token throughput (tok/s):          10240.42
----------------End-to-end Latency----------------
Mean E2EL (ms):                          10360.53
Median E2EL (ms):                        10354.37
P99 E2EL (ms):                           19423.21
==================================================
```
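
The two throughput lines follow from the request count, the token count, and the duration. A quick check with the reported values (the duration is rounded to two decimals in the report, so the recomputed figures match only approximately):

```python
successful_requests = 1000
total_input_tokens = 200000
duration_s = 19.53  # rounded to two decimals in the report

request_throughput = successful_requests / duration_s
token_throughput = total_input_tokens / duration_s

print(f"{request_throughput:.2f} req/s")
print(f"{token_throughput:.2f} tok/s")
```

Both values land within rounding distance of the reported 51.20 req/s and 10240.42 tok/s.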
243
docs/source/tutorials/Qwen3-VL-Reranker.md
Normal file
@@ -0,0 +1,243 @@
# Qwen3-VL-Reranker

## Introduction

The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built on the open-source Qwen3-VL foundation model. Designed for multimodal information retrieval and cross-modal understanding, they accept text, images, screenshots, and videos, as well as inputs that mix these modalities. This guide describes how to run the model with vLLM Ascend.

## Supported Features

Refer to [supported features](../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.

## Environment Preparation

### Model Weight

- `Qwen3-VL-Reranker-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-8B)
- `Qwen3-VL-Reranker-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-2B)

It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.

### Installation

You can use our official docker image to run the `Qwen3-VL-Reranker` series models.

- Start the docker image on your node. Refer to [using docker](../installation.md#set-up-using-docker).

If you don't want to use the docker image, you can also build everything from source:

- Install `vllm-ascend` from source. Refer to [installation](../installation.md).

## Deployment

The following uses the Qwen3-VL-Reranker-8B model as an example.

### Chat Template

The Qwen3-VL-Reranker model requires a specific chat template for proper formatting. Create a file named `qwen3_vl_reranker.jinja` with the following content:

```jinja
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: {{
    messages
    | selectattr("role", "eq", "system")
    | map(attribute="content")
    | first
    | default("Given a search query, retrieve relevant candidates that answer the query.")
}}<Query>:{{
    messages
    | selectattr("role", "eq", "query")
    | map(attribute="content")
    | first
}}
<Document>:{{
    messages
    | selectattr("role", "eq", "document")
    | map(attribute="content")
    | first
}}<|im_end|>
<|im_start|>assistant

```

Save this file to a location of your choice (e.g., `./qwen3_vl_reranker.jinja`).
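
To make the template's logic concrete, the sketch below mirrors it in plain Python: it takes the first `system` message as the instruction (falling back to the default text), then the first `query` and `document` messages. This is only an approximation for illustration. `render_rerank_prompt` and `first_content` are hypothetical helpers, not vLLM APIs, and the exact whitespace may differ from what the Jinja engine produces.

```python
DEFAULT_INSTRUCTION = (
    "Given a search query, retrieve relevant candidates that answer the query."
)

SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on the Query and "
    'the Instruct provided. Note that the answer can only be "yes" or "no".'
)


def first_content(messages, role, default=""):
    """Return the content of the first message with the given role."""
    for message in messages:
        if message["role"] == role:
            return message["content"]
    return default


def render_rerank_prompt(messages):
    """Approximate the prompt that the chat template builds (illustrative only)."""
    instruction = first_content(messages, "system", DEFAULT_INSTRUCTION)
    query = first_content(messages, "query")
    document = first_content(messages, "document")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n"
        f"<Instruct>: {instruction}<Query>:{query}\n"
        f"<Document>:{document}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


# No system message is supplied, so the default instruction is used.
prompt = render_rerank_prompt([
    {"role": "query", "content": "What is the capital of China?"},
    {"role": "document", "content": "The capital of China is Beijing."},
])
print(prompt)
```

The prompt ends with the opening of the assistant turn, so the model's next token is the "yes" or "no" judgment.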

### Online Inference

Start the server with the following command:

```bash
vllm serve Qwen/Qwen3-VL-Reranker-8B \
    --runner pooling \
    --max-model-len 4096 \
    --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
    --chat-template ./qwen3_vl_reranker.jinja
```
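
The `--hf_overrides` value must be valid JSON, which is easy to get wrong when typed by hand (for example, JSON `true` versus Python `True`). One option is to generate the argument from a Python dict. The sketch below only builds and prints the command line, and uses no flags beyond those shown above.

```python
import json
import shlex

hf_overrides = {
    "architectures": ["Qwen3VLForSequenceClassification"],
    "classifier_from_token": ["no", "yes"],
    "is_original_qwen3_reranker": True,
}

command = [
    "vllm", "serve", "Qwen/Qwen3-VL-Reranker-8B",
    "--runner", "pooling",
    "--max-model-len", "4096",
    "--hf_overrides", json.dumps(hf_overrides),
    "--chat-template", "./qwen3_vl_reranker.jinja",
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

`json.dumps` takes care of the boolean spelling and the double quotes, and `shlex.join` wraps the JSON in single quotes for the shell.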

Once the server is running, you can send requests as in the following example.

```python
import requests

url = "http://127.0.0.1:8000/v1/rerank"

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

instruction = (
    "Given a search query, retrieve relevant candidates that answer the query."
)

query = "What is the capital of China?"

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

documents = [
    document_template.format(doc=doc, suffix=suffix) for doc in documents
]

response = requests.post(url,
                         json={
                             "query": query_template.format(prefix=prefix, instruction=instruction, query=query),
                             "documents": documents,
                         }).json()

print(response)
```

If the script runs successfully, the response, including one relevance score per document, is printed to the console, similar to this:

```bash
{'id': 'rerank-ac3495afa8e12404', 'model': 'Qwen/Qwen3-VL-Reranker-8B', 'usage': {'prompt_tokens': 315, 'total_tokens': 315}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.6368980407714844}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.20816077291965485}]}
```
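
The `results` list carries one entry per document, with `index` pointing back into the submitted `documents`. A small sketch of ranking documents from such a response, using a trimmed copy of the sample output (only the fields used here are kept, and the second document text is shortened):

```python
# Trimmed copy of the sample response; only the fields used here are kept.
response = {
    "model": "Qwen/Qwen3-VL-Reranker-8B",
    "results": [
        {"index": 0, "relevance_score": 0.6368980407714844},
        {"index": 1, "relevance_score": 0.20816077291965485},
    ],
}

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
]

# Sort by score, highest first, and map each result back to its document.
ranked = sorted(
    response["results"], key=lambda r: r["relevance_score"], reverse=True
)
for rank, result in enumerate(ranked, start=1):
    print(rank, f"{result['relevance_score']:.4f}", documents[result["index"]])

ranked_indices = [result["index"] for result in ranked]
```

Sorting explicitly avoids relying on the order in which the server returns the results.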

### Offline Inference

```python
from vllm import LLM

model_name = "Qwen/Qwen3-VL-Reranker-8B"

# What is the difference between the official original version and one
# that has been converted into a sequence classification model?
# Qwen3-Reranker is a language model that performs reranking by using the
# logits of the "no" and "yes" tokens.
# It needs to compute the logits of 151669 tokens, making this method extremely
# inefficient, not to mention incompatible with the vllm score API.
# A method for converting the original model into a sequence classification
# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline using this method are not only more efficient
# and compatible with the vllm score API, but also make the init parameters
# more concise, for example:
# model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")

# If you want to load the official original version, the init parameters are
# as follows.

model = LLM(
    model=model_name,
    runner="pooling",
    hf_overrides={
        # Manually route to sequence classification architecture
        # This tells vLLM to use Qwen3VLForSequenceClassification instead of
        # the default Qwen3VLForConditionalGeneration
        "architectures": ["Qwen3VLForSequenceClassification"],
        # Specify which token logits to extract from the language model head
        # The original reranker uses "no" and "yes" token logits for scoring
        "classifier_from_token": ["no", "yes"],
        # Enable special handling for original Qwen3-Reranker models
        # This flag triggers conversion logic that transforms the two token
        # vectors into a single classification vector
        "is_original_qwen3_reranker": True,
    },
)

# Why do we need hf_overrides for the official original version:
# vllm converts it to Qwen3VLForSequenceClassification when loaded for
# better performance.
# - First, we need to use `"architectures": ["Qwen3VLForSequenceClassification"]`
#   to manually route to Qwen3VLForSequenceClassification.
# - Second, we extract the vectors corresponding to classifier_from_token
#   from lm_head using `"classifier_from_token": ["no", "yes"]`.
# - Third, we convert these two vectors into one vector. The use of this
#   conversion logic is controlled by `"is_original_qwen3_reranker": True`.

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

if __name__ == "__main__":
    instruction = (
        "Given a search query, retrieve relevant candidates that answer the query."
    )

    query = "What is the capital of China?"

    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    ]

    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]

    outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)

    print([output.outputs.score for output in outputs])
```

If you run this script successfully, you will see a list of scores printed to the console, similar to this:

```bash
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2409.83it/s]
Processed prompts: 0%| | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=682882) INFO 01-20 04:38:46 [acl_graph.py:188] Replaying aclgraph
Processed prompts: 100%|████████████████████████████████████| 2/2 [00:00<00:00, 9.44it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[0.7235596776008606, 0.0002742875076364726]
```
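
The comments in the script describe collapsing the "no" and "yes" token vectors into a single classification vector. One way to see why a single vector can suffice: a softmax over the two logits, read at "yes", equals a sigmoid of their difference, so scoring with the difference of the two vectors yields the same probability. The sketch below checks this identity numerically with made-up numbers. It illustrates the math only and is not vLLM's actual conversion code.

```python
import math

# Made-up hidden state and lm_head rows for the "no" and "yes" tokens.
hidden = [0.5, -1.2, 0.8]
w_no = [0.1, 0.4, -0.3]
w_yes = [0.7, -0.2, 0.9]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Two-logit route: softmax over ["no", "yes"], take the "yes" probability.
logit_no, logit_yes = dot(hidden, w_no), dot(hidden, w_yes)
p_yes_softmax = math.exp(logit_yes) / (math.exp(logit_no) + math.exp(logit_yes))

# Single-vector route: one weight vector (yes minus no) followed by a sigmoid.
w_single = [y - n for y, n in zip(w_yes, w_no)]
p_yes_sigmoid = 1.0 / (1.0 + math.exp(-dot(hidden, w_single)))

print(p_yes_softmax, p_yes_sigmoid)
```

Both routes give the same probability up to floating-point error, while the single-vector route needs one dot product instead of logits over the full vocabulary.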

For more examples, refer to the official vLLM examples:

- [Offline Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
- [Online Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)

## Performance

The following uses `Qwen3-VL-Reranker-8B` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.

Taking the `serve` subcommand as an example, run the benchmark as follows:

```bash
vllm bench serve --model Qwen/Qwen3-VL-Reranker-8B --backend vllm-rerank --dataset-name random-rerank --endpoint /v1/rerank --random-input-len 200 --save-result --result-dir ./
```

After several minutes, the benchmark reports the performance results. With the setup in this tutorial, the result is:

```bash
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  13.70
Total input tokens:                      265122
Request throughput (req/s):              72.99
Total token throughput (tok/s):          19351.23
----------------End-to-end Latency----------------
Mean E2EL (ms):                          7474.64
Median E2EL (ms):                        7528.72
P99 E2EL (ms):                           13523.32
==================================================
```
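
The mean latency (about 7.5 s) is more than half of the whole benchmark duration, so many requests must have been in flight at once. As a rough estimate (an interpretation of the reported numbers, not something the benchmark prints), the average number of in-flight requests is the total time spent in requests divided by the wall-clock duration. This assumes the mean covers all 1000 requests and that all of them ran within the reported duration.

```python
successful_requests = 1000
mean_e2el_s = 7474.64 / 1000.0  # Mean E2EL, converted from ms to seconds
duration_s = 13.70

# Sum of all request latencies divided by wall-clock time (Little's law).
avg_in_flight = successful_requests * mean_e2el_s / duration_s
print(f"average requests in flight: {avg_in_flight:.0f}")
```

An estimate in the hundreds would indicate that the latency figures here mostly reflect queueing under load rather than the cost of a single request.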
@@ -13,7 +13,9 @@ Qwen3-VL-30B-A3B-Instruct.md
Qwen3-VL-235B-A22B-Instruct.md
Qwen3-Coder-30B-A3B.md
Qwen3_embedding.md
Qwen3-VL-Embedding.md
Qwen3_reranker.md
Qwen3-VL-Reranker.md
Qwen3-8B-W4A8.md
Qwen3-32B-W4A4.md
Qwen3-Next.md
@@ -61,7 +61,9 @@ Get the latest info here: <https://github.com/vllm-project/vllm-ascend/issues/16
| Model | Support | Note | Supported Hardware | Doc |
|-------------------------------|-----------|----------------------------------------------------------------------|--------------------------|------|
| Qwen3-Embedding | ✅ | | A2/A3 | [Qwen3_embedding](../../tutorials/Qwen3_embedding.md)|
| Qwen3-VL-Embedding | ✅ | | A2/A3 | [Qwen3-VL-Embedding](../../tutorials/Qwen3-VL-Embedding.md)|
| Qwen3-Reranker | ✅ | | A2/A3 | [Qwen3_reranker](../../tutorials/Qwen3_reranker.md)|
| Qwen3-VL-Reranker | ✅ | | A2/A3 | [Qwen3-VL-Reranker](../../tutorials/Qwen3-VL-Reranker.md)|
| Molmo | ✅ | [1942](https://github.com/vllm-project/vllm-ascend/issues/1942) | A2/A3 | |
| XLM-RoBERTa-based | ✅ | | A2/A3 | |
| Bert | ✅ | | A2/A3 | |