#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ----------------------------------------------------------------------------------
# This module manages the patches for vllm. There are two folders in this module:
# - platform: contains the patches applied before the worker starts. They are applied by
#   `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in the
#   `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when the worker starts. They are applied by
#   `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
#   each worker's `__init__` function.
#
# Once a new patch is added to vllm-ascend, please add its description to this file as well.
# ----------------------------------------------------------------------------------
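#
# Illustrative note: a minimal sketch of the dispatch described above. It assumes the two patch
# packages apply their monkey patches as an import side effect; the real `adapt_patch` may differ:
#
#     def adapt_patch(is_global_patch: bool = False):
#         if is_global_patch:
#             from vllm_ascend.patch import platform  # noqa: F401  (applies platform-level patches)
#         else:
#             from vllm_ascend.patch import worker    # noqa: F401  (applies worker-level patches)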
#
# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** 1. File: platform/patch_distributed.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
#       Why:
#           Tensor alignment is required for 310P devices.
#       How:
#           Rewrite `all_reduce` and `broadcast` in `torch.distributed`.
#       Related PR (if no, explain why):
#           No, not ready yet.
#       Future Plan:
#           Find a better way to support tensor alignment for 310P without this patch.
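#
#       Illustrative note: a minimal sketch of what such a rewrite could look like; the padding
#       rule (multiple of 32 elements) is an assumption, not the real alignment requirement:
#
#           import torch
#           import torch.distributed as dist
#
#           _orig_all_reduce = dist.all_reduce
#           _ALIGN = 32  # assumed element alignment, illustrative only
#
#           def _aligned_all_reduce(tensor: torch.Tensor, *args, **kwargs):
#               # Pad to a multiple of _ALIGN elements, reduce, then copy the result back.
#               numel = tensor.numel()
#               pad = (-numel) % _ALIGN
#               if pad == 0:
#                   return _orig_all_reduce(tensor, *args, **kwargs)
#               padded = torch.cat([tensor.flatten(), tensor.new_zeros(pad)])
#               _orig_all_reduce(padded, *args, **kwargs)
#               tensor.copy_(padded[:numel].view_as(tensor))
#
#           dist.all_reduce = _aligned_all_reduce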
#
# ** 2. File: platform/patch_ec_connector.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.distributed.ec_transfer.ec_connector.shared_storage_connector.ECSharedStorageConnector.start_load_caches`
#       Why:
#           The device is hard-coded to cuda.
#       How:
#           Change cuda to npu.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/30225
#       Future Plan:
#           Remove this patch when vLLM merges the PR.
#
# ** 3. File: platform/patch_mamba_config.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.models.config.HybridAttentionMambaModelConfig.verify_and_update_config`
#       Why:
#           The block size is set to 16 in vLLM, which is not supported by Ascend.
#       How:
#           Set the block size to 128 on NPU.
#       Related PR (if no, explain why):
#           No, we'll fix this in vLLM soon.
#       Future Plan:
#           Remove this patch when vLLM merges the PR.
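#
#       Illustrative note: a minimal sketch of wrapping the original classmethod to force the
#       NPU-friendly block size; the config attribute touched here is an assumption:
#
#           from vllm.model_executor.models import config as model_config
#
#           _orig_verify = model_config.HybridAttentionMambaModelConfig.verify_and_update_config
#
#           @classmethod
#           def _npu_verify_and_update_config(cls, vllm_config):
#               _orig_verify(vllm_config)                   # run the stock checks first
#               vllm_config.cache_config.block_size = 128   # assumed field; NPU-friendly block size
#
#           model_config.HybridAttentionMambaModelConfig.verify_and_update_config = (
#               _npu_verify_and_update_config)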
#
# ** 4. File: platform/patch_multiproc_executor.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.v1.executor.multiproc_executor.MultiprocExecutor`
#       Why:
#           vLLM creates child processes with daemon=True, which doesn't work in the EPLB case,
#           since EPLB needs to spawn a process of its own and daemon processes are not allowed
#           to have children.
#       How:
#           Set daemon=False in MultiprocExecutor.
#       Related PR (if no, explain why):
#           No, we need to find a way to support daemon=False in vLLM.
#       Future Plan:
#           Remove this patch when vLLM fixes the issue.
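#
#       Illustrative note: a small self-contained demo of the underlying restriction (daemonic
#       processes may not have children); this is not taken from vLLM code:
#
#           import multiprocessing as mp
#
#           def child():
#               # A daemonic parent would fail here with
#               # "daemonic processes are not allowed to have children".
#               grandchild = mp.Process(target=print, args=("hello from grandchild",))
#               grandchild.start()
#               grandchild.join()
#
#           if __name__ == "__main__":
#               worker = mp.Process(target=child, daemon=False)  # daemon=False is the point of the patch
#               worker.start()
#               worker.join()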
#
# ** 5. File: platform/patch_sched_yield.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.distributed.utils.USE_SCHED_YIELD`
#       Why:
#           os.sched_yield() doesn't work on ARM systems.
#       How:
#           Avoid using os.sched_yield() on ARM systems.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/30228
#       Future Plan:
#           Remove this patch when vLLM merges the PR.
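#
#       Illustrative note: a minimal sketch of flipping the flag based on the CPU architecture;
#       the exact condition used by the real patch may differ:
#
#           import platform
#           import vllm.distributed.utils as dist_utils
#
#           if platform.machine().lower() in ("aarch64", "arm64"):
#               # Fall back to sleep-based waiting instead of os.sched_yield() on ARM.
#               dist_utils.USE_SCHED_YIELD = False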
#
# ** 6. File: platform/patch_balance_schedule.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.v1.engine.core.EngineCoreProc.run_engine_core`
#       `vllm.v1.core.sched.scheduler.Scheduler`
#       Why:
#           vLLM v1 scheduling enables chunked prefill by default, which processes prefill and
#           decode requests in the same scheduling step. This can hurt the overall system
#           throughput and performance in some scenarios. Balance scheduling synchronizes the
#           number of running requests across data-parallel schedulers to delay the scheduling
#           of new requests, improving steady-state decode time and overall output throughput.
#       How:
#           Set the environment variable VLLM_ASCEND_BALANCE_SCHEDULING=1 in the startup script
#           to enable the patched scheduler.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/29721
#       Future Plan:
#           Remove this patch when vLLM merges the PR.
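#
#       Illustrative note: a minimal sketch of the env-var gating described above; the guard
#       shown here is an assumption about how the patch is enabled, not the exact code:
#
#           import os
#
#           def _balance_scheduling_enabled() -> bool:
#               # The patch only takes effect when the startup script exports
#               # VLLM_ASCEND_BALANCE_SCHEDULING=1.
#               return os.getenv("VLLM_ASCEND_BALANCE_SCHEDULING", "0") == "1"
#
#           if _balance_scheduling_enabled():
#               # ... swap in the balance-scheduling Scheduler / run_engine_core here ...
#               pass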
#
# ** 7. File: platform/patch_compile_backend.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.compilation.backends.PiecewiseCompileInterpreter`
#       `vllm.compilation.piecewise_backend.PiecewiseBackend`
#       Why:
#           vLLM removed the compiled graph for the general shape, which causes operator fusion
#           to fail. This affects the performance of model inference on Ascend.
#       How:
#           Recover the compiled graph for the dynamic shape in PiecewiseBackend.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/24252
#       Future Plan:
#           Remove this patch when the problem is fixed.
#
# * Worker Patch:
# ===============
#
# ** 1. File: worker/patch_deepseek.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `DeepseekV2Model.forward`
#       Why:
#           getattr(self.config, "llama_4_scaling", None) raises AttributeError on NPU in
#           graph mode.
#       How:
#           Catch the AttributeError and set llama_4_scaling to None.
#       Related PR (if no, explain why):
#           No, this is a bug in vLLM Ascend.
#       Future Plan:
#           Find the root cause of this bug and fix it in vLLM Ascend.
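#
#       Illustrative note: a minimal sketch of the guarded lookup described above, wrapped in a
#       hypothetical helper rather than the full patched forward:
#
#           def _get_llama_4_scaling(config):
#               try:
#                   return getattr(config, "llama_4_scaling", None)
#               except AttributeError:
#                   # In graph mode on NPU the getattr above can still raise, so fall back explicitly.
#                   return None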
#
# ** 2. File: worker/patch_distributed.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.distributed.parallel_state.GroupCoordinator`
#       Why:
#           vLLM doesn't support all_to_all for GroupCoordinator.
#       How:
#           Add an all_to_all implementation to GroupCoordinator.
#       Related PR (if no, explain why):
#           No, we should use the vLLM all2all manager to support all_to_all for NPU.
#       Future Plan:
#           Remove this patch when the refactor of the all2all manager is done.
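#
#       Illustrative note: a minimal sketch of what an added all_to_all method could look like,
#       built on `torch.distributed.all_to_all_single`; the real patch may differ:
#
#           import torch
#           import torch.distributed as dist
#           from vllm.distributed.parallel_state import GroupCoordinator
#
#           def _all_to_all(self, input_: torch.Tensor) -> torch.Tensor:
#               # Evenly exchange slices of `input_` across all ranks of this group.
#               output = torch.empty_like(input_)
#               dist.all_to_all_single(output, input_, group=self.device_group)
#               return output
#
#           GroupCoordinator.all_to_all = _all_to_all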
#
# ** 3. File: worker/patch_minicpm.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
#       Why:
#           The forward func of MiniCPMAttention in vLLM converts the data type to float32
#           to ensure precision on CUDA. However, float32 is not supported by the CANN rope op,
#           so we keep this patch.
#       How:
#           Remove the dtype conversion operations in forward.
#       Related PR (if no, explain why):
#           No, this is only needed on NPU because of the rope op.
#       Future Plan:
#           Keep this patch in vllm-ascend.
#
# ** 4. File: worker/patch_multimodal_merge.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
#       Why:
#           The '_merge_multimodal_embeddings' func of vLLM is incompatible with Ascend.
#       How:
#           Replace it with a CPU operation that can be executed asynchronously.
#       Related PR (if no, explain why):
#           No, this is an Ascend-only bug and can't be fixed in vLLM.
#       Future Plan:
#           Identify this pattern in torch-npu and remove this patch.
#
# ** 5. File: worker/patch_roberta.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.models.bert`
#       Why:
#           The shift operations in `_encode_token_type_ids` and `_decode_token_type_ids` cannot
#           run in Ascend aclgraph mode.
#       How:
#           Replace the shift operations with multiplication and division.
#       Related PR (if no, explain why):
#           No, this needs CANN to add an aclnn shift operation.
#       Future Plan:
#           Revert this when CANN supports the shift aclnn operation.
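#
#       Illustrative note: the arithmetic equivalence the replacement relies on, shown with
#       hypothetical tensors rather than the actual encode/decode helpers:
#
#           import torch
#
#           ids = torch.tensor([5, 12, 300])
#           shifted = ids << 1          # shift form: not supported in aclgraph mode on Ascend
#           scaled = ids * 2            # equivalent multiplication form
#           assert torch.equal(shifted, scaled)
#
#           restored = scaled >> 1      # shift form
#           divided = scaled // 2       # equivalent (floor) division form
#           assert torch.equal(restored, divided)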
#
# ** 6. File: worker/patch_triton.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.layers.mamba.ops`, `vllm.model_executor.layers.fla.ops`
#       Why:
#           The triton ops in vLLM do not perform well on NPU, and there is no dispatch mechanism
#           for triton ops.
#       How:
#           Override the triton ops in vLLM with an Ascend implementation.
#       Related PR (if no, explain why):
#           No, vLLM needs to support dispatching triton ops first.
#       Future Plan:
#           Remove this patch when vLLM supports the dispatch function.
#
# ** 7. File: worker/patch_weight_loader.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.layers.linear.UnquantizedLinearMethod`
#       Why:
#           vLLM Ascend doesn't work with weight loader v2.
#       How:
#           Patch it to fix the bug.
#       Related PR (if no, explain why):
#           No, this is an Ascend-only bug. We should fix it soon.
#       Future Plan:
#           Remove this patch when the bug is fixed.
#
# ** 8. File: worker/patch_qwen3_next_mtp.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.v1.worker.utils.bind_kv_cache`
#       Why:
#           The 'bind_kv_cache' func raises an exception when current_platform is npu.
#       How:
#           Replace it with a new bind_kv_cache that skips the raise.
#       Related PR (if no, explain why):
#           No, it needs discussion.
#       Future Plan:
#           Remove this patch after discussing with the vLLM community and adapting
#           bind_kv_cache to NPU.
#
# ** 9. File: worker/patch_module.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.v1.attention.backends.gdn_attn.torch.argsort`
#       Why:
#           1. The 'torch.argsort' func on NPU does not support bool.
#           2. Without `stable=True`, the output will contain a lot of redundant tokens.
#       How:
#           Replace it with a new torch.argsort that casts the input to torch.int32 and does a
#           stable sort.
#       Related PR (if no, explain why):
#           1. It depends on torch_npu.
#           2. https://github.com/vllm-project/vllm/pull/30632
#       Future Plan:
#           1. Remove this patch when bool is supported by the 'torch.argsort' func on NPU.
#           2. Make 'torch.argsort' in `vllm.v1.attention.backends.gdn_attn` stable.
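#
#       Illustrative note: a minimal sketch of the wrapper described above, with the cast and
#       stable flag; the real replacement in patch_module.py may differ:
#
#           import torch
#
#           _orig_argsort = torch.argsort
#
#           def _npu_argsort(input: torch.Tensor, *args, **kwargs) -> torch.Tensor:
#               # Cast bool inputs to int32 (bool is unsupported on NPU) and force a stable sort
#               # so equal keys keep their original order.
#               if input.dtype == torch.bool:
#                   input = input.to(torch.int32)
#               kwargs.setdefault("stable", True)
#               return _orig_argsort(input, *args, **kwargs)
#
#           torch.argsort = _npu_argsort  # as seen by vllm.v1.attention.backends.gdn_attn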
#
# ** 10. File: worker/patch_rejection_sampler.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.v1.sample.rejection_sampler`
#       Why:
#           Some functions from `rejection_sampler` are not supported or are slow on NPU.
#       How:
#           - Add npu_top_k_top_p to the 'apply_sampling_constraints' func.
#           - Add custom triton kernels to `expand_batch_to_tokens` and `rejection_sample`.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/874
#           https://github.com/vllm-project/vllm/pull/4849
#       Future Plan:
#           1. Make these functions class methods of RejectionSampler, create an
#              AscendRejectionSampler to override them, then delete the patch file
#              `worker/patch_rejection_sampler.py`.
#           2. Make these functions custom ops, then remove AscendRejectionSampler.
#
# ** 11. File: worker/patch_qwen3_next.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet.forward`
#       Why:
#           The Qwen3Next GatedDeltaNet forward cannot directly add custom operators.
#       How:
#           Add a branch in Qwen3NextGatedDeltaNet.forward to adapt to fused_qkvzba_split_reshape_cat.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/30863
#       Future Plan:
#           Remove this patch when vLLM supports these operators.
#
# ** 12. File: worker/patch_qwen3_next.py **
#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet._forward_core`
#       Why:
#           The triton ops fused_recurrent_gated_delta_rule and fused_gdn_gating in vLLM do not
#           perform well on NPU.
#       How:
#           Add new fused triton ops in vLLM with an Ascend implementation.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/30860
#       Future Plan:
#           Remove this patch when vLLM supports these operators.
#
#    2. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet._forward_core`
#       Why:
#           The Qwen3Next GatedDeltaNet _forward_core cannot directly add custom operators.
#       How:
#           Add a branch in Qwen3NextGatedDeltaNet._forward_core to adapt to fused_gdn_gating_patch.
#       Related PR (if no, explain why):
#           https://github.com/vllm-project/vllm/pull/31002
#       Future Plan:
#           Remove this patch when vLLM supports these operators.
#