**Background:**
There are two rules for operator registration in PyTorch:
- A namespace can be registered only once with `TORCH_LIBRARY`.
- An operator signature can be registered only once with `def`.

All custom operators defined in this repo are used only by Ascend. They are
not implementations of a common operator schema defined by vLLM that every
accelerator follows and implements for its own hardware (an arrangement that
would favor functional abstraction).
Therefore, we rename the operator registration namespace to an
Ascend-specific namespace (**_C_ascend**).
Related issue: https://github.com/vllm-project/vllm-ascend/issues/2742
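
The two registration rules can be illustrated with a toy registry. This is only a sketch of the rules in plain Python, not PyTorch's real dispatcher: the `ToyDispatcher` class, the generic `_C` namespace and the schema string are hypothetical names chosen for illustration.

```python
class ToyDispatcher:
    """Toy model of the two registration rules; not PyTorch's real dispatcher."""

    def __init__(self):
        self._namespaces = {}  # namespace -> {op_name: schema}

    def library(self, namespace):
        # Rule 1: a namespace may be claimed by only one library definition.
        if namespace in self._namespaces:
            raise RuntimeError(f"namespace {namespace!r} is already registered")
        self._namespaces[namespace] = {}

    def define(self, namespace, schema):
        # Rule 2: an operator signature may be defined only once.
        op_name = schema.split("(", 1)[0].strip()
        ops = self._namespaces[namespace]
        if op_name in ops:
            raise RuntimeError(f"{namespace}::{op_name} is already defined")
        ops[op_name] = schema


dispatcher = ToyDispatcher()

# Assumption: another package has already claimed a generic namespace.
dispatcher.library("_C")
dispatcher.define("_C", "my_op(Tensor x) -> Tensor")  # hypothetical schema

# A second package that reuses the same namespace collides (rule 1).
try:
    dispatcher.library("_C")
    collided = False
except RuntimeError:
    collided = True

# A dedicated namespace registers cleanly, even with the same operator name.
dispatcher.library("_C_ascend")
dispatcher.define("_C_ascend", "my_op(Tensor x) -> Tensor")
```

With a dedicated namespace, neither rule can be violated by operators that another package registers under the generic name.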
- vLLM version: main
- vLLM main:
f592b3174b
Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
Clean up unused files in the patch module. The LoRA support list can be
updated directly in vLLM Ascend, so there is no need to patch vLLM.
- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix some CI issues and refactor the model runner.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
4d9c61993a
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Add two custom operators (`sgmv_shrink` and `sgmv_expand`) to address the
performance issues of LoRA. Also enable the LoRA operators to run in ACL
graph mode, which improves model inference performance.
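
As a rough reference for what a shrink/expand pair computes, here is a plain-Python sketch. It assumes the usual LoRA meaning of the names: "shrink" projects an activation down to the LoRA rank, "expand" projects it back up, and the segmented variant applies one adapter to each contiguous run of tokens. The function name, argument layout and `segments` format are assumptions for illustration and do not reflect the signatures of the Ascend kernels.

```python
def matvec(matrix, vector):
    """Multiply a [rows, cols] nested-list matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sgmv_shrink_expand(x, lora_a, lora_b, segments, scale=1.0):
    """Reference LoRA delta for tokens grouped into segments.

    x        : [num_tokens, in_dim] token activations
    lora_a   : [num_loras, rank, in_dim] shrink weights (assumed layout)
    lora_b   : [num_loras, out_dim, rank] expand weights (assumed layout)
    segments : list of (start, end, lora_id); tokens in [start, end)
               share one adapter
    """
    out_dim = len(lora_b[0])
    y = [[0.0] * out_dim for _ in x]
    for start, end, lora_id in segments:
        for t in range(start, end):
            low_rank = matvec(lora_a[lora_id], x[t])   # shrink: in_dim -> rank
            delta = matvec(lora_b[lora_id], low_rank)  # expand: rank -> out_dim
            y[t] = [scale * d for d in delta]
    return y


# Three tokens, two adapters with rank 1, in_dim 2, out_dim 3.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lora_a = [[[1.0, 0.0]], [[0.0, 1.0]]]
lora_b = [[[1.0], [2.0], [3.0]], [[1.0], [1.0], [1.0]]]
segments = [(0, 2, 0), (2, 3, 1)]  # tokens 0-1 use adapter 0, token 2 uses adapter 1

y = sgmv_shrink_expand(x, lora_a, lora_b, segments)
```

A dedicated kernel avoids the per-token Python loop shown here; the sketch only fixes the arithmetic that such a kernel is expected to reproduce.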
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
Tested with the Qwen2.5 7B model on vllm-ascend v0.9.2.rc1: in ACL graph
mode, TTFT, TPOT and throughput improved by about 100%.
Signed-off-by: liuchn <909698896@qq.com>
- vLLM version: v0.10.0
- vLLM main:
1f83e7d849
---------
Signed-off-by: liuchn <909698896@qq.com>
Co-authored-by: liuchn <909698896@qq.com>
### What this PR does / why we need it?
This PR fixes a tensor shape mismatch in `add_lora_logits`.
Previously, `lora_a_stacked` was passed as shape `[num_loras, in_dim,
rank]`, which does not match the expected einsum pattern `"bi, boi ->
bo"` used in `bgmv_shrink`.
This causes runtime errors such as:
`RuntimeError: einsum(): subscript i has size 3 for operand 1 which does not broadcast with previously seen size 4`

This fix transposes `lora_a_stacked` and `lora_b_stacked` to match the
expected shapes:
- `lora_a`: `[num_loras, rank, in_dim]`
- `lora_b`: `[num_loras, out_dim, rank]`
All unit tests pass after this fix.
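
The mismatch can be reproduced without torch. The sketch below spells out the `"bi, boi -> bo"` contraction with nested lists; the helper names (`shape`, `bgmv_shrink_ref`, `transpose_last_two`) are hypothetical and only mirror the einsum pattern quoted above, not the actual implementation.

```python
def shape(t):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def bgmv_shrink_ref(x, weights, indices):
    """Plain-Python form of einsum("bi, boi -> bo") on weights[indices]."""
    out = []
    for b, x_b in enumerate(x):
        w_b = weights[indices[b]]  # [o, i] weight slice for token b
        for row in w_b:
            if len(row) != len(x_b):
                raise ValueError(
                    f"subscript i mismatch: {len(row)} vs {len(x_b)}")
        out.append([sum(w * v for w, v in zip(row, x_b)) for row in w_b])
    return out


def transpose_last_two(weights):
    """Swap the last two dims of a [n, p, q] nested list, giving [n, q, p]."""
    return [[list(col) for col in zip(*mat)] for mat in weights]


# in_dim = 4, rank = 3, one adapter, two tokens.
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
indices = [0, 0]
lora_a_stacked = [[[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]]]  # [num_loras, in_dim, rank]

# Passing [num_loras, in_dim, rank] fails: i is 3 in the weights but 4 in x.
try:
    bgmv_shrink_ref(x, lora_a_stacked, indices)
    failed = False
except ValueError:
    failed = True

# After the transpose the layout is [num_loras, rank, in_dim] and it works.
lora_a = transpose_last_two(lora_a_stacked)
shrunk = bgmv_shrink_ref(x, lora_a, indices)
```

Here the size clash (3 versus 4) matches the sizes reported in the error message, and the transposed weights yield one rank-sized vector per token.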
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
import pytest
import torch
from unittest.mock import PropertyMock, patch

from vllm_ascend.lora.punica_wrapper.punica_npu import PunicaWrapperNPU


@pytest.fixture
def wrapper_cpu():
    # Create the wrapper on the CPU device.
    cfg = {"max_num_batched_tokens": 10, "max_batches": 2, "device": "cpu"}
    w = PunicaWrapperNPU(**cfg)
    w.is_prefill = True
    w.no_lora = False
    return w


def test_add_lora_logits(wrapper_cpu):
    batch_size = 2
    hidden_size = 4
    lora_rank = 3
    vocab_size = 5
    num_loras = 1

    y = torch.zeros(batch_size, vocab_size)
    x = torch.randn(batch_size, hidden_size)
    # Stacked weights in the layout handed to add_lora_logits.
    lora_a = torch.randn(num_loras, hidden_size, lora_rank)
    lora_b = torch.randn(num_loras, lora_rank, vocab_size)

    # Make every token select LoRA slot 0.
    with patch.object(wrapper_cpu.__class__, "sampler_indices",
                      new_callable=PropertyMock) as mock_idx:
        mock_idx.return_value = torch.zeros(batch_size, dtype=torch.long)
        wrapper_cpu.add_lora_logits(y, x, lora_a, lora_b, scale=1.0)

    # y is updated in place, so it must keep its shape and stop being zero.
    assert y.shape == (batch_size, vocab_size)
    assert not torch.allclose(y, torch.zeros_like(y))
```
Signed-off-by: hongfugui <hongfugui_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Add two custom kernels (`bgmv_shrink` and `bgmv_expand`) to address the
performance issues of LoRA.
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
We added unit test files for the custom Ascend C kernels. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_shrink.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
Tested with the Qwen2.5 7B model on vllm-ascend v0.9.2.rc1: TTFT, TPOT
and throughput improved by about 70%.
- vLLM version: v0.9.2
- vLLM main:
40d86ee412
---------
Signed-off-by: taoxudonghaha <justsheldon@163.com>