25 Commits

Author SHA1 Message Date
starkwj
389030a8f8 add env vars & misc 2026-02-11 06:27:58 +00:00
starkwj
739d074b0c update other platforms' Dockerfile 2026-01-23 03:24:25 +00:00
starkwj
2a571d8bc8 support multi npu partially 2026-01-09 04:36:39 +00:00
starkwj
135cc0a505 vllm-ascend vnpu v1 2025-12-26 07:37:35 +00:00
Wang Yixuan
d412565ec9 [Cherry-pick]bmm_transpose to v011dev (#3995)
### What this PR does / why we need it?
Add a custom op to acclerater the deepseek model. The fusion ops combine
the bmm and transpose together, which is applied to mla module.
Cherry-pick from this commtid c68ddc11ce53334fc9a17bad58342148cbf14e86

### Does this PR introduce _any_ user-facing change?
No

---------

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-08 19:22:14 +08:00
leo-pony
0cead5c1ee Quality enhancement: Immediately interrupt execution when allocate NPU memory OOM (#3944)
### What this PR does / why we need it?
Protect the scene where the first problem occurs. The execution should
be interrupted when the video memory application fails, rather than
waiting until an illegal address is accessed.


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-04 08:55:22 +08:00
Chen Chen
6b290acfe1 remove redundant params in mla_preprocess kernel (#3530)
### What this PR does / why we need it?

This pull request removes the redundant parameters `gamma1` and `beta1`
(also named `gamma0`/`beta0` in some places) from the `mla_preprocess`
kernel and its calling hierarchy. The changes are consistent across C++
kernel code, bindings, and Python call sites. The parameters were unused
in the lower-level functions, so their removal is a good cleanup.

### Does this PR introduce _any_ user-facing change?

The python interface of the kernel is affected, and the params of
`gamma0` and `beta0` are not needed.

### How was this patch tested?

The unit-test of the kernel is adapted accordingly.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: mojave2 <chenchen145@huawei.com>
2025-10-21 19:20:13 +08:00
Chen Chen
bcc313e8f2 add mla_preprocess kernel (#3226)
### What this PR does / why we need it?

- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?

- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.

### How was this patch tested?

- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.

- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-10-12 07:39:45 +08:00
Jiawei Li
e57cca971c Fix the bugs about operator registration by PyTorch Dispatcher (#2786)
**Background:**

There are two principles about operator registration in PyTorch
- The same namespace can be only registered once by `TORCH_LIBRARY`
- The operator signatures can be only registered once by `def`

Considering that all custom operators defined in the current repo are
only used by Ascend, instead of defining a common operator schema by
vLLM, all accelerators then follow this operator schema and complete the
implementation based on their respective hardware, which is conducive to
functional abstraction.

Therefore, we can rename the operator registration namespace to an
Ascend-specific namespace(**_C_ascend**).

Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742


- vLLM version: main
- vLLM main:
f592b3174b

Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
2025-09-13 11:58:52 +08:00
Jiawei Li
b7ee3fdad3 [Code clean] Remove the unnecessary code (#2815)
**Background:**

A dynamic library named vllm_ascend_C.so will be generated when
compiling vLLM-Ascend, so when import vllm_ascend.vllm_ascend_C in
python, the interperter will search vllm_ascend_C.so and try to find a
external symbol named PyInit_vllm_ascend_C which is provided in
csrc/camem_allocator.cpp.

**Conclusion:**

The PyInit__C is redundent.

- vLLM version: v0.10.1.1
- vLLM main:
717fc00e98

Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
2025-09-10 17:19:39 +08:00
yupeng
9f1e054fe3 [Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672)
### What this PR does / why we need it?
Fix the LoRA accuracy issue that introduced by custom AscendC operator
"bgmv_shrink, sgmv_shrink, bgmv_expand, sgmv_epand".

The bug details are: 
- In the kernel function, if you want to call GlobalTensor.GetSize
method, you have to pass the second parameter of bufferSize when you
call GlobalTensor.SetGlobalBuffer first.
- Or GlobalTensor.GetSize method will return a random value.
- You can refer to [this
doc](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha002/apiref/ascendcopapi/atlasascendc_api_07_00024.html).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

---------

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
2025-09-02 11:46:59 +08:00
liuchenbing
3648d18e67 Add Custom Kernels For LoRA Performance (#2325)
### What this PR does / why we need it?
Add two custom operators (sgmv_shrink and sgmv_expand) to address the
performance issues of LoRA. Meanwhile, enable the graph mode for LoRA
operators to enter ACL, so as to improve the model inference
performance.
### Does this PR introduce _any_ user-facing change?
      no user-facing change
### How was this patch tested?
Based on the actual test of the QWen2.5 7B model using vllm-ascend
version v0.9.2.rc1, in acl graph mode, the TTFT, TPOT and throughput
have increased by about 100%.

Signed-off-by: liuchn <909698896@qq.com>

- vLLM version: v0.10.0
- vLLM main:
1f83e7d849

---------

Signed-off-by: liuchn <909698896@qq.com>
Co-authored-by: liuchn <909698896@qq.com>
2025-08-19 09:09:11 +08:00
Pleaplusone
19fdc9a3f0 [Bugfix] Fix header include issue in rope (#2397)
### What this PR does / why we need it?
vLLM-Ascend's rope implementaion include several header file that are
not supposed to be included by outside users. Current implementation may
break when canntoolkits update, this PR remove those not compatible file
includes to guarantee the safety of upgrading cann toolkits.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by rope unittest

- vLLM version: v0.10.0
- vLLM main:
3e6dd40016

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-18 14:33:38 +08:00
Pleaplusone
c0f0b70813 [core] Support capture custom ops into aclgraph (#2113)
### What this PR does / why we need it?
Thanks to the PR https://github.com/vllm-project/vllm-ascend/pull/426
make vllm-ascend support the aclgraph inference to reduce the host
overhead. However, the capability of aclgraph strongly relies on the
functionality provided by `torch.compile`, which is the key feature
supported in torch 2.x . Therefore, capture custom op into aclgraph is
only possible when it can be recognize and captured by `torch.compile`.

In this PR, we register the meta implementation of current custom ops to
enable the fx graph capture. And by doing that, insert those custom ops
into aclgraph become a natural thing to the ascend runtime.

### Does this PR introduce _any_ user-facing change?
No user face change.

### How was this patch tested?
Tested in unittest, we will integrate the `rotary_embedding` op into a
small custom model and use `torch.compile` and aclgraph to capture and
replay it to verify its functionality.

- vLLM version: v0.10.0
- vLLM main:
1b99028069

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-11 15:59:42 +08:00
taoxudonghaha
540336edc9 Add Custom Kernels For LoRA Performance (#1884)
### What this PR does / why we need it?
Add two custom kernels(bgmv_shrink and bgmv expand) to solve the
performance of LoRA
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
we add Unit Test file to test the custom ascendc kernel. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
Based on the actual test of the QWen2.5 7B model using vllm-ascend
version v0.9.2.rc1, the TTFT, TPOT and throughput have increased by
about 70%.

- vLLM version: v0.9.2
- vLLM main:
40d86ee412

---------

Signed-off-by: taoxudonghaha <justsheldon@163.com>
2025-07-29 19:27:50 +08:00
leo-pony
b5ad70e1a6 [Optimize]Change AI Vector core number getting function to glibc ABI free funcition (#1974)
### What this PR does / why we need it?
Change AI Vector core number getting function to glibc ABI free
function. After this PR merged in, there should been no glibc ABI
problems for bump torch version to 2.7.1.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
f59ec35b7f

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-24 10:00:19 +08:00
Shanshan Shen
8a91e6e59c [Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
Yikun Jiang
097e7149f7 [Platform] Add initial experimental support for Altlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P, this patch squash
below PR into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
User can run vLLM on Altlas 300I DUO series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit test missing because need a real 310P image to have the test,
will add in a separate PR later.
- Manually e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ
format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
ttanzhiqiang
2498d297ae add custom ascendc kernel vocabparallelembedding (#796)
This PR add custom ascendc kernel vocabparallelembedding support in
vllm-ascend, related CMakeLists and setuptools is also added in this PR.

pytest -s benchmarks/ops/ben_vocabparallelembedding.py
pytest -s tests/ops/test_vocabparallelembedding.py

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-12 10:44:33 +08:00
Wan_Danfeng
5cf9ff18e9 [Performance]: Custom AscendC Kernel of Multi-Step Prepare Input (#814)
### What this PR does / why we need it?

- According to https://github.com/vllm-project/vllm-ascend/issues/807,
we pull request for customer ascendc kernel of multi-step.
- also a bug we found in multi_step_runner.py is fixed when we use
multi-step on V0 Engine.


### Does this PR introduce _any_ user-facing change?

no user-facing change


### How was this patch tested?
we add Unit Test file and offline inference file to test the custom
ascendc kernel. See test/ops/test_multi_step.py and
examples/offline_multi_step.py

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-05-20 09:31:30 +08:00
wangxiyuan
0dae55a9a3 [MISC] fix format check error (#654)
This pr makes format.sh works as expect.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 11:14:19 +08:00
Bug Hunter Yan
05bdcbeae4 support aclgraph (#426)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->
This PR supports the access of vllm-acend to the piecewise_graph feature
provided by the v1 engine.

1. register unifiled_ascend_attention_with_output for piecewise_graph to
split graph.
2. support NPUGraph to accelerate kernel launch.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
support npugraph to default, Users can disenable the npugraph feature by
configuring enforce_eager.

This has corresponding requirements for the versions of torch_npu and
CANN, and they need to support graph capture.

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
it turn to default

---------

Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-04-23 20:56:24 +08:00
Shuqiao Li
84563fc65d Add sleep mode feature for Ascend NPU (#513)
### What this PR does / why we need it?
This PR adds sleep mode feature for vllm-ascend, when sleeps, we do
mainly two things:

- offload model weights
- discard kv cache

RLHF tools(such as https://github.com/volcengine/verl and
https://github.com/OpenRLHF/OpenRLHF) have a strong need of sleep mode
to accelerate the training process.

This PR may solve #375 and #320 .

### Does this PR introduce _any_ user-facing change?
No existing user interfaces changed.
Users will have two new methods(`sleep()` and `wake_up()`) to use.

### How was this patch tested?
This PR is tested with Qwen/Qwen2.5-0.5B-Instruct.

At first, we have free NPU memory M1.

After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)`
executed, we have free NPU memory M2. M2 < M1.

Then we call `llm.sleep(level=1)`, we have free NPU memory M3.

We have M3 > M2, M3 is very close to M1.

Plus, we have the same output tokens before sleep and after wake up,
with the config of `SamplingParams(temperature=0, max_tokens=10)` and
with the same input tokens of course.


This PR is utilizing the CMake procedure of #371 , thanks a lot.

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
2025-04-18 13:11:39 +08:00
Pleaplusone
66a0837963 adopt rope in vllm-ascend (#530)
### What this PR does / why we need it?
Adopt custom kernel rotary embedding in actual model inference,
customized rotary_embedding will generate contiguous query and key in
the cpp side to reduce the overhead of two contiguous and index_select
compared with rotary_embedding in torch_npu. For now, rotary_embedding
can only support the scenario of `is_neox = true`, non-neox version rope
will be updated soon in the future.
---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-18 08:56:05 +08:00
Pleaplusone
ce8259975e [core] Support custom ascendc kernels in vllm-ascend (#233)
This PR add custom ascendc kernel rotary_embedding support in
vllm-ascend, related CMakeLists and setuptools is also added in this PR.

Related: https://github.com/vllm-project/vllm-ascend/issues/156

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-03 14:52:34 +08:00