### What this PR does / why we need it?
Currently, the implementation for MLA V1 pads q, k and v to `head_dim` 256
to conform to the early MLA kernel. The new MLA kernel supports a
`head_dim` that is not divisible by 128, so we can remove those
unnecessary paddings to boost performance.
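As an illustration only (the head dim below is an assumption, and this is not the actual kernel call site), the old padded path versus the new one looks roughly like:

```python
import torch
import torch.nn.functional as F

head_dim = 192                      # assumed MLA q/k head dim (128 nope + 64 rope)
q = torch.randn(4, 8, head_dim)

# Old path: pad the last dim up to 256 so the early MLA kernel accepts it.
q_padded = F.pad(q, (0, 256 - head_dim))

# New path: the new MLA kernel accepts head dims not divisible by 128,
# so q is passed with head_dim 192 directly and the padding is removed.
```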
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
Update the attention NZ and MLA NZ modules to improve TPOT by 6 ms:
- Convert W_UV and W_UK_T to NPU format in mla_v1.py.
- Convert layer.weight to NPU format in w8a8.py.
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Implement the KV cache save logic for v1 disaggregated prefill in the
Ascend scheduler.
This PR adds support for saving kv cache in the ascend scheduler, which
is part of the v1 disaggregated prefill design. The load functionality
is not yet implemented.
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
Revert the default value of enable_chunked_prefill to 'False' in
additional_scheduler_config. In engine v1, enable_chunked_prefill is
forcibly set to True in VllmConfig, which causes it to be perceived as
True in check_and_update_config(). As a result, when the v0 scheduler is
enabled, the chunked prefill feature remains active, leading to the
failure of the v0 scheduler and causing it to fall back to the native v1
scheduling logic.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Fix the bug in #703, where vLLM wrongly raised the error `Failed to
import vllm_ascend_C: No module named 'vllm_ascend.vllm_ascend_C'`. The
reporting of `vllm_ascend_C` import failures is now unified as a warning:
`("Failed to import vllm_ascend_C:%s", e)`.
### Does this PR introduce _any_ user-facing change?
No
---------
Signed-off-by: yangpuPKU <604425840@qq.com>
### What this PR does / why we need it?
Add V1 engine LoRA support.
Add LoRA e2e tests on a single card and multiple cards.
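A minimal usage sketch (the model and adapter paths are placeholders, not part of this PR):

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Run a LoRA adapter; paths below are placeholders.
llm = LLM(model="path/to/base-model", enable_lora=True)
outputs = llm.generate(
    "Give me a short introduction to vLLM.",
    lora_request=LoRARequest("my-adapter", 1, "path/to/lora-adapter"),
)
print(outputs[0].outputs[0].text)
```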
### Does this PR introduce _any_ user-facing change?
Support LoRA for the V1 engine.
### How was this patch tested?
CI passed with new added test
---------
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: paulyu <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: jesse <szxfml@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
### What this PR does / why we need it?
Fix the bugs when running DeepSeek models in engine v1.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Set div_mode to False to use the ACLNN kernel, which is crucial when
using ACL Graph.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
- According to https://github.com/vllm-project/vllm-ascend/issues/807,
this PR adds a custom AscendC kernel for multi-step.
- It also fixes a bug we found in multi_step_runner.py when using
multi-step on the V0 engine.
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
We add a unit test file and an offline inference file to test the custom
AscendC kernel; see test/ops/test_multi_step.py and
examples/offline_multi_step.py.
---------
Signed-off-by: wan_danfeng <wonderful199082@126.com>
For online serving, the "ascend" quantization method is not natively
available, so we add "ascend" to the list of quantization methods; users
can then enable quantization with the `vllm serve --quantization ascend`
command.
---------
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
The [vllm commit](67da5720d4) changed the input and rotary position
embedding for Qwen2.5-VL, which breaks CI. This PR is a quick fix for the
Qwen2.5-VL CI failure.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR fixes the CI failure caused by vLLM changes:
1. add moe_config for fused_moe;
2. adjust to the KV cache group change from vLLM. vllm-ascend currently
doesn't support this feature, so this is just a quick fix for backward
compatibility.
Fixes: #872
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. This PR introduces a native `all_to_all` communication operator to fix
`allgather` bugs when dp_size > 1. It also adds a naive force-load-balance
implementation for profile runs.
2. The operator `npu_dequant_swiglu_quant` only supports input
hidden_states with dtype `torch.int32`. This tensor occupies
`global_bs * seq_len * topk * hidden_size` space, which might be very
large as `ep_size` grows. Therefore we disable this operator and use the
original `swiglu` and `quantize`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By performing offline inference:

---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Make sure the PyTorch infer_schema check is patched before the cases that
use fused MoE ops:
1. model registration
2. quantization loading
3. fused MoE UT
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Patch torch.library.infer_schema for torch 2.5 backward compatibility
- Introduced a new module `patch_utils` under
`vllm_ascend/patch/worker/patch_common/`.
- Added a function `ascend_direct_register_custom_op` to handle custom
operator registration with backward compatibility for PyTorch < 2.7
(such as torch 2.5.1).
- Implemented type conversion logic for annotations to ensure
compatibility across different PyTorch versions.
- Registered the function `ascend_direct_register_custom_op` to
`utils.direct_register_custom_op`.
- Updated `__init__.py` to include `patch_utils` as the first import.
- Ensured `patch_utils` is available for use in other patch files and
skipped isort checks for `patch_utils` import.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
MoE support for Llama4 and Mllama4 in vllm-ascend.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
start server:
python -m vllm.entrypoints.openai.api_server --model
/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct \
--max-num-seqs=256 \
--max-model-len=8192 \
--tensor-parallel-size=8 \
--block-size=128 \
--dtype bfloat16 \
--host=0.0.0.0 \
--port=8000 \
--gpu-memory-utilization=0.9 \
--trust-remote-code
client:
python online_server.py --model-path
/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct
--image-path /data/nfs/w60040464/cherry_blossom.jpg --docker-ip
7.242.108.253 --served-port 8000 --text "what is the content of this
image?"
result:
{'id': 'chatcmpl-2b709a5d2e1a4017991ec4ba8248686a', 'object':
'chat.completion', 'created': 1747056823, 'model':
'/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct',
'choices': [{'index': 0, 'message': {'role': 'assistant',
'reasoning_content': None, 'content': 'The image depicts a tower, likely
Tokyo Skytree, framed by branches of a cherry blossom tree. The tower is
white and has a distinctive shape, with a large sphere at the top and a
long, thin spire extending from it. The branches of the cherry blossom
tree are in the foreground, with pink flowers blooming on them. The
background is a clear blue sky.\n\n**Key Features:**\n\n* **Tower:**
White, spherical shape at the top, long thin spire\n', 'tool_calls':
[]}, 'logprobs': None, 'finish_reason': 'length', 'stop_reason': None}],
'usage': {'prompt_tokens': 2340, 'total_tokens': 2440,
'completion_tokens': 100, 'prompt_tokens_details': None},
'prompt_logprobs': None}
Signed-off-by: chenxu <chenxu68@huawei.com>
Co-authored-by: chenxu <chenxu68@huawei.com>
Co-authored-by: evian <eviantai@u.nus.edu>
### What this PR does / why we need it?
Fix the method of importing environment variables in the DeepSeek model
to support successful compilation via aclgraph.
Signed-off-by: rjg-lyh <1318825571@qq.com>
1. Fix format check error to make format.sh work
2. Add codespell check CI
3. Add the missing required package for vllm-ascend.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
This PR fixes two bugs in AscendScheduler:
1. When running with high concurrency, the length of the running queue may
exceed the max_num_seqs limit.
2. When some requests are preempted and recomputation is activated, the
logic for computing new tokens is wrong.
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Add padding for ACL Graph and refactor graph batch size adjustments to
utils.py
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Support the prefix cache and chunked prefill features in v0/v1.
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost
every operator is assigned a new variable name. If we want to save NPU
memory, we manually `del` these variables to end their lifecycle, which
fills the code with `del` statements and looks inelegant.
Therefore, this PR names the output of most operators `hidden_states`,
thereby ending the lifecycle of the previous `hidden_states`.
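A toy sketch of the pattern (not the actual w8a8 `fused_experts` code): each reassignment drops the only reference to the previous tensor, so its memory can be freed without an explicit `del`.

```python
import torch

def mlp_like(hidden_states: torch.Tensor,
             w_up: torch.Tensor,
             w_down: torch.Tensor) -> torch.Tensor:
    # Reusing the name `hidden_states` ends the lifecycle of the previous
    # intermediate tensor at every step, so no `del` statements are needed.
    hidden_states = hidden_states @ w_up
    hidden_states = torch.nn.functional.silu(hidden_states)
    hidden_states = hidden_states @ w_down
    return hidden_states
```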
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Signed-off-by: ApsarasX <apsarax@outlook.com>
This change ensures proper functionality for longer sequences by
correctly invoking the `_set_cos_sin_cache` method with `self` as the
first argument.
For example, with DeepSeek R1, without this change the program crashes
when the input sequence exceeds 4096.
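A toy illustration of the calling convention involved (the class and sizes are made up, not the actual rotary-embedding code):

```python
class Rope:
    def __init__(self) -> None:
        self.max_seq_len_cached = 0

    def _set_cos_sin_cache(self, seq_len: int) -> None:
        self.max_seq_len_cached = seq_len

rope = Rope()
# When calling through the class, the instance must be passed explicitly as
# the first argument; otherwise `seq_len` would be bound to `self`.
Rope._set_cos_sin_cache(rope, 8192)
assert rope.max_seq_len_cached == 8192
```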
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
This PR adds a new capability: `aclgraph_batch_sizes` can be dynamically
adjusted per model. Before this PR, the `aclgraph_batch_sizes` passed from
vLLM to vllm-ascend were always too large, which could result in the error
"The resources are insufficient" when running different models.
Now, with this PR, the code dynamically adjusts `aclgraph_batch_sizes`
depending on the model's number of hidden layers and the parallel config,
for example:
a. for Qwen2.5-7B, the `aclgraph_batch_sizes` list has 33 entries in total;
b. for Qwen2.5-72B, the `aclgraph_batch_sizes` list has 11 entries in total.
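A rough sketch of the idea (the function, budget and selection strategy are assumptions for illustration, not the actual implementation in utils.py):

```python
def adjust_aclgraph_batch_sizes(batch_sizes: list[int],
                                num_hidden_layers: int,
                                graph_budget: int = 1024) -> list[int]:
    # Deeper models keep fewer captured graph sizes so the total number of
    # ACL graphs stays within an assumed resource budget.
    max_num_sizes = max(1, graph_budget // max(num_hidden_layers, 1))
    if len(batch_sizes) <= max_num_sizes:
        return batch_sizes
    # Keep an evenly spaced subset, always retaining the largest batch size.
    step = len(batch_sizes) / max_num_sizes
    picked = {min(int(i * step), len(batch_sizes) - 1)
              for i in range(max_num_sizes)}
    picked.add(len(batch_sizes) - 1)
    return [batch_sizes[i] for i in sorted(picked)]
```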
Signed-off-by: chris668899 <15105191595@126.com>
### What this PR does / why we need it?
Fix output tensor shape in vanilla_chunked_prefill function.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Run offline inference on DeepSeek models.
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
The root cause of the bug is that numerical computations involving NaNs
cannot eliminate them. We addressed it by using `masked_fill_` to
eliminate NaNs while avoiding the memory-wasting `torch.where` approach.
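For illustration, the in-place pattern looks like this (the tensor name is a placeholder):

```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0])

# The out-of-place alternative allocates an extra tensor:
#   x = torch.where(torch.isnan(x), torch.zeros_like(x), x)
# In-place masked_fill_ overwrites the NaNs without that allocation.
x.masked_fill_(torch.isnan(x), 0.0)
```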
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch was tested with vllm v0.8.5 and vllm-ascend master. I ran the
deepseek_v3 model with the offline inference scripts
(examples/dp_offline/run_dp.sh & data_parallel.py).
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Fix a function name typo, changing `mask_fill_` to `masked_fill_`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
- Revert "Re-patch TritonPlaceholder on main to make CI happy (#753)"
because upstream main CI already merged:
https://github.com/vllm-project/vllm/pull/17446
- Keep 0.8.5.post1 compatible
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Optimize NPU memory usage.
https://github.com/vllm-project/vllm-ascend/issues/723
vllm v0.8.4.rc2 and DeepSeek R1 can only support a model length of 16K.
When attempting to run with a model length of 32K, an "Out of Memory"
(OOM) error will occur.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: sunbaosong <13793883820@163.com>
### What this PR does / why we need it?
- This PR proposes a P2P version of Disaggregated Prefill based on
llm_datadist, which manages data transfer.
- This solution reworks the previous offline single-node Disaggregated
Prefill solution, and now supports multi-node and online serving.
- Currently this solution supports the 1P1D setup of DeepSeek hybrid
parallelism (P: TP+EP, D: DP+EP). Note that the xPyD setup is considered
in the solution design and will be supported soon within the v1 engine.
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
1. Improve inference speed and usability for DeepSeek models with NPU
graph mode.
2. Modify some code to adapt to CANN 8.1.RC1.beta1.
3. Add a switch for NPU graph mode and its cache.
### Does this PR introduce _any_ user-facing change?
This PR provides an experimental configuration to enable NPU graph mode
for DeepSeek models. Users can set
`additional_config={'enable_graph_mode': True}` to try this feature. Note
that this feature currently only supports the V0 engine.
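For example (the model path is a placeholder):

```python
from vllm import LLM

# Experimental NPU graph mode for DeepSeek models on the V0 engine.
llm = LLM(
    model="path/to/deepseek-model",
    additional_config={"enable_graph_mode": True},
)
```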
### How was this patch tested?
This patch was tested with the newest torch_npu 2.5.1
(https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1
toolkit&nnal&kernels
(https://www.hiascend.com/developer/download/community/result?module=cann)
released in 25/30 April.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
After discussing the quantization model format with MindStudio, we
decided to support another quant format that may be used in the new
modelslim tool. In that case, `quantization_config` may be removed from
the `config.json` file and `quant_model_description.json` will be used for
the quantization configuration.
### Does this PR introduce _any_ user-facing change?
Yes, using the latest quantization format
### How was this patch tested?
Test locally
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Optimize qwen2_vl and qwen2_5_vl.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Testing this PR on a 1080p picture with tp=1, bs=1 on Qwen2-VL and
Qwen2.5-VL, every FA op's duration drops from 11 ms to 9 ms, giving
roughly a 22% perf boost.
---------
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Platform should only contain the functions that are based on vLLM. This
PR moves the unrelated functions to the right place to make the platform
clearer.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
DeepSeek V3 currently adopts vanilla chunked prefill for the MLA part,
which is inefficient for computation but necessary for chunked prefill.
Since PR https://github.com/vllm-project/vllm-ascend/pull/543 brings the
v0 scheduler into vllm-ascend, we can now adopt
torch_npu._npu_flash_attention inside the MLA backend for a further
performance boost. Some redundant computation inside the RoPE is also
removed. This PR should bring some performance gain for DeepSeek
eager-mode inference.
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Fix #674 to avoid KVCache over-allocation and OOM risks.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Signed-off-by: ApsarasX <apsarax@outlook.com>
Replace the torch function combination with the `reshape_and_cache` fused
op for better performance. The `reshape_and_cache` function wasn't working
because it expected a torch.int32 tensor, but a torch.int64 tensor was
provided.
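A minimal sketch of the dtype requirement (variable names are illustrative):

```python
import torch

# The fused reshape_and_cache op expects int32 slot indices, but the
# surrounding code produced int64, so the indices are cast before the call.
slot_mapping = torch.arange(16, dtype=torch.int64)
slot_mapping = slot_mapping.to(torch.int32)
```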
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
The custom DeepSeek modeling made some changes to support graph mode in
https://github.com/vllm-project/vllm-ascend/pull/585, so I follow it to
change the custom deepseek_mtp modeling.
Some modifications for k>1 were not carried over by
https://github.com/vllm-project/vllm-ascend/pull/429; I add them now.
To better take care of the MTP feature in the vllm-ascend repository, I
added cases related to graph mode (torchair), but I skip them since
torchair cannot correctly clean up memory in VllmRunner.
I also add some cases for MTP quantization weights, but the test weights
are not ready, so I skip them and will enable them once the test quant
weights are ready.
https://github.com/vllm-project/vllm-ascend/pull/648 did not completely
fix the sample change issue
(https://github.com/vllm-project/vllm-ascend/issues/660), so I added the
relevant changes.
### Does this PR introduce _any_ user-facing change?
Now you can use the following method to use MTP with DeepSeek V3/R1 float
or quant weights in eager mode:
```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```
Or use MTP with DeepSeek V3/R1 float or quant weights in graph mode
(torchair):
```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```
Additional notes:
1. We now support k>1, so you can set num_speculative_tokens > 1 if there
is sufficient redundant computing power.
2. MTP is not supported in V1; we will support it when vLLM does in
https://github.com/vllm-project/vllm/issues/13500.
3. If running MTP fails with a `segmentation fault`, you can follow the
v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236, file
`vllm_ascend/patch/patch_metrics.py`, method
`__npu_async_metrics_collector_init__`.
### How was this patch tested?
Local tests passed; also tested by CI.
Signed-off-by: mengwei805 <mengwei25@huawei.com>
Sometimes users install a dev/editable version of vLLM. In this case, we
should make sure vllm-ascend works as well.
This PR adds a new env `VLLM_VERSION`. It is intended for developers who
edit vLLM; in this case, developers should set this env to indicate which
vLLM version is installed and used.
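For example (the version string is illustrative, and it must be set before vllm_ascend is imported):

```python
import os

# Tell vllm-ascend which vLLM version the editable install corresponds to.
os.environ["VLLM_VERSION"] = "0.8.5"
```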
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
`reshape_and_cache_siso` seems to have some functionality issues; use a
torch op combination to replace this custom op by default.
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>