Commit Graph

156 Commits

Author SHA1 Message Date
Zhu Yi Lin
538dd357e6 Add graph mode and improve on multi_npu_moge.md (#1849)
### What this PR does / why we need it?
Add graph mode and improve on multi_npu_moge.md

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
CI passed with existing tests.

- vLLM version: v0.9.2
- vLLM main: 5a7fb3ab9e

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-07-17 17:53:37 +08:00
wangxiyuan
eb921d2b6f [Doc] Fix 404 error (#1797)
Fix URL 404 errors in the docs.
- vLLM version: v0.9.2
- vLLM main: 9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:38 +08:00
Li Wang
afcfe91dfa [Doc] Fix multi node doc (#1783)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Pin docker image to latest release
### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main: 1e9438e0b0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-14 17:56:57 +08:00
wangxiyuan
3c404de1b1 [Release]Update release note (#1753)
There are still issues with pipeline parallelism (PP) in some cases, such as
with aclgraph and Ray. Remove the related doc from the release note.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:58:26 +08:00
wangxiyuan
b5b7e0ecc7 [Doc] Add qwen3 embedding 8b guide (#1734)
1. Add the tutorial for qwen3-embedding-8b
2. Remove `VLLM_USE_V1=1` from the docs; it is no longer needed as of 0.9.2

- vLLM version: v0.9.2
- vLLM main: 5923ab9524

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:40:17 +08:00
wangxiyuan
9c560b009a [Release] Add 0.9.2rc1 release note (#1725)
Add the release note for 0.9.2rc1; we'll release soon.

- vLLM version: v0.9.2
- vLLM main: 7bd4c37ae7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:36:05 +08:00
wangxiyuan
3d1e6a5929 [Doc] Update user doc index (#1581)
Add a user doc index to make the user guide clearer.
- vLLM version: v0.9.1
- vLLM main: 49e8c7ea25

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 14:26:59 +08:00
Li Wang
c7446438a9 [1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow the vllm-project/vllm linting approach:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid low-level errors as much as possible.

This PR is one step of #1241. Its purpose is to make the linting system
clearer and more convenient. This step mainly adds the following hooks:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO:
- clang-format (check csrc with Google style): needs code cleanup, disabled for now
- pymarkdown: needs code cleanup, disabled for now
- shellcheck: needs code cleanup, disabled for now

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new and existing tests.

Co-authored-by: Yikun <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
- vLLM version: v0.9.1
- vLLM main: 5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
Yikun Jiang
997f156a51 Use ci_vllm_version when recording vLLM commit (#1689)
### What this PR does / why we need it?
Use ci_vllm_version when recording vllm commit

Follow-up to https://github.com/vllm-project/vllm-ascend/pull/1623

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Tested manually:
$ python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"'
v0.9.2
- Test on my local repo: https://github.com/Yikun/vllm-ascend/pull/35

- vLLM version: v0.9.1
- vLLM main: 49e8c7ea25

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-10 11:07:27 +08:00
Li Wang
0c4aa2b4f1 [Doc] Add multi node data parallel doc (#1685)
### What this PR does / why we need it?
Add multi-node data parallel doc
### Does this PR introduce _any_ user-facing change?
Add multi-node data parallel doc
### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main: 805d62ca88

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 09:36:37 +08:00
leo-pony
b4b19ea588 [Doc] Add multi-npu qwen3-MoE-32B Tutorials (#1419)
### What this PR does / why we need it?
Add multi-npu qwen3-MoE-32B Tutorials
Related RFC: https://github.com/vllm-project/vllm-ascend/issues/1248
- vLLM version: v0.9.1
- vLLM main: 5358cce5ff

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-10 09:06:51 +08:00
wangxiyuan
830332ebfc Clean up v0.9.1 code (#1672)
vLLM has released 0.9.2. This PR drops 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main: b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
Yikun Jiang
e4e9ea02ab Upgrade vLLM version to v0.9.2 (#1652)
### What this PR does / why we need it?

This patch upgrades vLLM to v0.9.2. It does not remove the v0.9.1
compatibility code yet, to ease review.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main: 14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-08 14:18:17 +08:00
Yikun Jiang
0c1d239df4 Add unit test local cpu guide and enable base testcase (#1566)
### What this PR does / why we need it?
Use the base test and clean up all manual patch code
- Clean up EPLB config to avoid a temporary test file
- Use BaseTest with global cache
- Add license
- Add a doc on setting up unit tests in a local environment

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 10:42:27 +08:00
Angazenn
a5f33590d3 [CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allowed DeepSeek to use torchair. Now it
can also be used with Pangu. In addition, we add a supported-model list to
control which types of models can use torchair.
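
A supported-model list like the one described above can be pictured as a simple allow-list check. This is a hypothetical Python sketch, not the vllm-ascend code: `TORCHAIR_SUPPORTED_MODEL_PREFIXES`, `torchair_supported` and `check_torchair` are invented names, and matching on a lowercase model-type prefix is an assumption.

```python
# Hypothetical allow-list; the real list in vllm-ascend may use different names and entries.
TORCHAIR_SUPPORTED_MODEL_PREFIXES = ("deepseek", "pangu")


def torchair_supported(model_type: str) -> bool:
    """Return True if `model_type` matches an entry on the (hypothetical) allow-list."""
    normalized = model_type.lower()
    return any(normalized.startswith(prefix) for prefix in TORCHAIR_SUPPORTED_MODEL_PREFIXES)


def check_torchair(model_type: str, torchair_enabled: bool) -> None:
    """Reject torchair graph mode for model types that are not on the allow-list."""
    if torchair_enabled and not torchair_supported(model_type):
        raise RuntimeError(
            f"torchair graph mode is not supported for model type {model_type!r}"
        )


check_torchair("deepseek_v3", torchair_enabled=True)  # allowed
check_torchair("qwen3", torchair_enabled=False)       # torchair off, so nothing to reject
```

A check like this fails at configuration time rather than partway through graph compilation, which is one plausible reason to keep such a list.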

### How was this patch tested?
We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
yupeng
d96da1f00c [DOC] Fix word spelling (#1595)
### What this PR does / why we need it?
Fix word spelling in DOC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 21:42:39 +08:00
yupeng
c3c8c9317c [DOC] add LoRA user guide (#1265)
### What this PR does / why we need it?
Add LoRA user guide to DOC. The content refers to [LoRA
Adapters](https://docs.vllm.ai/en/latest/features/lora.html).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 14:41:31 +08:00
leo-pony
53ec583bbb [Docs] Update Atlas 300I series doc and fix CI lint (#1537)
### What this PR does / why we need it?
- Update Atlas 300I series doc: clean up unused parameters and enable
optimized ops
- Fix code spell CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 23:34:00 +08:00
Shanshan Shen
ba577dfc52 [Doc] Add Structured Output guide (#1499)
### What this PR does / why we need it?
Add Structured Output guide.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:21:44 +08:00
Yikun Jiang
e4df0a4395 Add Pangu MoE Pro for 300I series docs (#1516)
### What this PR does / why we need it?
Add Pangu MoE Pro for 300I series docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 13:37:22 +08:00
Yikun Jiang
cad4c693c6 Add Pangu MoE Pro docs (#1512)
### What this PR does / why we need it?
This PR adds Pangu MoE Pro 72B docs

[1] https://gitcode.com/ascend-tribe/pangu-pro-moe-model

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 12:15:33 +08:00
Zhu Yi Lin
b308a7a258 support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
Support Pangu MoE w8a8c8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with newly added tests.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Shanshan Shen
99e685532d [Doc] Add Qwen2.5-VL eager mode doc (#1394)
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-28 09:08:51 +08:00
Shanshan Shen
3687676fa7 [Doc] Add guidance on how to implement and register new models (#1426)
### What this PR does / why we need it?
Add guidance on how to implement and register new models.

Modified based on PR
https://github.com/vllm-project/vllm-ascend/pull/1126, thanks for the
contribution of @linfeng-yuan.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-27 16:46:49 +08:00
Zesheng Zong
192dbbcc6e Optimize Patch developer guide (#1452)
### What this PR does / why we need it?
Fix some terms in the user guide.


Signed-off-by: zeshengzong <zesheng.zong@outlook.com>
2025-06-26 19:10:16 +08:00
Shanshan Shen
4e2daf5ab7 [Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
leo-pony
1025344912 Doc Enhancement: Single NPU(Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc enhancement: single NPU (Qwen3-8B) aclgraph mode + eager mode.
Related RFC: https://github.com/vllm-project/vllm-ascend/issues/1248

### Does this PR introduce _any_ user-facing change?
No changes.


### How was this patch tested?
Preview

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-06-26 16:52:54 +08:00
wangxiyuan
205cb85a1e [Doc] Fix doc typo (#1424)
1. Fix the typo
2. Fix 404 url
3. Update graph mode and additional config user guides

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 19:28:26 +08:00
Li Wang
15df8be937 [Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep mode doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
wangxiyuan
e4e0b7af05 [Doc] Add patch doc (#1414)
1. Format the developer guide content to make it clearer
2. Add the patch doc for developer guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 12:00:45 +08:00
Mengqing Cao
c1c5d56255 [Doc] Update FAQ and add test guidance (#1360)
### What this PR does / why we need it?
- Add test guidance
- Add reduce layer guidance
- Update FAQ on deterministic calculation

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-25 09:59:23 +08:00
Yikun Jiang
917c6b71af [TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix [doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- Add system package installation
- Add a doc for running doctests
- Cleanup all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change the scheduled job interval from 4 to 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
Icey
08cfc7cb4b Modify installation.md for adding pip extra index of torch-npu (#1272)
### What this PR does / why we need it?
Modify installation.md to add the pip extra index for torch-npu

### How was this patch tested?
No need

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-06-23 15:37:50 +08:00
weiguihua2
e1123172d1 [Doc] Add reinstall instructions doc (#1303)
Add a new FAQ: if users reinstall vllm-ascend with pip, the `build`
folder should be removed first.
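
The cleanup step in this FAQ can be sketched in a few lines of standard-library Python. This is an illustration only: `remove_stale_build` is a hypothetical helper, and the demo runs against a throwaway temporary directory standing in for a vllm-ascend checkout.

```python
import shutil
import tempfile
from pathlib import Path


def remove_stale_build(repo_root: str) -> bool:
    """Delete `<repo_root>/build` if it exists; return True if something was removed."""
    build_dir = Path(repo_root) / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        return True
    return False


# Demo on a temporary directory that mimics a checkout with a stale build/ folder.
with tempfile.TemporaryDirectory() as demo_root:
    (Path(demo_root) / "build" / "lib").mkdir(parents=True)
    first = remove_stale_build(demo_root)   # True: stale build/ removed
    second = remove_stale_build(demo_root)  # False: nothing left to remove

print(first, second)  # True False
```

After the cleanup, reinstall from the checkout with your usual source-install command (for example `pip install -v -e .`).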

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-06-23 14:06:27 +08:00
Pleaplusone
7e6efbf2a9 update torch-npu to 2.5.1.post1.dev20250619 (#1347)
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version,
2.5.1.post1.dev20250619.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests will validate the update.

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-23 09:02:09 +08:00
xleoken
4447e53d7a [Doc] Change not to no in faqs.md (#1357)
### What this PR does / why we need it?

Change not to no in faqs.md.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test

Signed-off-by: xleoken <xleoken@163.com>
2025-06-23 09:01:00 +08:00
Yikun Jiang
2e5f312530 Clean up unused doc (#1352)
### What this PR does / why we need it?
Clean up the unused doc for the MoGE model; we will add it back when the MoGE
model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 15:05:30 +08:00
Yikun Jiang
c30ddb8331 Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
wangxiyuan
45be1aac0c [CI] Add codespell check for doc (#1314)
Add codespell check test for doc only PR

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 16:48:14 +08:00
22dimensions
761bd3d9d7 Add user guide for quantization (#1206)
### What this PR does / why we need it?

Add user guide for quantization

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-20 15:53:25 +08:00
Yikun Jiang
05dec7eda9 [Doc] Refactor and init user story page (#1224)
### What this PR does / why we need it?
This PR refactors the user stories page:
- Move it to community
- Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo,
GPUStack, verl
- Add a new page for LLaMA-Factory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview locally

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-17 09:36:35 +08:00
Yikun Jiang
9d3cbc0953 [Doctest] add installation doctest (#1179)
### What this PR does / why we need it?
Add installation doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Related: https://github.com/vllm-project/vllm-ascend/pull/983

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-06-17 08:52:26 +08:00
Mengqing Cao
96fa7ff63b [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix rank set in DP scenario. The new PoC version of torch-npu supports
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across DP by hand in the patched `_init_data_parallel`.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232
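
The hand-computed local rank that item 1 removes can be pictured roughly as each DP rank on a node taking a contiguous slice of the node's devices. This is a hypothetical Python illustration: `devices_for_dp_rank` is not a vLLM or vllm-ascend function, and the real logic (now delegated to the rank set in `DPEngineCoreProc`) may differ.

```python
from typing import List


def devices_for_dp_rank(local_dp_rank: int, world_size_per_dp: int, devices: List[int]) -> List[int]:
    """Return the device ids one DP rank would own on a node (hypothetical contiguous split)."""
    start = local_dp_rank * world_size_per_dp
    selected = devices[start:start + world_size_per_dp]
    if len(selected) != world_size_per_dp:
        raise ValueError("not enough devices on this node for the requested DP rank")
    return selected


# Example: 8 NPUs on a node, 2 devices per DP rank, so local DP rank 1 gets NPUs 2 and 3.
visible = ",".join(str(d) for d in devices_for_dp_rank(1, 2, list(range(8))))
print(f"ASCEND_RT_VISIBLE_DEVICES={visible}")  # ASCEND_RT_VISIBLE_DEVICES=2,3
```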


### How was this patch tested?
CI passed with newly added tests.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
22dimensions
0d2074a1ec [Doc] fix VLLM_USE_V1 value in graph mode docs (#1226)
`os.environ["VLLM_USE_V1"]` must be assigned a str, not any other type.
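
A minimal standard-library illustration of the rule above: `os.environ` only accepts string values, so assigning an int raises `TypeError` and leaves the previous value in place.

```python
import os

# Values stored in os.environ must be strings.
os.environ["VLLM_USE_V1"] = "1"  # correct: the string "1"

try:
    os.environ["VLLM_USE_V1"] = 1  # type: ignore[assignment]  # wrong: an int is rejected
    int_rejected = False
except TypeError as exc:
    int_rejected = True
    print(f"TypeError: {exc}")

print(os.environ["VLLM_USE_V1"])  # 1
```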


![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798)

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-15 15:41:11 +08:00
fems14
ab5d110fcc vllm-ascend support chunked prefill (#1172)
### What this PR does / why we need it?
vllm-ascend supports chunked prefill for MLA.
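
Chunked prefill splits a long prompt's prefill into several smaller steps, each bounded by a per-step token budget (loosely analogous to vLLM's `max_num_batched_tokens`), so prefill work can be interleaved with other requests. The sketch below only illustrates the chunking idea in plain Python; `split_prefill` and `token_budget` are hypothetical names, and the real scheduler must also carry the KV cache of earlier chunks into later ones.

```python
from typing import List


def split_prefill(prompt_token_ids: List[int], token_budget: int) -> List[List[int]]:
    """Split one prompt's prefill into consecutive chunks of at most `token_budget` tokens."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return [
        prompt_token_ids[start:start + token_budget]
        for start in range(0, len(prompt_token_ids), token_budget)
    ]


chunks = split_prefill(list(range(10)), token_budget=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```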


---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-06-14 22:31:16 +08:00
Mengqing Cao
a3b5af8307 [CI/UT][Graph] Add ut for torchair graph mode (#1103)
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3

### How was this patch tested?
CI passed with newly added tests.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-06-14 16:59:00 +08:00
Yikun Jiang
94a52cf577 Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203)
### What this PR does / why we need it?

Add @jianzs as vLLM Ascend maintainer

@jianzs
----
I would like to nominate Shoujian Zheng (@jianzs
<https://github.com/jianzs>) as a maintainer, starting with my +1.

- He focuses on code quality and good design, with solid reviews in the P/D
disaggregation and DeepSeek improvement areas (30+ high-quality reviews), such
as #issuecomment-2811764833, #discussion_r2069927605 and
#pullrequestreview-2820996674. This is the most important reason I nominated
him: helping community developers complete PRs with high quality and
continuously ensuring the quality of the codebase is one of the important
responsibilities of a maintainer. We believe he is a great addition.
- Shoujian's main expertise is distributed inference. He has a lot of production
experience in AI infrastructure. He has very good habits, explains all changes
in great detail (#issue-3023082580) and shares results openly
(#issuecomment-2853140443). High-quality PRs: #706, #774, #852.
- Community involvement: he is actively involved in community discussions, is
collaborative, helps users solve problems, and has taken part in 30+ PRs and
issues, such as #issuecomment-2911934292 and #issuecomment-2833523571.

Reference:
[1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html
[2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-13 18:25:50 +08:00
sdmyzlp
e72f94e38f Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we keep them in
the main stream instead, only requiring that they be computed before
`matmul W_UQ` to avoid hindering later overlapping. The problem may be
solved by a later optimization (#993), which hoists the computation of
`cos` and `sin` up to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults
to False.
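
A sketch of turning the option on through vllm-ascend's `additional_config`. The `enable_multistream_mla` key comes from this PR; the `enabled` key for switching on torchair graph mode is an assumption to verify against the additional-config guide for your vllm-ascend version, and the `LLM(...)` call is left commented out because it needs an Ascend host.

```python
# Assumption: "enabled" switches on torchair graph mode; verify for your vllm-ascend version.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,
        "enable_multistream_mla": True,  # defaults to False according to this PR
    },
}

# On an Ascend host with vllm and vllm-ascend installed (not executed here):
#   from vllm import LLM
#   llm = LLM(model="<your DeepSeek model>", additional_config=additional_config)
print(additional_config)
```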

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2 (DeepSeek-V2).

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-12 21:42:09 +08:00
Wan_Danfeng
55c0e68883 [Doc] Add Referer header for CANN package download url. (#1192)
### What this PR does / why we need it?
Fix the CANN download URL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran the **wget** command; the CANN package downloads correctly.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-06-12 21:22:23 +08:00
chenwaner
e46dc142bf Enable kvcache_nz for the decode process in torchair graph mode (#1098)
### What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which
reduces the time consumed by FA for long sequences.

### Does this PR introduce _any_ user-facing change?
To enable kvcache_nz, set
`additional_config.torchair_graph_config.enable_kv_nz=True`.
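
The dotted path `additional_config.torchair_graph_config.enable_kv_nz=True` corresponds to a nested dict passed as vLLM's `additional_config` argument. The helper below is a hypothetical sketch that expands dotted keys into that nested form; `set_dotted` is an invented name, and setting `torchair_graph_config.enabled` alongside it is an assumption (kvcache_nz applies in torchair graph mode).

```python
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Set `value` at a dotted path such as "a.b.c", creating nested dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


additional_config: Dict[str, Any] = {}
set_dotted(additional_config, "torchair_graph_config.enabled", True)  # assumed key
set_dotted(additional_config, "torchair_graph_config.enable_kv_nz", True)
print(additional_config)
# {'torchair_graph_config': {'enabled': True, 'enable_kv_nz': True}}
```

The resulting dict would then be passed as `LLM(model=..., additional_config=additional_config)`.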

### How was this patch tested?
1. Tested on a DeepSeek model:
with batch size 64 and seq_len 1k+3k, the total FA time across 61 layers
improves from 20.80 ms to 19.76 ms.
2. Operator precision test:

[aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv)
3. TPOT test from @ttanzhiqiang; the result of a single curl request is normal:

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159

https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588
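
The FA timing in item 1 above works out to roughly a 5% reduction; a quick check using only the two numbers reported there:

```python
before_ms = 20.80  # total FA time over 61 layers before enabling kvcache_nz
after_ms = 19.76   # total FA time after enabling kvcache_nz

saved_ms = before_ms - after_ms
relative = saved_ms / before_ms
print(f"saved {saved_ms:.2f} ms ({relative:.1%})")  # saved 1.04 ms (5.0%)
```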

---------

Signed-off-by: chenwaner <861645847@qq.com>
2025-06-11 14:09:28 +08:00