Commit Graph

8 Commits

Author SHA1 Message Date
zhangxinyuehfad
f7b904641e [Main2Main] Upgrade vllm commit to 0109 (#5752)
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules` due to
https://github.com/vllm-project/vllm/pull/31786
2. fix the spec_decode e2e test broken by
https://github.com/vllm-project/vllm/pull/29821
3. fix `vllm.v1.attention.backends.utils` due to
https://github.com/vllm-project/vllm/pull/31891
4. keep `self.seq_lens - query_lens` on the same device due to
https://github.com/vllm-project/vllm/pull/31773 (see the sketch after this list)
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`
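
For item 4, a minimal sketch of keeping both tensors on the same device before the subtraction; the helper name is illustrative, not the actual vllm-ascend code:

```python
import torch

def compute_context_lens(seq_lens: torch.Tensor,
                         query_lens: torch.Tensor) -> torch.Tensor:
    # Move query_lens onto seq_lens' device so the subtraction does not
    # raise a cross-device error on NPU.
    query_lens = query_lens.to(seq_lens.device)
    return seq_lens - query_lens
```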

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-13 19:14:43 +08:00
zzhxxx
64d29875f9 [Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698)
### What this PR does / why we need it?
Based on the Sharded-CP feature.
PR: https://github.com/vllm-project/vllm-ascend/pull/4702
RFC: https://github.com/vllm-project/vllm/issues/30055

This PR officially integrates Deepseek V3.2's DSA-CP support on the
basis of https://github.com/vllm-project/vllm-ascend/pull/4702,
improving inference efficiency and scalability under mixed
prefill-decode workloads. The main improvements include:
- Replace the implementations of o_proj, q_b_proj, and kv_b_proj with
custom_op for TP=1 (see the sketch below).
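
A minimal sketch of the custom_op pattern, assuming vLLM's `CustomOp` dispatch where `forward_oot` is the out-of-tree (NPU) path; the op name and wrapper below are illustrative, not the actual vllm-ascend implementation:

```python
import torch
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("ascend_o_proj")  # illustrative op name
class AscendOProj(CustomOp):
    """Schematic wrapper; the real op handles sharded-CP gathering."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        self.weight = weight

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference path: a plain projection.
        return x @ self.weight.t()

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # Out-of-tree (NPU) path: this is where a sharded-CP-aware
        # kernel would be dispatched; kept identical here for brevity.
        return x @ self.weight.t()
```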

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2026-01-09 15:58:40 +08:00
weiguihua2
15d73f248e [refactor] refactor model runner capture model (#5230)
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the `capture_model` method duplicates
the vLLM code. Reusing the vLLM method directly reduces the maintenance
cost of the vllm-ascend code. The changes are as follows (see the sketch
after this list):
1. Refactor the `capture_model` function to directly inherit the community method.
2. Refactor the `initialize_aclgraph_capture` function and move it into
`initialize_attn_backend`.
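
A minimal sketch of the reuse pattern, assuming the runner can inherit from vLLM's `GPUModelRunner`; the actual vllm-ascend base class and method signatures may differ:

```python
from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class NPUModelRunner(GPUModelRunner):
    def capture_model(self):
        # Reuse the community capture logic directly instead of keeping
        # a near-identical copy in vllm-ascend.
        return super().capture_model()

    def initialize_attn_backend(self, kv_cache_config):
        super().initialize_attn_backend(kv_cache_config)
        # The former initialize_aclgraph_capture setup now runs here,
        # alongside attention backend initialization.
```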

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-30 08:32:14 +08:00
weijinqian0
dbe4c338f2 [Refactor] cache cos/sin in mla & remove parameter model in builder. (#5277)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629

1. Cache cos/sin in MLA (see the sketch after this list).
2. AttentionBuilder now inherits from the original vLLM class.
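
A minimal sketch of the cos/sin caching idea, assuming the cache is built once up to a maximum position and only indexed afterwards; the names here are illustrative, not the exact vllm-ascend code:

```python
import torch


class CosSinCache:
    def __init__(self, rotary_dim: int, max_position: int, base: float = 10000.0):
        inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
        t = torch.arange(max_position).float()
        freqs = torch.outer(t, inv_freq)
        # Precompute once so each decode step only indexes the cache,
        # avoiding per-step trig computation.
        self.cos = freqs.cos()
        self.sin = freqs.sin()

    def get(self, positions: torch.Tensor):
        return self.cos[positions], self.sin[positions]
```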



- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-28 10:35:07 +08:00
meihanc
592cfb6a6f [CI] Add Triton Ascend in CI (#4921)
Add triton-ascend to the UT and e2e CI jobs.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2025-12-23 12:47:35 +08:00
Yizhou
5b179c53f1 [FEAT] Support DeepSeek-V3.2 with FULL_DECODE_ONLY mode (#4706)
### What this PR does / why we need it?
The first commit supports `FULL_DECODE_ONLY`:
- Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for
slicing slots and positions, ensuring fixed tensor shapes.
- Implement padding logic for `query_start_loc` in `NPUModelRunner` to
support uniform decode in full graph mode, aligning with GPU runner
behavior (see the sketch after this list).
- Adjust MLA cosine cache allocation to occur independently of graph
mode and switch to using device-resident sequence lengths for attention
metadata.
- Remove redundant slicing of hidden states and outputs in
`AscendSFAImpl` and optimize `sin`/`cos` cache updates.
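
A minimal sketch of the `query_start_loc` padding for uniform decode in full graph mode; the padding rule and helper name below are assumptions, not the actual runner code:

```python
import torch


def pad_query_start_loc(query_start_loc: torch.Tensor,
                        num_reqs: int,
                        max_num_reqs: int) -> torch.Tensor:
    # Captured graphs need fixed shapes, so pad to max_num_reqs + 1 and
    # repeat the last real offset so padded slots describe empty requests.
    padded = torch.empty(max_num_reqs + 1,
                         dtype=query_start_loc.dtype,
                         device=query_start_loc.device)
    padded[:num_reqs + 1] = query_start_loc[:num_reqs + 1]
    padded[num_reqs + 1:] = query_start_loc[num_reqs]
    return padded
```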

The second commit takes MTP into account, applying the same set of
changes listed above to the MTP path.

The rest of the commits are bugfixes.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases needed.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-10 20:11:09 +08:00
wangxiyuan
2938bd5ad2 remove get_metadata_cls (#4087)
Remove `get_metadata_cls`. It was only used by the V0 engine and has already been removed from vLLM.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-19 14:58:17 +08:00
1Fire4
0b9b6d79fe [Feat][UT] Support Deepseekv32 FULL_DECODE_ONLY mode and add unit test of sfa_v1 (#3763)
### What this PR does / why we need it?
- Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode.
- Add unit test for sfa_v1.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
2025-11-03 10:02:47 +08:00