Commit Graph

35 Commits

Author SHA1 Message Date
xuyexiong
79821106e6 [BugFix]Fix mtp torchair bug caused by #2719 (#3566)
### What this PR does / why we need it?
Fix mtp tochair bug cuased by #2719
Since FIA need extra space for padding, we need to enforce
`self.max_num_seqs > self.scheduler_config.max_num_seqs` in KV consumer
+ MTP
This means that, `self.max_num_seqs` **>** the actual maximum requests
(`self.scheduler_config.max_num_seqs`)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-21 22:21:44 +08:00
wangxiyuan
13e8e75143 [Refactor] refactor patch module (#3555)
### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-21 20:19:46 +08:00
Yizhou
274b708e0c [Fix] Refactor dummy attention metadata creation (#3497)
### What this PR does / why we need it?
The `force_attention` parameter is designed for flash infer kernel
warmup, we don't actually need it on Ascend device (at least for
now).And it tends to make things more complicated. So we replace the
`force_attention` parameter with `aclgraph_runtime_mode` in the
attention metadata creation logic.

This change makes the control flow more explicit by directly using the
graph runtime mode to determine how to build attention metadata, rather
than relying on an intermediate boolean flag. This simplification
removes redundant logic and clarifies the conditions for building
attention metadata for full decode graph mode.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
DP + `FULL_DECODE_ONLY` + online serving.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-21 00:00:42 +08:00
xuyexiong
0777e2f899 Optimize torchair kv_consumer padding logic (#3526)
### What this PR does / why we need it?
Optimize torchair kv_consumer padding logic. Only pad when it is spec
decoding

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-18 16:42:17 +08:00
xuyexiong
21769e8f44 [BUGFIX] Mtp torchair pd fix (#3506)
### What this PR does / why we need it?

In memory of https://github.com/vllm-project/vllm-ascend/pull/2610 and
#3449 Fix Mtp torchair pd bug.

In the pd Disaggregation scenario, the first token of the inference
after the d node receives the kv follows the eager mode.

Fixes:
Running with MTP torchair graph mode with Prefilling Decoding
Disaggregation , if all requests processed by the D node are requests
just transmitted from the P node, it will break the torchair graph.

Reason: During PD Disaggregation , the P node only transmits the KV
cache and prompt to the D node, not the actual tokens inferred (neither
the main model tokens nor the MTP tokens are transmitted). Therefore,
the D node will treat this request as one without MTP tokens for
inference (seq_len=1).
The community does not have graph mode issues because the community's
attention has a seq_len=1 for each batch during the decode phase.
We have issues because the graph mode pads according to processing 2
tokens per request. When there are some seq_len=1 and some seq_len=2,
padding is done at the end. If all requests received by the D node are
seq_len=1, padding cannot be performed normally according to the
attention's fia operator constraints.

Solution:

The kv consumer uses extra torchair graph padding to avoid breaking FIA
graph constrains (The one this PR implemented).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-17 21:57:05 +08:00
xuyexiong
30e3d86b0f Revert "[BUGFIX] Mtp torchair pd fix (#3449)" (#3500)
This reverts commit b0ae203e72.

### What this PR does / why we need it?
The fix is not ready yet, conflict with #3411 need to revert first. Will
fix this issue later

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-17 09:42:48 +08:00
xuyexiong
b0ae203e72 [BUGFIX] Mtp torchair pd fix (#3449)
### What this PR does / why we need it?
In memory of https://github.com/vllm-project/vllm-ascend/pull/2610
In the pd Disaggregation scenario, the first token of the inference
after the d node receives the kv follows the eager mode.

Fixes:
Running with MTP torchair graph mode with Prefilling Decoding
Disaggregation , if all requests processed by the D node are requests
just transmitted from the P node, it will break the torchair graph.

Reason: During PD Disaggregation , the P node only transmits the KV
cache and prompt to the D node, not the actual tokens inferred (neither
the main model tokens nor the MTP tokens are transmitted). Therefore,
the D node will treat this request as one without MTP tokens for
inference (seq_len=1).
The community does not have graph mode issues because the community's
attention has a seq_len=1 for each batch during the decode phase.
We have issues because the graph mode pads according to processing 2
tokens per request. When there are some seq_len=1 and some seq_len=2,
padding is done at the end. If all requests received by the D node are
seq_len=1, padding cannot be performed normally according to the
attention's fia operator constraints.

Solution:

The kv consumer uses extra torchair graph padding to avoid breaking FIA
graph constrains (The one this PR implemented).

The kv producer provides the correct tokens to the kv consumer, so that
our graph mode constraints are not broken, and all logic is the same as
the PD mixed deployment . Since we are using the community scheduler,
the modification requires patching the vllm scheduler, but
theoretically, performance should be better. (Maybe later )

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-16 09:03:49 +08:00
Mengqing Cao
8abe517870 [Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432)
### What this PR does / why we need it?
Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch.

The final goal is to remove all the patches and align the code arch to
vllm, thus we need to do the following work in next prs.
TODO:
- [x] remove patch on attention spec
- [ ] refactor the kvcache creation logic

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
1. CI passed with existing test.
2. Test pass with deepseek-v3.2-exp


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-15 17:48:58 +08:00
wangxiyuan
81bd6e4c99 Add DeepSeek V3.2 support (#3270)
### What this PR does / why we need it?

This PR added the initial DeepSeek V3.2 support with [vLLM
v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0)
(not released yet). We will complete vLLM adaptation as soon as
possible. This feature will be ready in recent 1-2 days.

Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 .

### Does this PR introduce _any_ user-facing change?
Yes!

### How was this patch tested?
CI passed and Run deepseek doc soon.


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-30 03:25:58 +08:00
linfeng-yuan
d01fd1d1c3 [misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074)
### What this PR does / why we need it?
This miscellaneous​ contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatches after o_proj caused by a work-aroud
change in NPUModelRunner.
3) avoid unnecessary decline of kv_cache memory (default: 64MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding
logic of `mc2_mask` is incompatible with input hidden_states when
`shared_expert_dp` enabled.

Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as
`v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline
compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
 
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
E2E vllm serving about deepseek_mtp with torchair graph mode and
`enable_shared_expert_dp` with eager mode. Large ep deployments are also
tested with this PR.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-23 14:52:42 +08:00
Yizhou
338231acaf [Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Captureing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 17:14:28 +08:00
Lucas Kabela
53ecd89e8f [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE (#2969)
### What this PR does / why we need it?
This PR prepares for deleting this enviroment variable,
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vllm requires `fullgraph=True`
to run

- Fixes https://github.com/vllm-project/vllm/issues/21834

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
See CI 

- vLLM version: v0.10.2
- vLLM main:
99cc41ad50

---------

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2025-09-20 08:22:30 +08:00
xuyexiong
6681dde902 [Feat][Graph] Support MTP for ACL Graph (#2932)
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-18 14:05:33 +08:00
1Fire4
1f6465c399 Add an option of enable frozen parameter (#2869)
### What this PR does / why we need it?
Add an option of enable  frozen parameter

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb

Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
2025-09-17 12:00:44 +08:00
xuyexiong
ae758dda05 [Bugfix] Fix mtp torchair in pd Disaggregation scenario (#2951)
### What this PR does / why we need it?
1. In memory of #2509, Fix mtp torchair in pd Disaggregation scenario
2. fix mla bug in SpecDecoding Scenario, since num_decodes !=
num_decode_tokens


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
5206ab20ba

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-17 09:07:58 +08:00
linfeng-yuan
1c5900327b [refactor] refactor deepseek-related files (#2849)
### What this PR does / why we need it?
This PR deletes ~2K lines of code about deepseek modeling. It falls back
CustomDeepseekV2 modules to original vllm implementations and adapts
some modifications in vllm about deepseek and moe.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E  vllm serving with torchair graph mode and eager mode.

- vLLM version: v0.10.2
- vLLM main:
759ef49b15

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-16 14:13:07 +08:00
wangxiyuan
c556038ef0 [New model] Qwen3-next support (#2917)
### What this PR does / why we need it?
Add Qwen3-next support.

### Does this PR introduce _any_ user-facing change?
Yes, users can use Qwen3 next.
Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the
tutorial will be ready in
[here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html)

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884

Co-Authored-By: Angazenn <supperccell@163.com>
Co-Authored-By: zzzzwwjj <1183291235@qq.com>
Co-Authored-By: MengqingCao <cmq0113@163.com>
Co-Authored-By: linfeng-yuan <1102311262@qq.com>
Co-Authored-By: hust17yixuan <303660421@qq.com>
Co-Authored-By: SunnyLee219 <3294305115@qq.com>
Co-Authored-By: maoxx241 <maoxx241@umn.edu>


- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: hust17yixuan <303660421@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: hust17yixuan <303660421@qq.com>
2025-09-16 01:17:42 +08:00
realliujiaxu
d3c3538ddc [Bugfix]fix bug when graph_size is not divisible by tp_size (#2719)
### What this PR does / why we need it?
fix https://github.com/vllm-project/vllm-ascend/issues/2702
- A2: skip graph_size update that makes it to tp_size because
dispatch/combine op support different batch size across EP ranks
- A3: add `max_num_reqs = max(new_graph_batch_sizes)` to fix graph_size
and max_num_reqs mismatch

### Does this PR introduce _any_ user-facing change?
Nope
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-08 14:52:33 +08:00
1092626063
5b3646ab21 [FEATURE][MTP] Support MTP > 1 (#2708)
### What this PR does / why we need it?
[RFC:Support MTP > 1 for
DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745)

- [x] dp1 tp16
- [x] dp4 tp4
- [x] dp2 tp 8
- [x] torchair graph

- vLLM version: v0.10.1.1
- vLLM main:
c9f7081f9c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-09-05 09:11:22 +08:00
linfeng-yuan
90a75a90a9 [bugfix] fix torchair runtime error caused by configuration mismtaches and file missing (#2532)
### What this PR does / why we need it?
This PR ports #2312 #2506 #2531 to main branch.

Original implementation of torchair caching forces users to make
everything prepared, fix all the configuration and enable
`use_cached_npu_graph`, and it might cause some problems confusing to
understand and tackle for users. It is better to compile the graph twice
instead of reusing the old kvcaches and cached torchair graph. And the
extra duration time is acceptable. Additionally, this pr fixes a
recompilation problem of torchair graph mode caused by
`running_in_graph` variable in `AscendMLATorchairImpl`.

### Does this PR introduce _any_ user-facing change?
If users want to enabling torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config`. Without
`use_cached_kv_cache_bytes`, we'll compile torchair computation graph
twice to avoid runtime error caused by configuration mismtaches (the
second compilation will be much faster). Additionally, we've made a
change to how the TORCHAIR_CACHE_HOME enviroment variable is utilized to
enhance safety and prevent accidental file deletion by adding a suffix
directory.

### How was this patch tested?
CI and e2e vllm serving pass.


- vLLM version: v0.10.1.1
- vLLM main:
70549c1245

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-03 17:56:12 +08:00
panchao-hub
ea53f9076e support torchair mode (#2641)
### What this PR does / why we need it?
support torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
5438967fbc

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-09-01 15:49:07 +08:00
Wang Yixuan
c2c97f3079 [5/N][refactor]add torchair rotary ops (#2559)
### What this PR does / why we need it?
Move torchair related rotary ops into torchair dir to make the code
clear. Next step we'll remove all torchair related code outside of
torchair rotary ops.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-01 09:09:21 +08:00
yiz-liu
aadc75c247 [Fix] Resolve data-parallel (DP) assertion errors in TorchAir (#2626)
### What this PR does / why we need it?
It is confirmed that `num_input_tokens` must be assigned the value of
`maybe_padded_num_tokens` under all circumstances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Waiting for daily test for TorchAir.
- vLLM version: v0.10.1.1
- vLLM main:
006477e60b

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-29 16:06:49 +08:00
yiz-liu
dfc7eb39ad [Fix] Fix DP-related padding logic (#2582)
### What this PR does / why we need it?
The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage within the input preparation
process. This change enables us to utilize a single all-reduce
operation, maximizing synchronization efficiency as early as possible.

The logic for synchronizing metadata—such as the number of tokens,
prefill status, and DBO status—across data parallel (DP) ranks has now
been unified and simplified.

For performance improvements, the all-reduce operation has been switched
from the `gloo` backend to the `npu` backend, which results in an
reduction of several milliseconds per step (**approximately 10%
performance gain for TPOT!**).

Additionally, the multi-DP server hang issue has been resolved, ensuring
no more hangs occur when `num_requests < dp_size`. Alas, a relief.

Finally, the miscalculated memory usage issue has been addressed by
removing the unnecessary `DummyCommImpl`, allowing the system to use the
real communication method when determining available memory.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Maybe we should add an test case for multi-DP online server?
@MengqingCao


- vLLM version: v0.10.1.1
- vLLM main:
c5d004aaaf

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-28 19:39:58 +08:00
Wang Yixuan
20a7bc4b71 [3/N][refactor] refactoer quantization (#2504)
### What this PR does / why we need it?
Move torchair related qunatization section into torchair dir to make the
code clear. Next step we'll remove all torchair related code outside of
torchair quantization.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
959783fb99

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-27 10:45:50 +08:00
weiguihua2
acdc53c2f6 [Bugfix] Fix the bug of cos invalid shape when dp (#2558)
### What this PR does / why we need it?
 Fix the bug of cos invalid shape when dp

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
1fdc732419

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-27 10:36:23 +08:00
wangxiyuan
de7649492d [Refactor] cleanup converting_weight_acl_format_format (#2482)
move maybe_converting_weight_acl_format_format to torchair module, it's
only used with 310p+torchair

- vLLM version: v0.10.1.1
- vLLM main:
49ab23b3cc

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-25 19:48:55 +08:00
weiguihua2
0dca4c6dbd refact runner model v1 (#2461)
refact model runner v1

### What this PR does / why we need it?
1. Separate the execute model logic from the prepare input logic
2. Disassemble the torchchair in model runner v1

- vLLM version: v0.10.0
- vLLM main:
68fcd3fa73

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-21 08:54:57 +08:00
Mengqing Cao
1327f9be1c Fix some ci issue and refactor modelrunner (#2445)
### What this PR does / why we need it?
Fix some ci issue and refactor modelrunner

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
4d9c61993a

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-20 09:01:04 +08:00
linfeng-yuan
3fc31ee1cb [1/N][refactor] torchair deepseek modeling refactor (#2384)
### What this PR does / why we need it?

Move torchair related model arch into torchair moduel to make the code
clear. Next step we'll remove all torchair related code outside of
torchair moduel.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.10.0
- vLLM main:
08d5f7113a

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-08-18 15:00:37 +08:00
wangxiyuan
1a70564e7c [5/N][Refactor] torchair model runner refactor (#2216)
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203

What's this PR do:

create common function `_capture_model` for capture_model

- vLLM version: v0.10.0
- vLLM main:
1891a265d3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-12 14:24:50 +08:00
wangxiyuan
c8b0f5f799 [4/N][Refactor] torchair model runner refactor (#2208)
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203, this is the first PR.

What's this PR do:

create common function `_convert_torch_foramt`  for initialize_kv_cache


- vLLM version: v0.10.0
- vLLM main:
14a5d903ab

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-11 21:39:24 +08:00
wangxiyuan
881e36d6a9 [3/N][Refactor] torchair model runner refactor (#2207)
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203, this is the first PR.

What's this PR do:

create common function `_build_attention_metadata` and
`_generate_dummy_run_hidden_states` for dummy_run

- vLLM version: v0.10.0
- vLLM main:
ebf7605b0d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-11 18:03:19 +08:00
wangxiyuan
1ab15414bb [2/N][Refactor] torchair model runner refactor (#2204)
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203

What's this PR do:

move `torchair` related logic into `_get_forward_metadata_across_dp` and
override it in torchair model runner


- vLLM version: v0.10.0
- vLLM main:
1b99028069

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-11 14:06:49 +08:00
wangxiyuan
292fb8f696 [1/N][Refactor] torchair model runner refactor (#2205)
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203, this is the first PR.

What this PR does:

create the new torchair model runner, more function will be added later


- vLLM version: v0.10.0
- vLLM main:
586f286789

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 18:43:04 +08:00