### What this PR does / why we need it?
Fix mtp tochair bug cuased by #2719
Since FIA need extra space for padding, we need to enforce
`self.max_num_seqs > self.scheduler_config.max_num_seqs` in KV consumer
+ MTP
This means that, `self.max_num_seqs` **>** the actual maximum requests
(`self.scheduler_config.max_num_seqs`)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The `force_attention` parameter is designed for flash infer kernel
warmup, we don't actually need it on Ascend device (at least for
now).And it tends to make things more complicated. So we replace the
`force_attention` parameter with `aclgraph_runtime_mode` in the
attention metadata creation logic.
This change makes the control flow more explicit by directly using the
graph runtime mode to determine how to build attention metadata, rather
than relying on an intermediate boolean flag. This simplification
removes redundant logic and clarifies the conditions for building
attention metadata for full decode graph mode.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
DP + `FULL_DECODE_ONLY` + online serving.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Optimize torchair kv_consumer padding logic. Only pad when it is spec
decoding
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
In memory of https://github.com/vllm-project/vllm-ascend/pull/2610 and
#3449 Fix Mtp torchair pd bug.
In the pd Disaggregation scenario, the first token of the inference
after the d node receives the kv follows the eager mode.
Fixes:
Running with MTP torchair graph mode with Prefilling Decoding
Disaggregation , if all requests processed by the D node are requests
just transmitted from the P node, it will break the torchair graph.
Reason: During PD Disaggregation , the P node only transmits the KV
cache and prompt to the D node, not the actual tokens inferred (neither
the main model tokens nor the MTP tokens are transmitted). Therefore,
the D node will treat this request as one without MTP tokens for
inference (seq_len=1).
The community does not have graph mode issues because the community's
attention has a seq_len=1 for each batch during the decode phase.
We have issues because the graph mode pads according to processing 2
tokens per request. When there are some seq_len=1 and some seq_len=2,
padding is done at the end. If all requests received by the D node are
seq_len=1, padding cannot be performed normally according to the
attention's fia operator constraints.
Solution:
The kv consumer uses extra torchair graph padding to avoid breaking FIA
graph constrains (The one this PR implemented).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
This reverts commit b0ae203e72.
### What this PR does / why we need it?
The fix is not ready yet, conflict with #3411 need to revert first. Will
fix this issue later
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
In memory of https://github.com/vllm-project/vllm-ascend/pull/2610
In the pd Disaggregation scenario, the first token of the inference
after the d node receives the kv follows the eager mode.
Fixes:
Running with MTP torchair graph mode with Prefilling Decoding
Disaggregation , if all requests processed by the D node are requests
just transmitted from the P node, it will break the torchair graph.
Reason: During PD Disaggregation , the P node only transmits the KV
cache and prompt to the D node, not the actual tokens inferred (neither
the main model tokens nor the MTP tokens are transmitted). Therefore,
the D node will treat this request as one without MTP tokens for
inference (seq_len=1).
The community does not have graph mode issues because the community's
attention has a seq_len=1 for each batch during the decode phase.
We have issues because the graph mode pads according to processing 2
tokens per request. When there are some seq_len=1 and some seq_len=2,
padding is done at the end. If all requests received by the D node are
seq_len=1, padding cannot be performed normally according to the
attention's fia operator constraints.
Solution:
The kv consumer uses extra torchair graph padding to avoid breaking FIA
graph constrains (The one this PR implemented).
The kv producer provides the correct tokens to the kv consumer, so that
our graph mode constraints are not broken, and all logic is the same as
the PD mixed deployment . Since we are using the community scheduler,
the modification requires patching the vllm scheduler, but
theoretically, performance should be better. (Maybe later )
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch.
The final goal is to remove all the patches and align the code arch to
vllm, thus we need to do the following work in next prs.
TODO:
- [x] remove patch on attention spec
- [ ] refactor the kvcache creation logic
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
1. CI passed with existing test.
2. Test pass with deepseek-v3.2-exp
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This miscellaneous contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatches after o_proj caused by a work-aroud
change in NPUModelRunner.
3) avoid unnecessary decline of kv_cache memory (default: 64MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding
logic of `mc2_mask` is incompatible with input hidden_states when
`shared_expert_dp` enabled.
Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as
`v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline
compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving about deepseek_mtp with torchair graph mode and
`enable_shared_expert_dp` with eager mode. Large ep deployments are also
tested with this PR.
- vLLM version: v0.10.2
- vLLM main:
5aeb925452
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.
### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:
* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Captureing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.
**Known issues:**
1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.
There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.
This is essentially a port of #1503 and #1677, but includes two major
changes:
1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.
### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
"cudagraph_mode": "FULL_DECODE_ONLY",
},
```
### How was this patch tested?
Tests included.
- vLLM version: v0.10.2
- vLLM main:
9607d5eb44
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR prepares for deleting this enviroment variable,
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vllm requires `fullgraph=True`
to run
- Fixes https://github.com/vllm-project/vllm/issues/21834
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See CI
- vLLM version: v0.10.2
- vLLM main:
99cc41ad50
---------
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
2b85697031
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
Add an option of enable frozen parameter
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb
Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
### What this PR does / why we need it?
1. In memory of #2509, Fix mtp torchair in pd Disaggregation scenario
2. fix mla bug in SpecDecoding Scenario, since num_decodes !=
num_decode_tokens
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5206ab20ba
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
This PR deletes ~2K lines of code about deepseek modeling. It falls back
CustomDeepseekV2 modules to original vllm implementations and adapts
some modifications in vllm about deepseek and moe.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with torchair graph mode and eager mode.
- vLLM version: v0.10.2
- vLLM main:
759ef49b15
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
fix https://github.com/vllm-project/vllm-ascend/issues/2702
- A2: skip graph_size update that makes it to tp_size because
dispatch/combine op support different batch size across EP ranks
- A3: add `max_num_reqs = max(new_graph_batch_sizes)` to fix graph_size
and max_num_reqs mismatch
### Does this PR introduce _any_ user-facing change?
Nope
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
This PR ports #2312#2506#2531 to main branch.
Original implementation of torchair caching forces users to make
everything prepared, fix all the configuration and enable
`use_cached_npu_graph`, and it might cause some problems confusing to
understand and tackle for users. It is better to compile the graph twice
instead of reusing the old kvcaches and cached torchair graph. And the
extra duration time is acceptable. Additionally, this pr fixes a
recompilation problem of torchair graph mode caused by
`running_in_graph` variable in `AscendMLATorchairImpl`.
### Does this PR introduce _any_ user-facing change?
If users want to enabling torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config`. Without
`use_cached_kv_cache_bytes`, we'll compile torchair computation graph
twice to avoid runtime error caused by configuration mismtaches (the
second compilation will be much faster). Additionally, we've made a
change to how the TORCHAIR_CACHE_HOME enviroment variable is utilized to
enhance safety and prevent accidental file deletion by adding a suffix
directory.
### How was this patch tested?
CI and e2e vllm serving pass.
- vLLM version: v0.10.1.1
- vLLM main:
70549c1245
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
support torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
5438967fbc
Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
### What this PR does / why we need it?
Move torchair related rotary ops into torchair dir to make the code
clear. Next step we'll remove all torchair related code outside of
torchair rotary ops.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
It is confirmed that `num_input_tokens` must be assigned the value of
`maybe_padded_num_tokens` under all circumstances.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Waiting for daily test for TorchAir.
- vLLM version: v0.10.1.1
- vLLM main:
006477e60b
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage within the input preparation
process. This change enables us to utilize a single all-reduce
operation, maximizing synchronization efficiency as early as possible.
The logic for synchronizing metadata—such as the number of tokens,
prefill status, and DBO status—across data parallel (DP) ranks has now
been unified and simplified.
For performance improvements, the all-reduce operation has been switched
from the `gloo` backend to the `npu` backend, which results in an
reduction of several milliseconds per step (**approximately 10%
performance gain for TPOT!**).
Additionally, the multi-DP server hang issue has been resolved, ensuring
no more hangs occur when `num_requests < dp_size`. Alas, a relief.
Finally, the miscalculated memory usage issue has been addressed by
removing the unnecessary `DummyCommImpl`, allowing the system to use the
real communication method when determining available memory.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Maybe we should add an test case for multi-DP online server?
@MengqingCao
- vLLM version: v0.10.1.1
- vLLM main:
c5d004aaaf
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Move torchair related qunatization section into torchair dir to make the
code clear. Next step we'll remove all torchair related code outside of
torchair quantization.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
959783fb99
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Fix the bug of cos invalid shape when dp
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
1fdc732419
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
refact model runner v1
### What this PR does / why we need it?
1. Separate the execute model logic from the prepare input logic
2. Disassemble the torchchair in model runner v1
- vLLM version: v0.10.0
- vLLM main:
68fcd3fa73
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Fix some ci issue and refactor modelrunner
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
4d9c61993a
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Move torchair related model arch into torchair moduel to make the code
clear. Next step we'll remove all torchair related code outside of
torchair moduel.
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.10.0
- vLLM main:
08d5f7113a
Signed-off-by: linfeng-yuan <1102311262@qq.com>
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203
What's this PR do:
create common function `_capture_model` for capture_model
- vLLM version: v0.10.0
- vLLM main:
1891a265d3
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203, this is the first PR.
What's this PR do:
create common function `_convert_torch_foramt` for initialize_kv_cache
- vLLM version: v0.10.0
- vLLM main:
14a5d903ab
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203, this is the first PR.
What's this PR do:
create common function `_build_attention_metadata` and
`_generate_dummy_run_hidden_states` for dummy_run
- vLLM version: v0.10.0
- vLLM main:
ebf7605b0d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203
What's this PR do:
move `torchair` related logic into `_get_forward_metadata_across_dp` and
override it in torchair model runner
- vLLM version: v0.10.0
- vLLM main:
1b99028069
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is lot of torchair code in model runner leading the code hard for
maintenance. We'll create new torchair_model_runner to split torchair
related logic. Following the workflow #2203, this is the first PR.
What this PR does:
create the new torchair model runner, more function will be added later
- vLLM version: v0.10.0
- vLLM main:
586f286789
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>