### What this PR does / why we need it?
This PR aims to fix setting of `speculative_config.enforce_eager` in
deepseek v3.2 mtp. The point is that, vllm sets
`speculative_config.enforce_eager` as True if using deepseek_v32 with
mtp. Since we support graph mode, we simply ignore it here. However,
this fix will also implicitly ignore user setting of
`speculative_config.enforce_eager`, we need to take care and remove it
once vllm supports this feature.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Cancel the embeddings sharing when the embeddings of main model and the
embeddings of eagle model are different.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Cause i don't have `Meta-Llama-3.1-8B-Instruc`t locally, i commented it
and run:
```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```
The output is fine:
```text
.
======================================================================================================================== warnings summary =========================================================================================================================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================================== 3 passed, 1 skipped, 2 warnings in 196.19s (0:03:16) =======================================================================================================
```
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
4/N EAGLE refactor plan devided into many parts, this PR is the first
change, which modifies the attn_metadata update method by modifying
common_metadata and then rebuilding the code.
### Does this PR introduce _any_ user-facing change?
ut
### How was this patch tested?
no
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: Zetong Li <slippersss@126.com>
Co-authored-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
1. Fix DeepSeek-V3.2-W8A8-Pruning mtp
2. Add DeepSeek-V3.2-W8A8-Pruning e2e test
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR aims to extract common methods from eagle_proposer and
mtp_proposer. This is a small step towards merging eagle and mtp.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp.
The problem is described in
https://github.com/vllm-project/vllm-ascend/issues/5825.
It also adds a much more precise way to determine whether drafter should
use `sp` or not.
Also, it changes the `eager` of drafter to be a real `eager` in frontend
to avoid a `fx-graph` problem.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
For simpilicity, we test it as in
https://github.com/vllm-project/vllm-ascend/issues/5825.
And we get the same result of `eagle3` with `sp` disabled.
```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```
* vLLM version: v0.13.0
* vLLM main:
2f4e6548ef
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
According to the official documentation, the parameter
"draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3
model. However, based on actual debugging, it was found that the number
of tensor parallelisms (tp) of the Eagle model is consistent with that
of the target model. The setting of tp for the draft model did not take
effect as expected.
**Note:** This feature has not been superimposed and tested with `sp`
and `dp`. It will be adapted later
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```python
from vllm import LLM, SamplingParams
def main():
prompts = [
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=4,
gpu_memory_utilization=0.9,
enforce_eager=True,
speculative_config={
"method": "eagle3",
"model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
"draft_tensor_parallel_size": 1,
"num_speculative_tokens": 3,
},
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
print(f"Outputs: {outputs}")
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1Fixesvllm-project/vllm#31345
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
mlapo in deepseek is a huge performance improvement in decode, this pr
support pcp & dcp with mlapo
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
This addresses the issue brought up by #5356 and #4963, and we believe
the unnecessary conditions are the root cause.
Change the unpad trigger to be driven by actual size mismatches
(num_reqs vs base_num_reqs or scheduled vs input token counts) rather
than specific speculative-method flags. Then remove brittle workarounds
that forced request counts and sliced query start locations.
This prevents incorrect indexing and length mismatches during
speculative decoding and makes metadata unpadding more robust across
scheduling modes.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Tested by existing cases.
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Adapted sp to eagle3.
There may still be some problems, e.g., accuracy in some scenes,
`sp`+`dp`...
We will fix them later.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
We tested it mainly in a new `e2e`.
```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```
```text
.
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) =============
```
It passed.
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Import global var form vllm instead of overwirte it, so that we could
use the correct global variant value
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
## What this PR does / why we need it?
This PR fixes the `AttentionMaskBuilder` singleton initialization issue
introduced in PR #4779 and removes the unused `pcp_prefill_mask` field.
### Background
After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton`
decorator, the class constructor now requires a `device` parameter.
However, two initialization sites were still using the old parameterless
constructor, causing failures.
### Changes
1. **Fix singleton initialization**
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendMLAMetadataBuilder.__init__()`
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendAttentionMetadataBuilder.__init__()`
2. **Remove unused field**
- Removed `pcp_prefill_mask` field from
`AscendPrefillContextParallelMetadata` (never used in codebase)
- Updated related test assertions
### Related
- Issue #5463
- PR #4779 (Unify all mask generation methods)
- PR #5389 (Make AttentionMaskBuilder singleton)
## Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring.
## How was this patch tested?
- ✅ Local testing: No linter errors
- ✅ Unit tests for attention modules verified
- ⏳ CI pipeline
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
When launching the service in the scenario where the
cudagraph_mode is set to FULL and Eagle3 acceleration is enabled for
inference, an error in fia will cause graph capture to fail. This PR
fixes the issue.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
This PR aims to delete redundant methods in mtp_proposer. All the
deleted methods now can be found in eagle_proposer. We also remove some
methods in eagle_proposer since they are identical to those in
vllm-eagle.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Replace multiple PyTorch operations with a fused Triton kernel to
determine token indices for sampling during speculative decoding. This
reduces kernel launch overhead and memory traffic, improving overall
performance on Ascend hardware.
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Refactor eagle and mtp function: load_model and generate_token_ids
2. Remove redundant code in mtp and eagle file
3. Refactor the UT of file
2/N of Refactor and merge mtp and eagle
Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut and tests
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Supported to use full-graph with Qwen3-Next-MTP.
In detail, we adatpted `AscendAttentionState.ChunkedPrefill` in main
model, and also adapted `AscendAttentionState.ChunkedPrefill` in mtp
model.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
We changed the test of Qwen3-Next-MTP in
`tests/e2e/multicard/test_qwen3_next.py` to make it a test of
`FULL_DECODE_ONLY`. Then run `pytest -s
tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`.
And this test passed.
```text
.
================================================================================================================================= warnings summary =================================================================================================================================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) =====================================================================================================================
```
- vLLM version: v0.13.0
- vLLM main:
5326c89803
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Refactor pcp& dcp related code. we use pcp_manager class to Unifiy
Manage pcp & dcp . as we do this , many code can be deleted from
model_runner, and can avoid break pcp & dcp by other developments.
RFC:https://github.com/vllm-project/vllm-ascend/issues/5449
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
This PR aims to refactor eagle-related modules in vllm-ascend.
This is the starting PR of eagle refactoring. Provided with vllm-eagle,
ascend-eagle and ascend-mtp, we first let ascend-mtp inherit from
ascend-eagle and let ascend-eagle inherit from vllm-eagle. As a
initialization, we just delete `__init__` in mtp_proposer and simplify
the corresponding logic in eagle_proposer.
Based on "vllm-eagle <----- ascend-eagle <----- ascend-mtp", our target
is to gradually delete ascend-mtp and enable ascend-eagle to converge to
vllm-eagle. So the main workspace is eagle_proposer. In this way, we
hope that contributors can concurrently refactor eagle.
Incoming changes:
1. delete common methods in vllm-eagle & ascend-eagle & ascend-mtp
2. delete `load_model` in mtp_proposer
3. delete `dummy_run` and `propose` in mtp_proposer
4. ......
RFC: #5467
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
We support to use full graph with eagle.
Change list:
1. Distinguish between processing graph_params and draft_graph_params in
attention_v1.
2. Adapt the full-graph mode in eagle_proposer, include:
1). If use full graph, make Fullgraph Wrapper when load model.
2). Build a new meatadata, set running mode in FULL and mark attention
update in dummy_run when in Fullgraph mode.
3). Fixed and fill any attn_metadata, such as
attn_metadata.slot_mapping.
4). Add a descriptor.
5). Set running mode and triggered update metadata.
3. Trans is_mtp_model to is_draft_model, and add the update of
workspace.
NOTE:
When set async_scheduling=True, the draft model will enforce execution
in eager mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
### What this PR does / why we need it?
This is to prepare for the migration to vLLM's `EagleProposer`, it does
not have `name` attribution. Also it's a breakdown of #5100 .
Introduces logic to determine whether eagle3 heads require auxiliary
hidden states based on configuration, ensuring consistent handling
across related components. Prevents incorrect assumptions for eagle3
variants that do not use auxiliary outputs, improving compatibility and
correctness.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR deletes `cudagraph_batch_sizes` in `MtpProposer` and reuses the
one in `NPUModelRunner`.
During our deployment of DeepSeek-V3.2 with MTP across machines 2P2D and
conducting AISBench stress testing, an error occurred (see below). After
investigation, we found that
`compilation_config.cudagraph_capture_sizes` is modified by
`adjust_cudagraph_sizes_for_spec_decode` in `NPUModelRunner`. This
modification only updates `cudagraph_batch_sizes` in `NPUModelRunner`
but is not synchronized to `MtpProposer`. After discussion (CC @yiz-liu)
, we believe it is unnecessary to maintain `cudagraph_batch_sizes` in
`MtpProposer`; it should directly use the variable from
`NPUModelRunner`.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The metadata data class contains an excessive number of variables. We
will inherit the metadata of the community and simultaneously remove
some variables that are no longer needed at present.
Todo:
1. remove attn_state partly.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
This PR aims to fix failure of `enable_force_load_balance` caused by
missing `in_profile_run` in `dummy_run` of mtp_proposer.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Extracts repeated `attn_metadata[layer_name].decode` access into a
single variable to improve code readability and reduce redundancy.
Uses `getattr` with a default value to safely access the decode
attribute, making the code more defensive against potential attribute
errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Synchronize the host query_start_loc with device values to prevent shape
mismatches when not enable async scheduling.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
delete sekf.in_profile_run in model_runner to make EPLB works as expect
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
When enable the async_scheduling, in large scale EP scene, mtp module
goes to eagler mode, which results in the mismatch of
seq_lens_list、block_table. So adapt the judgement before the draft model
forward.
fix#4986
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
This PR add `qkv_rmsnorm_rope` operator and introduces a graph fusion
pass for `qknorm_rope` operations. The implementation includes a new
configuration flag, a pattern matching pass using
`torch._inductor.pattern_matcher`, and a custom Triton kernel for the
fused operation.
Co-authored-by: Angazenn
[supperccell@163.com](mailto:supperccell@163.com)
### Does this PR introduce _any_ user-facing change?
Yes, add new additional_config
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
We refactored the eagle_proposer.py to adapt the framework of eagle.py
in vllm-v0.12.0, to support the logit of padded drafter batch and
async-scheduler.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
(1)refactor npu_model_runner for profile_run
(2) move _select_moe_comm_method to ascend_forward_context
(3) delete _init_model_kwargs in npu_model_runner
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Na
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
### What this PR does / why we need it?
Fix the bug " TypeError: 'NoneType' object is not iterable' " in
vllm_ascend/compilation/acl_graph.py
The reason of that is the attn_metadata is none in the dummy_run of MTP.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
The `attn_metadata` is not used by any draft proposer, so we can remove
it.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
In PR #4188, a small bug was introduced that caused sfa-cp to be unable
to find the global_pp_size parameter during initialization, and this PR
fixed the issue.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the
routing kernel completes correctly instead of exiting early in certain
paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher
and prepare/finalize routines, aligning MoE communication with the MC2
algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back
to `ALLTOALL` until the MoE communication type selection logic is fully
finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in
`NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`,
which fixes failures in multi-node / larger-EP configurations while
keeping MC2 routing under the configured token capacity.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: mojave2 <chenchen145@huawei.com>
### What this PR does / why we need it?
Corrects attention metadata size for MTP when both asynchronous
scheduling and full ACL graph mode are enabled. This prevents potential
size mismatches during execution.
Additionally, improves the robustness of calculating token sample
indices by explicitly aligning tensor shapes.
Finally, prevents padding when the number of input tokens exceeds the
maximum ACL graph batch size to avoid out-of-bounds errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Need to add corresponding test case ASAP.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix
the merge conflict
### What this PR does / why we need it?
Currently, the all_reduce operation in _sync_metadata_across_dp is
performed with gloo backend which is extremely time-consuming when
DPEngineCores are in different nodes. This operation cannot be ignored
by async scheduling in multi-node-scenarios with speculative decoding
(e.g., EAGLE, mtp).
This pr eliminates the all_reduce operation for D Nodes and change the
input parameter of MoEDispatch & MoeCombine operators to make MC2EP
support different num_tokens across all ranks.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios
while enabling async scheduling. This pr can remove cross-node
all_reduce with gloo backend and further reduce latency with correct
accuracy.
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
refactor npu_modelrunner, we should be close to gpu_modelrunner
### Does this PR introduce _any_ user-facing change?
NO
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
This PR clean up useless torchair logic in model runner. The moge doc is
only for torchair, it can be removed as well.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>