### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler
slicing approach. However, in the single-ops path and in mixed PD batches,
the decode slot_mapping still contained the -1 padding while also sharing
the slicing approach, which resulted in an incorrect slot_mapping. This PR
resolves the issue; the logic will be further consolidated in subsequent
refactoring PRs.
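A minimal illustration of the failure mode, using made-up tensors rather than the actual PCP/MLAPO code:
```python
import torch

# Hypothetical decode slot_mapping: real slots interleaved with -1 padding
# marking duplicate tokens (not taken from the actual PCP/MLAPO code).
slot_mapping_with_padding = torch.tensor([7, -1, 12, -1, 18, -1])
num_decode_tokens = 3

# Naive slicing keeps the -1 entries and drops valid slots.
wrong = slot_mapping_with_padding[:num_decode_tokens]  # tensor([ 7, -1, 12])

# The -1 padding has to be removed before (or instead of) slicing.
right = slot_mapping_with_padding[slot_mapping_with_padding >= 0][:num_decode_tokens]
print(wrong.tolist(), right.tolist())  # [7, -1, 12] [7, 12, 18]
```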
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
Qwen3-VL-MoE EAGLE support for vLLM-Ascend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The patch tested with Qwen3-VL-30B-A3B-Instruct model
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: Sergey_Zlobin <sirg_zlobin@mail.ru>
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where KV cache on the Prefill (P) side was not being properly released.
The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P-side pass its final request_id as
remote_request_id to the D-side via kv_transfer_param. However, during
KV cache cleanup, the D-side incorrectly used the local req_id (instead
of remote_request_id) to select the target P-side rank. This mismatch
caused the P-side KV cache to remain unreleased on certain ranks,
leading to memory leaks. This PR corrects the logic to use
remote_request_id consistently when determining the P-side rank.
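A rough sketch of the corrected cleanup lookup, with hypothetical helper names (`select_prefill_rank`, `release_prefill_kv_cache`) standing in for the real connector code:
```python
def select_prefill_rank(request_id: str, num_prefill_ranks: int) -> int:
    # Hypothetical stand-in for how a P-side rank is derived from a request id.
    return hash(request_id) % num_prefill_ranks


def release_prefill_kv_cache(local_req_id: str, kv_transfer_params: dict,
                             num_prefill_ranks: int) -> int:
    # Before the fix: rank derived from the D-side local id, which carries a
    # different random suffix than the id the P-side actually stored under.
    # wrong_rank = select_prefill_rank(local_req_id, num_prefill_ranks)

    # After the fix: use the remote_request_id the P side passed back via
    # kv_transfer_params, so both sides agree on the owning rank.
    remote_req_id = kv_transfer_params["remote_request_id"]
    return select_prefill_rank(remote_req_id, num_prefill_ranks)
```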
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: ghphotoframe <854746559@qq.com>
We previously patched deepseek because we noticed an AssertionError raised
by transformers. After the transformers upgrade, the patch is no longer
needed, so let's remove it.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes
Qwen3-MoE models also go into `SharedFusedMoE`, while the current
implementation of our `AscendSharedFusedMoE` assumes shared_experts
always exist. This PR adds checks to
`multistream_overlap_shared_expert` and `multistream_overlap_gate` so
that these features are only enabled when shared experts exist.
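A minimal sketch of the added guard, using an illustrative helper rather than the real `AscendSharedFusedMoE` code:
```python
def resolve_moe_overlap_flags(shared_experts, overlap_shared_expert: bool,
                              overlap_gate: bool) -> tuple[bool, bool]:
    # Hypothetical helper: only keep the overlap features enabled when shared
    # experts actually exist (Qwen3-MoE now enters SharedFusedMoE without any).
    has_shared = shared_experts is not None
    return (overlap_shared_expert and has_shared,
            overlap_gate and has_shared)
```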
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All ci passed
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Disable enable_shared_expert_dp by default if tensor_parallel_size=1
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR rebases RecomputeScheduler codebase to vllm tags/v0.14.1 in
order to fix the incompatibility with vllm's original Scheduler and
AsyncScheduler. Main changes focus on multimodal model and speculative
decoding parts.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We tested this PR with 2P1D E2E serving test case.
- vLLM version: v0.14.1
- vLLM main:
d68209402d
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Adds a CUDA graph profiling stats field to the execution state and
updates the NPU model runner to set, unpack, and forward those stats
during execution. This preserves CUDA graph metrics across state
transitions, improving observability for later use and diagnostics.
### Does this PR introduce _any_ user-facing change?
Enable this by setting
```python
llm = LLM(
    ...
    disable_log_stats=False,
    cudagraph_metrics=True,
    ...
)
```
or by passing `--cudagraph-metrics`, and make sure log stats are not disabled.
After this, you should be able to see something like this, which is
really helpful for some light debugging:
```
[loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
[cuda_graph.py:117] **CUDAGraph Config Settings:**
[cuda_graph.py:117]
[cuda_graph.py:117] - Mode: FULL_DECODE_ONLY
[cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32]
[cuda_graph.py:117]
[cuda_graph.py:117] **CUDAGraph Stats:**
[cuda_graph.py:117]
[cuda_graph.py:117] | Unpadded Tokens | Padded Tokens | Num Paddings | Runtime Mode | Count |
[cuda_graph.py:117] |-----------------|---------------|--------------|--------------|-------|
[cuda_graph.py:117] | 4 | 4 | 0 | FULL | 18 |
[cuda_graph.py:117] | 5 | 5 | 0 | NONE | 1 |
[cuda_graph.py:117] | 1 | 1 | 0 | FULL | 1 |
[cuda_graph.py:117] | 18 | 18 | 0 | NONE | 1 |
```
### How was this patch tested?
None.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Fix the MTP test failure caused by accessing non-existent attribute
`forward_context.draft_attn_metadatas`.
**Root cause:**
In `AscendAttentionBackendImpl.update_graph_params`, the code
incorrectly accessed `forward_context.draft_attn_metadatas`, but the
`ForwardContext` class doesn't have this attribute. The original code
passed this value via a function parameter.
**Fix:**
Add `draft_attn_metadatas` parameter to the entire call chain:
- `update_full_graph_params` function in `acl_graph.py`
- All `update_graph_params` methods in attention backends
- Pass the parameter correctly in `eagle_proposer.py`
Also applied Gemini's suggestion to make `vllm_config=None` in
`AscendAttentionCPImpl.update_graph_params` for API consistency.
Related to item 9 in #5463
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This fixes the CI test failure:
`test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`
Signed-off-by: lico67373 <918688502@qq.com>
### What this PR does / why we need it?
The structure of the `execute_model` and `_dummy_run` methods in
NPUModelRunner differs greatly from that in GPUModelRunner.
To align with GPUModelRunner:
- Split the `_prepare_inputs` method into `_prepare_inputs`,
`_determine_batch_execution_and_padding`, `_build_attention_metadata`,
and `_preprocess`.
- Rename `_generate_process_reqs_hidden_states` to `_model_forward`.
- Align the implementation of the `postprocess` phase.
**Related-RFC**: https://github.com/vllm-project/vllm-ascend/issues/5449
**Co-authored-by**: @zhenwenqi2024
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
This PR optimizes the torch_npu profiler configuration to significantly
reduce overhead and trace file size. The key changes include:
**Enable Data Simplification**: Explicitly sets data_simplification=True
in _ExperimentalConfig. This filters out unnecessary intermediate data
during profiling, drastically reducing the memory footprint and I/O
overhead.
**Use Lightweight Stack Tracing**: Replaces with_stack with with_modules
when torch_profiler_with_stack is enabled. In torch_npu, with_stack
introduces heavy latency. with_modules provides equivalent semantic
information with much lower overhead.
**Code Simplification:** Removes redundant parameter configurations in
_ExperimentalConfig by utilizing default values, making the codebase
cleaner and easier to maintain.
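A sketch of the resulting profiler setup; the keyword names follow the options referenced above and are assumed to match the installed torch_npu version, and the output path and activity list are illustrative:
```python
import torch_npu

# Sketch only: profiler configuration reflecting the changes described above.
experimental_config = torch_npu.profiler._ExperimentalConfig(
    data_simplification=True,  # drop unnecessary intermediate data
)
prof = torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU,
    ],
    with_modules=True,  # lightweight module-level stacks instead of with_stack
    experimental_config=experimental_config,
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiler_output"),
)
with prof:
    ...  # run the workload to be profiled here
```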
**Test setup:**
max length = 50, profiler + stack enabled
**Before optimization:**
Profiler data size: 651 MB
Generate time: 3 seconds
**After optimization:**
Profiler data size: 156 MB (≈76% reduction)
Generate time: <1 second
### Does this PR introduce _any_ user-facing change?
No API changes. Users profiling on Ascend will experience faster
profiling execution and smaller trace files when stack tracing is
enabled.
### How was this patch tested?
Manually verified on Ascend NPU by running vLLM with the profiler
enabled. Confirmed that trace files are generated correctly containing
necessary stack/module info, while showing the reported reduction in
size and time.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: mengchengTang <745274877@qq.com>
### What this PR does / why we need it?
This PR builds upon PR
https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to
further enhance the npu_graph_ex_passes module. Building on prior work, we
added graph optimization support for the add_rms_quant fused operator in
scenarios where a bias term is present, ensuring the fusion pattern is
correctly registered and matched in the computation graph. This PR performs
the operator fusion of MatmulAllReduceAddRMSNorm and adds corresponding ST
test cases for regression monitoring.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: cjian <2318164299@qq.com>
### What this PR does / why we need it?
This PR removes `update_default_aclgraph_sizes`. In earlier versions, we
added this function to change the default `cudagraph_capture_sizes` because
`_npu_paged_attention` degrades significantly on certain shapes (which are
included in vLLM's default `cudagraph_capture_sizes`). Now that we use FIA
as the default attention op (which does not have this performance
degradation), this default change is no longer needed. Keeping it could
also cause conflicts if a `cudagraph_capture_sizes` smaller than 20 is set.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
d68209402d
---------
Signed-off-by: Angazenn <supperccell@163.com>
This PR fixes the numerical error in moe_load accumulation under ACL
graph mode on NPU: using += for NPU tensors in graph mode does not throw
errors but leads to incorrect values, so we replace it with the in-place
add_() method to ensure accurate calculation.
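For illustration, the change amounts to the following (tensor shapes are made up):
```python
import torch

# Illustrative only: under ACL graph capture, the functional form behind `+=`
# can silently produce wrong accumulated values, so the fix uses in-place add_().
moe_load = torch.zeros(8)      # per-expert load counters (made-up shape)
step_load = torch.ones(8)

# Before: moe_load += step_load   # no error in graph mode, but wrong values
moe_load.add_(step_load)           # after: explicit in-place accumulation
```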
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
- Fixed the computation of the final hidden_states when pipeline parallel
and prefill context parallel are enabled at the same time. hidden_states
are only required, and only have the right tensor type, on the last PP rank.
- Fixed the shape of intermediate_tensors in dummy_run when pipeline
parallel and flashcomm1 are enabled. The intermediate_tensors should be
divided by tp_size; otherwise the MoE layers raise errors.
- Fixed the shape of self.intermediate_tensors to provide sufficient slice
space.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
### What this PR does / why we need it?
The static kernel in torch_npu is normally uninstalled through Python's
atexit mechanism.
However, in vllm-ascend the worker process is terminated when inference
ends or the service stops, so the atexit hooks never run and the static
kernel is not unloaded.
When the npu_graph_ex backend is used with the static kernel enabled, we
register a signal handler to explicitly unload the static kernel.
Since unloading many static kernels usually takes some time, while vLLM
directly kills the process after sending the terminate event, we perform
the unloading in a newly started process.
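A rough sketch of the approach, with a placeholder standing in for the actual torch_npu unload call:
```python
import multiprocessing
import signal


def _unload_static_kernels():
    # Placeholder for the actual torch_npu static-kernel unload routine.
    pass


def _on_terminate(signum, frame):
    # Run the (possibly slow) unload in a separate process so it is not cut
    # short when vLLM kills the worker right after sending the terminate event.
    multiprocessing.Process(target=_unload_static_kernels).start()


signal.signal(signal.SIGTERM, _on_terminate)
```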
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
### What this PR does / why we need it?
Update the 310P support guides, since 310P is now supported in the main branch.
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
To support elastic scaling when using the mooncake connector, we need to
support **configuring different tp sizes for different nodes**.
To this end, we transfer the prefill node information, such as its tp size,
through **the request's kv_transfer_params**.
The decode nodes **get the prefill tp size** from the request's
kv_transfer_params, instead of from the mooncake connector's configuration.
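Illustrative only (the field names are hypothetical): the decode side reads the prefill tp size from the per-request metadata instead of its static connector config:
```python
# Example of what the request-level metadata might carry; the real field
# names in the mooncake connector may differ.
kv_transfer_params = {
    "remote_host": "10.0.0.1",
    "remote_tp_size": 4,  # tp size of the prefill node that produced the KV
}

prefill_tp_size = kv_transfer_params["remote_tp_size"]
```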
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
### What this PR does / why we need it?
After removing codespell a while ago, we discovered that the typos checker
had trouble correctly recognizing certain misspelled words, so I suggested
adding codespell back.
- vLLM version: v0.14.1
- vLLM main:
d68209402d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Align with upstream vLLM. This PR will help downstream vLLM-Omni reduce
the cost of maintaining `_prepare_inputs`. Besides, it makes the
vLLM-Ascend code more readable. In the future, we can follow vLLM more
closely. The `preprocess` logic is now the same as in GPUModelRunner, so
we don't need to maintain it anymore.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
Currently, we pad the last dim of qkv to 128 before flash attention (in
`AscendMMEncoderAttention`) to get better performance on Ascend NPU.
However, the qkv padding is executed serially, which may add overhead from
launching `aclnnConstantPadNd` three times.
Since the three operations are mutually independent, we stack qkv first
and then pad them in one kernel launch. With this optimization, **TTFT**
has been reduced by **3.15%** and **peak throughput** has increased by
**4.20%**.
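A simplified sketch of the change (the shapes and the pad target of 128 are illustrative):
```python
import torch
import torch.nn.functional as F

# Made-up qkv tensors: [seq_len, num_heads, head_dim]
q = torch.randn(1024, 8, 72)
k = torch.randn(1024, 8, 72)
v = torch.randn(1024, 8, 72)

# Before: three independent pads, i.e. three aclnnConstantPadNd launches.
# q = F.pad(q, (0, 128 - q.shape[-1]))  # and likewise for k and v

# After: stack once, pad once, then unbind back into q/k/v views.
qkv = torch.stack([q, k, v])                 # [3, seq, heads, head_dim]
qkv = F.pad(qkv, (0, 128 - qkv.shape[-1]))   # single kernel launch
q, k, v = qkv.unbind(0)
```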
---
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Launch the server:
```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```
Run benchmark:
```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 1000 \
--no-stream
```
Before this PR:
```
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 122.33
Total input tokens: 66638
Total generated tokens: 122845
Request throughput (req/s): 8.17
Output token throughput (tok/s): 1004.18
Peak output token throughput (tok/s): 3073.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 1548.90
---------------Time to First Token----------------
Mean TTFT (ms): 51757.16
Median TTFT (ms): 44853.42
P99 TTFT (ms): 110700.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 226.06
Median TPOT (ms): 206.85
P99 TPOT (ms): 935.31
---------------Inter-token Latency----------------
Mean ITL (ms): 208.82
Median ITL (ms): 96.37
P99 ITL (ms): 2183.13
==================================================
```
After this PR:
```
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 121.47
Total input tokens: 66638
Total generated tokens: 122860
Request throughput (req/s): 8.23
Output token throughput (tok/s): 1011.47
Peak output token throughput (tok/s): 3202.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 1560.08
---------------Time to First Token----------------
Mean TTFT (ms): 50125.08
Median TTFT (ms): 46270.85
P99 TTFT (ms): 108107.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 227.11
Median TPOT (ms): 205.13
P99 TPOT (ms): 816.08
---------------Inter-token Latency----------------
Mean ITL (ms): 204.60
Median ITL (ms): 92.66
P99 ITL (ms): 2219.02
==================================================
```
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Implement `apply_top_k_top_p` via AscendC to eliminate the constraint of
k ∈ [1, 1024]. It enables high-performance TopK-TopP calculation and avoids
the D2H synchronization introduced by k validation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E serving with `k=4096` and `p=0.95`
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This patch adds the `update_max_model_len` interface.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
In vLLM-Omni, the `ModelConfig` can be empty. We need to add a check
before accessing the sub-fields of model_config.
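A minimal sketch of such a guard, with illustrative attribute access rather than the exact field touched in this PR:
```python
def get_num_hidden_layers(vllm_config):
    # Only access sub-fields when a ModelConfig is actually present; vLLM-Omni
    # may supply an empty one.
    model_config = getattr(vllm_config, "model_config", None)
    if model_config is None or getattr(model_config, "hf_config", None) is None:
        return None
    return model_config.hf_config.num_hidden_layers
```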
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Will be checked by CI.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
This reverts commit 8966a99710.
It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`
- vLLM version: v0.14.0
- vLLM main:
d68209402d
### What this PR does / why we need it?
Since CI has integrated Triton, `fuse_qknorm_rope` is enabled by
default.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/distributed/kv_transfer/__init__.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
* Refactor the LayerNorm and activation operator classes to decouple the
310P device implementation from the main branch.
* Refactor `mm_encoder_attention` on 310P to use the
`torch_npu._npu_flash_attention_unpad` operator.
* Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P
so they are no longer padded to 16× alignment.
* Refactor `model_runner` on 310P to align the KV-cache initialization
logic with the mainline implementation.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Using the e2e tests.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
#5790 changes the default operator to addrmsnormBias when custom ops are
enabled. This PR modifies the AddRmsNormQuant pass to align with
addrmsnormBias.
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
**Refactor: Unify full-graph parameter update logic**
This PR consolidates the scattered full-graph parameter update logic
into a unified approach, improving code architecture and eliminating
duplication.
**Key improvements:**
1. **Unified interface**
- Create `update_full_graph_params` as the single entry point for all
full-graph updates
- Replace multiple scattered update calls with one unified function
- Remove ~50 lines of duplicated if-else logic across
`model_runner_v1.py` and `eagle_proposer.py`
2. **Better architecture**
- Move update logic to respective Backend classes
(`AscendAttentionBackend`, `AscendMLABackend`)
- Each Backend manages its own parameter update logic internally
- Simplify caller code to just dispatch to the appropriate Backend (see the
sketch after this list)
3. **Cleaner parameter handling**
- Remove unnecessary `pcp_size` and `dcp_size` parameter passing
- Get parallel configuration directly from distributed groups
- Consistent with how other parts of the codebase obtain these values
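
A rough sketch of the unified entry point described above, with a simplified signature (the real argument lists differ):
```python
def update_full_graph_params(attn_backends, forward_context, runtime_shape,
                             draft_attn_metadatas=None):
    # Single entry point: each backend (e.g. AscendAttentionBackend,
    # AscendMLABackend) owns its own graph-parameter update logic, so callers
    # no longer need per-backend if-else branches.
    for backend in attn_backends.values():
        backend.update_graph_params(
            forward_context,
            runtime_shape,
            draft_attn_metadatas=draft_attn_metadatas,
        )
```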
**Why we need it:**
- **Maintainability**: Future changes only need to be made in one place
per Backend
- **Code quality**: Follows DRY principle and Single Responsibility
Principle
- **Readability**: Cleaner, more intuitive code structure
### Does this PR introduce _any_ user-facing change?
**No.** This is a pure refactoring with no functional changes - same
behavior, cleaner code.
### How was this patch tested?
- All existing unit tests pass with updated mocks
- No new tests needed (pure refactoring, no behavior changes)
- CI validates correctness
---
- vLLM version: v0.13.0
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Remove restrictions on mooncake for IPv6
Dependencies: CANN 8.5, mooncake v0.3.8.post1
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR:
1. Enhances the logic of `_skip_all_reduce_across_dp_group` to skip all
CPU DP allreduces for dense models. This also serves purpose 2.
2. Adds `_skip_all_reduce_across_dp_group` to eagle_proposer. Now models
like Qwen3-235B support eagle3 spec decode. A typical setting for these MoE
models with PD disaggregation often introduces `dp_size > 1`, which requires
`set_forward_context` to perform a CPU DP allreduce to retrieve
`num_tokens_across_dp` in all cases. Skipping this allreduce greatly
improves performance.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Add the `has_connector_metadata` interface to ucm_connector to adapt to the
latest KV connector in vLLM.
### Does this PR introduce _any_ user-facing change?
this PR doesn't introduce _any_ user-facing change.
### How was this patch tested?
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
### What this PR does / why we need it?
If `sp` is enabled and `tp_size` >= 16,
`torch_npu.npu_mm_reduce_scatter_base` raises an exception.
After consulting with the operator developer, we learned that the
operator does not work when `tp` = 16.
So, we disable the operator when `tp` = 16.
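A minimal sketch of the guard (illustrative helper name; the real check lives in the sp path):
```python
def can_use_mm_reduce_scatter(sp_enabled: bool, tp_size: int) -> bool:
    # npu_mm_reduce_scatter_base is reported not to work at tp = 16, so fall
    # back to the unfused matmul + reduce-scatter path in that case.
    return sp_enabled and tp_size < 16
```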
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
We started a server with `sp` enabled and `tp` = 16.
It started successfully.
```text
(APIServer pid=1855938) INFO: Started server process [1855938]
(APIServer pid=1855938) INFO: Waiting for application startup.
(APIServer pid=1855938) INFO: Application startup complete.
```
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR replaces addRmsNorm and Add with addRmsNormBias, which leads to a
more efficient result.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Full Test Pass
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com>
Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>
### What this PR does / why we need it?
Fix a problem in the scenario combining pcp, pd separation, and kv pooling.
In the pooling scenario, multi_nodes_meta_mapping is empty. As a result,
an error is reported when the remote_host information is obtained
through the get_remote_port_send_num method.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Due to a long-term lack of synchronization with the upstream code, a
problem that decreased the acceptance rate of the Qwen3-30B-A3B-EAGLE3
draft model was introduced while fixing bug #5967. This PR synchronizes
with upstream and fixes that bug.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        speculative_config={
            "method": "eagle3",
            "model": "AngelSlim/Qwen3-a3B_eagle3",
            "num_speculative_tokens": 3,
        },
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarkblood@qq.com>