Cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8683
### What this PR does / why we need it?
This PR relaxes the TTFT threshold from `0.4` to `0.5` to improve
robustness under Data Parallel (DP) load imbalance.
#### Background
The current assertion enforces: `prefix75 < prefix0 * 0.4`
#### ❌ Nightly Failure Cases (Observed)
| prefix0 | threshold (0.4x) | prefix75 | delta |
|--------|------------------|----------|--------|
| 4696.24 | 1878.50 | 1883.99 | +5.49 |
| 4696.20 | 1878.48 | 1896.01 | +17.53 |
| 4636.73 | 1854.69 | 1902.48 | +47.79 |
| 4655.17 | 1862.07 | 1913.54 | +51.47 |
| 4685.35 | 1874.14 | 1919.36 | +45.22 |
| 4660.33 | 1864.13 | 1915.41 | +51.28 |
| 4648.30 | 1859.32 | 1950.50 | +91.18 |
| 4655.30 | 1862.12 | 1962.32 | +100.20 |
---
#### ✅ Nightly Passing Cases (Observed)
| prefix0 | threshold (0.4x) | prefix75 | margin |
|--------|------------------|----------|---------|
| 4685.64 | 1874.26 | 1864.46 | -9.80 |
| 5520.28 | 2208.11 | 1928.97 | -279.14 |
| 4639.23 | 1855.69 | 1846.86 | -8.83 |
| 4651.64 | 1860.66 | 1854.30 | -6.36 |
| 4640.39 | 1856.15 | 1840.32 | -15.83 |
| 4677.20 | 1870.88 | 1848.35 | -22.53 |
---
#### Key Observations
- Failures exceed the threshold by only **~5 ms to ~100 ms (~0.3%–5%)**
- Passing cases often have **very tight margins (~5–10 ms)**
- There is clear **overlap between pass and fail boundaries**
- Many failures are **borderline violations**, not real regressions
---
#### Root Cause
The instability is caused by **Data Parallel (DP) load imbalance**,
which introduces systematic variance:
- Uneven request distribution across workers
- Queueing delays
- Increased TTFT variance (especially for `prefix75`)
---
#### Conclusion
- The current threshold (`0.4x`) is **too strict**
- Observed natural fluctuation:
  - Absolute: up to ~100 ms
  - Relative: up to ~5% over threshold
- Pass/fail boundary is currently **too sensitive to runtime jitter**
---
#### Change
We relax the threshold: **0.4 → 0.5**
This adjustment:
- Accounts for expected runtime variance
- Reduces false negatives
- Maintains a meaningful performance constraint
Even with `0.5`, the requirement remains strict (`prefix75 < 50% of
prefix0`) and does not mask real regressions.
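For illustration, here is a minimal sketch of the relaxed check, assuming `prefix0` and `prefix75` are the measured TTFT values in milliseconds from the tables above (the helper name is hypothetical; the actual nightly test may be structured differently):

```python
# Hypothetical helper mirroring the relaxed assertion described above.
TTFT_PREFIX_RATIO = 0.5  # relaxed from 0.4

def check_ttft_threshold(prefix0_ms: float, prefix75_ms: float) -> None:
    threshold_ms = prefix0_ms * TTFT_PREFIX_RATIO
    assert prefix75_ms < threshold_ms, (
        f"prefix75 TTFT {prefix75_ms:.2f} ms exceeds threshold {threshold_ms:.2f} ms "
        f"({TTFT_PREFIX_RATIO:.0%} of prefix0 TTFT {prefix0_ms:.2f} ms)"
    )
```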
---
### Does this PR introduce _any_ user-facing change?
No.
This change only affects internal test assertions and does not impact
user-facing behavior or model performance.
---
### How was this patch tested?
- Verified against existing TTFT test cases:
- Previously failing cases (due to small variance) now pass
- No regressions observed in other scenarios
- Confirmed that failures were due to DP load imbalance rather than
actual performance degradation
- Ensured the updated threshold still enforces a meaningful constraint
on TTFT
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
#### Fixed:
1. The function name in test_moe_init_routing_custom.py is incorrect: it does not start with 'test', so it is not collected as a test case.
2. In the nightly singlecard_ops job, add timestamp printing for test cases, making it easier to quickly locate issues after a timeout occurs.
#### To be fixed:
1. The test_penality.py test case partially fails and takes one hour to run. The owner has been notified to fix the case after the May 1 holiday. ——Yang Cheng
3. The csrc/copy_and_expand_eagle_inputs operator invoked by test_copy_and_expand_eagle_inputs.py supports only 910b. ——HF001
4. The test_causal_conv1d.py test case is incorrect. The Triton operator `causal_conv1d_fn` it invokes uses `get_forward_context`, but the test does not call `set_forward_context` first (which the model normally does). ——Zeng Tian
5. The test_causal_conv1d.py case is incorrect. In this scenario, a UB overflow occurs when the Triton operator is invoked. ——Zeng Tian
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
nightly
Signed-off-by: ZT-AIA <1028681969@qq.com>
### What this PR does / why we need it?
Skip test_copy_and_expand_eagle_inputs for now; it can be restored after the owner completes the follow-up fixes.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
nightly
Signed-off-by: ZT-AIA <1028681969@qq.com>
### What this PR does / why we need it?
To improve the quality of certain docs by revising specific content.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.19.0
- vLLM main: 6f786f2c50
---------
Signed-off-by: Lucky1 <144669645+verylucky01@users.noreply.github.com>
### What this PR does / why we need it?
This PR introduces a caching mechanism for CPU-based `torch.Generator`
objects in the `_random_sample_310p` function to optimize sampling
performance. It includes unit tests for cache persistence and state
recovery. Feedback highlights a critical bug where keying the cache by
batch index instead of generator ID can break RNG reproducibility during
request re-scheduling, and notes a potential memory leak in the global
cache.
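For context only, a rough sketch of a cache keyed by a stable per-request identity rather than the batch index, as the feedback suggests (the helper and key below are hypothetical, not the code in this PR):

```python
import torch

# Hypothetical sketch: cache CPU generators under a stable per-request key so a
# re-scheduled request keeps its RNG state instead of being re-seeded (or picking
# up another request's generator) under a different batch index.
_CPU_GENERATOR_CACHE: dict[int, torch.Generator] = {}

def get_cached_cpu_generator(request_key: int, seed: int) -> torch.Generator:
    gen = _CPU_GENERATOR_CACHE.get(request_key)
    if gen is None:
        gen = torch.Generator(device="cpu")
        gen.manual_seed(seed)
        _CPU_GENERATOR_CACHE[request_key] = gen
    return gen
```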
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested via new unit tests in `tests/ut/_310p/sample/test_sampler_310.py`
verifying cache logic and error handling.
---------
Signed-off-by: csoulnd <daidaicurry@foxmail.com>
### What this PR does / why we need it?
Add a detailed 310 deployment tutorial.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
Update DeepSeekOCR2.md for releases/v0.18.0
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: v0.18.0
vLLM main: bcf2be9612
---------
Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Signed-off-by: Wangbei25 <wangbei41@huawei.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
### What this PR does / why we need it?
The Triton kernels in the sampler encounter some problems; the scenarios are shown below:
1. `expand_kernel`, `rejection_random_sample_kernel`, and `prepare_inputs_padded_kernel` use `tl.load(ptr + offsets - 1, mask)` in their implementations, but the Triton compiler reports that the masks in these scenarios are not static and contiguous. As a result, the compiler may access the memory first and apply the mask afterwards. I therefore changed the code to `tl.load(ptr + tl.maximum(offsets - 1, 0), mask)` to ensure there are no reads at index -1.
2. `sample_recovered_tokens_kernel` and `rejection_random_sample_kernel` use `draft_token_id` as an address offset for the load operation. In the PD-disaggregation scenario, if the pad token is -1, illegal memory reads and writes can occur. I therefore modified these kernels so they handle -1 tokens correctly.
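A minimal standalone sketch of the clamped-offset load pattern from point 1 (the kernel and variable names here are illustrative, not the actual sampler kernels):

```python
import triton
import triton.language as tl

@triton.jit
def shifted_load_kernel(in_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # Lanes at offset 0 have no predecessor, so mask them out along with the tail.
    mask = (offsets > 0) & (offsets < n)
    # Clamp the index so that even masked-off lanes never form the address
    # in_ptr - 1; the compiler may otherwise touch that address before applying
    # the mask when it cannot prove the mask is static and contiguous.
    prev = tl.load(in_ptr + tl.maximum(offsets - 1, 0), mask=mask, other=0)
    tl.store(out_ptr + offsets, prev, mask=mask)
```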
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: ppppeng <zepengliu912@qq.com>
Co-authored-by: zepengliu912@qq.com <root@localhost.localdomain>
### What this PR does / why we need it?
Correct the descriptive errors in the document.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
doc test
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Clean backport of #8582 to v0.18.0: validate PD-mode feature gates (no fused MC2).
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Document the NPU soft partitioning + cudagraph.piecewise limitation in the graph mode user guide.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Fix documentation errors and non-standard descriptions in the releases/v0.18.0 branch.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation check.
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This PR improves the readability of the documentation by fixing typos and correcting command extensions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation changes only.
Signed-off-by: sunshine202600 <sunshine202600@163.com>
### What this PR does / why we need it?
This backports the forced-tool-choice `content=None` guard to the
`releases/v0.18.0` compatibility layer.
Upstream vLLM still has forced named tool-choice branches that assert
`content is not None` after reasoning extraction. Some reasoning parsers
can legally consume the full output and return `(reasoning, None)`,
which makes the assert reachable and can surface as a server-side
failure.
This PR follows the same compatibility-patch pattern used by:
- `7314bbe2` fix(platform): reimplement MiniMax usage accounting patch
(#7835)
- `f83cb0e6` [Bugfix][Platform] Fix GLM47 tool-call finish backfill
(#7710)
The patch is intentionally narrow:
- normalize `content=None` to `""` only for forced named tool choice
- patch both chat-completions and responses parser entry points
- keep the rest of upstream behavior unchanged
Upstream tracking:
- issue: vllm-project/vllm#40147
- PR: vllm-project/vllm#40148
### Does this PR introduce _any_ user-facing change?
Yes.
Forced named tool choice becomes robust when the reasoning parser
returns no post-reasoning content, avoiding an internal assertion
failure and emitting an empty-argument function call instead.
### How was this patch tested?
Unit tests:
```bash
pytest -sv tests/ut/patch/platform/test_patch_tool_choice_none_content.py \
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py \
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py
```
Result: 22 passed.
---------
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
### What this PR does / why we need it?
This PR renames the environment variable VLLM_NIXL_ABORT_REQUEST_TIMEOUT
to VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT to align with the Mooncake
connector naming convention. It also updates the documentation and test
configurations to reflect this change and adjusts the suggested timeout
value in the documentation to 480 seconds for consistency.
### Does this PR introduce _any_ user-facing change?
Yes. The environment variable for configuring the abort request timeout
has been renamed. Users should update their environment settings from
VLLM_NIXL_ABORT_REQUEST_TIMEOUT to VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT.
### How was this patch tested?
The changes were verified by updating the corresponding test
configuration files and ensuring consistency across the documentation.
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
### What this PR does / why we need it?
This PR clarifies the CPU binding documentation for managing the
`irqbalance` service.
The previous wording only mentioned Ubuntu while the command shown is
specific to systemd-based Linux distributions. This update describes the
command as applicable to Ubuntu and other systemd-based distributions,
and adds a note for non-systemd systems to use the distribution-specific
service-management command.
### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update and does not change vLLM or
vllm-ascend runtime behavior.
### How was this patch tested?
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
### What this PR does / why we need it?
This PR introduces stricter Ascend `additional_config.layer_sharding` validation on the 0.18 release branch so it is only accepted on PD-disaggregated P nodes with `kv_role="kv_producer"`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E test
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
### What this PR does / why we need it?
Update the GLM4.7 doc. Fix configuration issues, including `VLLM_ASCEND_ENABLE_FLASHCOMM1`, `VLLM_ASCEND_BALANCE_SCHEDULING`, `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`, etc.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
doc test
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
Remove unused layers assignment in mooncake connector
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
by nightly
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix kv pool CLI flag typo and formatting
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
### What this PR does / why we need it?
Change `--compilation_config` to `--compilation-config`.
Change `--max-model-len 133008` to `--max-model-len 131072` to match the 128k context length.
### Does this PR introduce _any_ user-facing change?
No
Signed-off-by: Yang Yuxi <907276627@qq.com>
### What this PR does / why we need it?
Replace `tl.extract_slice` and `tl.insert_slice` with `extract_slice` and `insert_slice` from `torch_utils`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
---------
Signed-off-by: wangx700 <wangxin700@huawei.com>
### What this PR does / why we need it?
Fix the issue where the Mooncake connector does not handle the MTP layer
KV cache when TP is unbalanced.
backport: #8540
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
by nightly
Signed-off-by: liziyu <liziyu16@huawei.com>
cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8539
### What this PR does / why we need it?
Based on end-to-end testing, three decode-scenario optimization points have been reverted in the dispatch_ffn_combine kernel.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: l00893928 <liuquanlu@huawei.com>
Co-authored-by: l00893928 <liuquanlu@huawei.com>
### What this PR does / why we need it?
This PR updates the `MOONCAKE_TAG` version from `v0.3.8.post1` to
`v0.3.9` across all Dockerfiles.
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fix the issue where a request never returns because a specific NPU on node D has no transmission tasks, in the scenario where DCP is enabled on node D.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
by nightly
Signed-off-by: liziyu <liziyu16@huawei.com>
Backport of #7882 to releases/v0.18.0. Adds aime2025 benchmark test for
DeepSeek-V3.2-W8A8 EP with disaggregated prefill on A3 (4-node, 16 NPUs
per node, accuracy benchmark baseline 66.67%).
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
This PR updates the model deployment tutorial template to include a
requirement for authors to add a comment when code examples contain
version numbers. This ensures that users are prompted to use the version
appropriate for their specific environment.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A (Documentation change)
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
### What this PR does / why we need it?
This PR enables synchronization for the `PIECEWISE` runtime mode in ACL
graph replay. Previously, synchronization was only performed in `FULL`
mode. However, `PIECEWISE` mode also requires this barrier to ensure
that parameter updates are completed before the graph is replayed,
preventing accuracy loss.
The logic is also corrected to skip synchronization specifically for
EAGLE draft models, as intended.
Fixes #
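A minimal sketch of the guard described above (the function and argument names are hypothetical; the actual change lives in the ACL graph replay path):

```python
# Hypothetical illustration: synchronize before replay for both FULL and
# PIECEWISE runtime modes so parameter updates have landed before the captured
# graph runs, but skip the barrier for EAGLE draft models as noted above.
def maybe_sync_before_replay(runtime_mode: str, is_eagle_draft: bool, sync_fn) -> None:
    if is_eagle_draft:
        return
    if runtime_mode in ("FULL", "PIECEWISE"):
        sync_fn()  # e.g. a stream/device synchronize on the NPU
```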
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed.
---------
Signed-off-by: 1zzk <785396250@qq.com>
### What this PR does / why we need it?
Update CI for the GLM-5 configuration on the vllm-ascend releases/v0.18.0 branch: test glm5-w4a8 on the 0.18.0 release.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: yangjiuhua <y00845194@china.huawei.com>
Co-authored-by: yangjiuhua <y00845194@china.huawei.com>
### What this PR does / why we need it?
The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only be enabled on the decoder node in the Prefill-Decode disaggregation scenario.
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Switch Ascend conv3d forward_oot to use forward_native and add a UT.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by CI
---------
Signed-off-by: zouyizhou <zouyizhou@huawei.com>
…(#8405)"
This reverts commit b992b11545.
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: l00893928 <liuquanlu@huawei.com>
Co-authored-by: l00893928 <liuquanlu@huawei.com>
### What this PR does / why we need it?
GDN Attention uses FIA's query_start_loc (padded), which may cause conv1d update errors under high concurrency when dp > 1. This PR makes GDN use its own unpadded query_start_loc.
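For context, a small illustration with made-up numbers of why the two views differ (the exact padded layout is runner-specific):

```python
# Two requests with 3 and 2 query tokens: cumulative boundaries without padding.
query_start_loc_unpadded = [0, 3, 5]
# FIA pads the token count up to a captured graph size (say 8), which can append
# a boundary that does not correspond to a real request; feeding this padded
# view into the conv1d state update is what caused the errors described above.
query_start_loc_padded = [0, 3, 5, 8]
```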
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.18.0
Signed-off-by: Wangbingjie <wangbj1207@126.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: l00893928 <liuquanlu@huawei.com>
Co-authored-by: l00893928 <liuquanlu@huawei.com>
Reverts vllm-project/vllm-ascend#8133
- Reversion of Logic: This pull request reverts the changes introduced
in a previous commit that attempted to handle dimension mismatches
during SP padding.
Signed-off-by: Wangbingjie <wangbj1207@126.com>
### What this PR does / why we need it?
- Enforce recompute scheduler only in PD-disaggregated mode.
- Enforce balance scheduling only in PD-mixed mode.
- Enforce fused MC2 only on PD-disaggregated D-side (kv_consumer).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
1. There is no synchronization between steps. In async scheduling with aclgraph, the CPU may record the event for the current iteration before the previous iteration's graph execution has finished. If the CPU runs far enough ahead, the device will hang on event_wait in iteration i+1 (assuming event_record is executed immediately on the device's update stream).
2. Under ENPU, eagle proposers also need to follow the same ordering: event.record first, then event.wait.
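As a rough sketch of the ordering constraint only, using the CUDA stream/event API as a stand-in for the NPU one (names and structure are illustrative, not the actual fix):

```python
import torch

# Illustrative only: the producing stream records the event after its work is
# enqueued, and the consuming stream issues its wait afterwards, so the wait can
# never refer to an event that has not yet been recorded for that iteration.
producer = torch.cuda.Stream()
consumer = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(producer):
    pass  # ... enqueue the previous iteration's graph replay here ...
done.record(producer)      # record first, once the work is enqueued

consumer.wait_event(done)  # then wait, on the consuming stream
```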
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
---------
Signed-off-by: 1zzk <785396250@qq.com>
### What this PR does / why we need it?
Show known issues for Qwen3.5-397B.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
NA
---------
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858
### What this PR does / why we need it?
This PR fixes a `RuntimeError` (dimension mismatch) that occurs when
Sequence Parallelism (SP) is enabled and the padding added for SP causes
`num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases,
`_pad_query_start_loc_for_fia` adds a dummy request, increasing
`num_reqs_padded`. This mismatch between the actual number of requests
and the padded number of requests leads to errors in downstream token
count computations (e.g., `compute_num_computed_tokens`).
The fix relaxes the restrictive condition `num_tokens_padded == num_tokens_unpadded` when reverting the dummy-request padding if SP is enabled, since SP padding is handled by stripping it after communication and should not be treated as an additional request in the attention metadata.
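A minimal sketch of the relaxed condition (the helper below is hypothetical; the actual change touches the attention-metadata build path):

```python
# Hypothetical illustration: the dummy request added by
# _pad_query_start_loc_for_fia is reverted even when SP padding makes the padded
# and unpadded token counts differ, because SP padding is stripped after
# communication rather than being treated as an extra request.
def should_revert_dummy_request(num_tokens_padded: int,
                                num_tokens_unpadded: int,
                                sp_enabled: bool) -> bool:
    if sp_enabled:
        return True
    return num_tokens_padded == num_tokens_unpadded
```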
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
vLLM version: v0.18.0
vLLM-Ascend version: releases/v0.18.0
Signed-off-by: Wangbj127 <wangbj1207@126.com>
### What this PR does / why we need it?
PR #8220 in v0.18.0
In a previous PR (#7843), the o_proj layer of GLM-5 was reverted to TP
(Tensor Parallel) splitting when flashcomm1 was enabled. However, this
was a temporary workaround and did not address the root cause of the
precision issues observed in the o_proj layer under flashcomm1.
I am working on a definitive fix for this issue. Currently, a clear bug
has been identified in
880e20fdde/vllm_ascend/quantization/methods/w8a8_static.py (L124):
during quantized matrix multiplication, quant_bias is not added if
tp_rank > 0. In the flashcomm1 scenario, all ranks actually require the
addition of quant_bias, meaning tp_rank=0 should be passed to ensure the
bias is applied correctly.
This PR aims to resolve this logic error and fix the underlying
precision issue.
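A minimal sketch of the behavior being fixed (hypothetical signature; the real code is the W8A8 static quantized matmul in `w8a8_static.py`):

```python
from typing import Optional

import torch

# Hypothetical illustration of the rank-gated bias described above: today the
# bias is only added when tp_rank == 0, but under flashcomm1 every rank needs
# it, so the call site should pass tp_rank=0 (or the gate should account for
# flashcomm1) so the bias is always applied.
def quant_matmul(x: torch.Tensor, weight: torch.Tensor,
                 quant_bias: Optional[torch.Tensor], tp_rank: int) -> torch.Tensor:
    out = x @ weight
    if tp_rank == 0 and quant_bias is not None:
        out = out + quant_bias
    return out
```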
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
glm5 e2e test
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: triomino <15924998+triomino@users.noreply.github.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
### What this PR does / why we need it?
This PR is partially cherry-picked from #8172.
This PR aims to fix mismatched capture sizes after rounding operations when using SP or speculative decoding. The reason is that the original `self.cudagraph_capture_sizes` is no longer updated and remains at its initial values. Now we use `self.cudagraph_dispatcher.get_capture_descs` to get the up-to-date sizes.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR updates the documentation to replace specific hardware terms
(e.g., HBM, 910B, 310P) with more generic or branded terms (e.g.,
on-chip memory, Atlas inference products) to comply with sensitive word
requirements.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
### What this PR does / why we need it?
## Problem
In PD-disaggregated serving with `mooncake_connector` and
`VLLM_ASCEND_BALANCE_SCHEDULING=1`, requests may enter
`WAITING_FOR_REMOTE_KVS` and never be promoted back to runnable state
after remote KV transfer finishes.
The issue is in `BalanceScheduler`'s handling of
`WAITING_FOR_REMOTE_KVS` requests. The current code treats
`_update_waiting_for_remote_kv()` as if it returns a boolean readiness
flag:
```python
is_ready = self._update_waiting_for_remote_kv(request)
if is_ready:
...
else:
...
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>