### What this PR does / why we need it?
This PR adds 2 jobs to a3 nightly test, which contains 4 test cases, we
need test them nightly
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
dcp pcp support full aclgraph, including mla attention_v1
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.
This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
vllm requires opencv-python-headless >= 4.11.0 which requires
(numpy<2.3.0,>=2), but vllm-ascend numpy version must be less than
2.0.0, so limit opencv-python-headless less than 4.11.0.86 will fix this
conflict.
### How was this patch tested?
tested by CI
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Fix Qwen3NextGatedDeltaNet, caused by
https://github.com/vllm-project/vllm/pull/26437
### How was this patch tested?
```
def main():
prompts = [
"窗前明月光,",
"The president of the United States is Mr.",
"The capital of France is",
"The future of AI is",
"感时花溅泪,",
"家书抵万金啥意思?",
"plz tell me a story: ",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
# Create an LLM.
llm = LLM(
model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,
trust_remote_code=True,
max_model_len=256,
gpu_memory_utilization=0.7,
block_size=64
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
This PR adds a qwq case for nightly test for qwen-qwq on A3 ,we need
test them daily
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: ckhw <cuikai1@huawei.com>
### What this PR does / why we need it?
Remove codes of dbo.
Currently, vLLM has supported dbo with pr:
https://github.com/vllm-project/vllm/pull/23693.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds a prefix cache case for nightly test for
DeepSeek-r1-0528-W8A8 on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.
This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Rename common_fused_moe.py to fused_moe.py.
2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize
to prepare_finalize.py / PrepareAndFinalize.
3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe.
4. Move vllm_ascend/ops/fused_moe.py to
vllm_ascend/ops/fused_moe/fused_moe.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
This PR comments out newly added vlm e2e test of ascend scheduler
scenario because I found that when running in multi-batch this will
stuck. Need to add this back after dealing with this issue.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
We optimized the _prepare_input method in eagle_proposer and no longer
use the _prepare_eagle_input_sequential method, improving the
performance of eagle-3.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```
python3 -m vllm.entrypoints.openai.api_server
--host 0.0.0.0
--port 13963
--dtype bfloat16
--model meta-llama/Llama-3.1-8B-Instruct
--served-model-name Llama-3.1-8B-Instruct
--tensor-parallel-size 1
--gpu-memory-utilization 0.85
--max-model-len 32768
--trust-remote-code
--seed 42
--no-enable-prefix-caching
--speculative_config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":2,"draft_tensor_parallel_size":1}'
```
Co-authored-by: QilaiZhang (245706640@qq.com )
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: lio <1983142975@qq.com>
### What this PR does / why we need it?
Since Attention and LinearAttention share the same ```slot_mapping```,
and the ```slot_mapping``` for LinearAttention is all zeros, the
```slot_mapping``` for Attention gets overwritten, resulting in the
computed output being all zeros.
This PR removes the uniformly managed ```self.slot_mapping``` and
directly passes the ```slot_mapping``` from ```input_batch.blocktable```
to ```attn_metadata```, along with modifying the relevant references.
Due to hardware, the data type of ```block_table.slot_mapping``` needs
to be set to int32.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: QilaiZhang <245706640@qq.com>
This PR fix the bug related with running multi-modal models with
AscendScheduler. This bug was introduced by PR #2372 by using the same
parameter names as vLLM with different default values.
Currently I fix this bug by changing the default values of these two
parameters to align with vLLM.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Enable the unit tests that #3612 skipped.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Unit tests.
- vLLM main:
17c540a993
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
This pull request mainly do the following things:
1. Add a doc for multi-node CI, The main content is the mechanism
principle and how to contribute
2. Simplify the config yaml for more developer-friendly
3. Optimized the mooncake installation script to prevent accidental
failures during installation
4. Fix the workflow to ensure the kubernetes can be apply correctly
5. Add Qwen3-235B-W8A8 disaggregated_prefill test
6. Add GLM-4.5 multi dp test
7. Add 2p1d 4nodes disaggregated_prefill test
8. Refactor nightly tests
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Modify the recalculation logic to prevent waiting requests from filling
up the D node KVCache
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR refactors the Ascend attention implementation to align with
vLLM's core interfaces, simplifying the code and improving
maintainability.
### Key Changes:
* **Align with vLLM's Attention Interface**: The `forward` method
signature in `AscendAttentionBackendImpl` now matches the base
`AttentionImpl` in vLLM, removing the custom `trace_flag`.
* **Enable Opaque Attention Operator**: By adding `opaque_attention_op`
to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its
standard `vllm.unified_attention_with_output` operator. This avoids the
need for a custom call path.
* **Remove Obsolete Code**:
* The custom op `vllm.unified_ascend_attention_with_output` has been
deleted as it is now redundant.
* The `trace_flag` and its associated logic were removed, reducing code
complexity.
* An outdated quantization branch within the attention implementation
was cleaned up.
* **Improve Readability**: Renamed output variables (`output` vs.
`intermediate_output`) and added comments to clarify the in-place nature
of the attention output.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
No extra tests needed.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR boosts performance by introducing a fused kernel for the matrix
matmul and reduce scatter operations. It supports both unquantized
(e.g., BFloat16) and W8A8 quantized models.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: ZYang6263 <zy626375@gmail.com>
### What this PR does / why we need it?
In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the
first GPU card. When keys on other cards are released, the query result
still returns as successful, introducing accuracy issues. This PR
modifies the KV pool's query logic to check all cards, resolving this
problem.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
In certain scenarios, the performance of synchronously loading data from
the pool is better than that of asynchronously loading data. Therefore,
a control logic (or switch) for asynchronous loading from the pool has
been added.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
fix bug: In the mooncake pooling scenario, when the client closes the
request, the server fails to locate the request, leading to the server
hanging.oling scenario, when the client closes the request, the server
fails to locate the request, leading to the server hanging.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Pull up the PD separated pooling service, send requests using aisbench,
press CTRL+C twice, and check if the vllm_ascend service exit.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: linhebiwen <linhebiwen@gmail.com>
### What this PR does / why we need it?
Check all expert maps when using muilty instance.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Qwen 235B in double A3.
case1:master has expert map, slave has not expert map.
case2: master has expert map, slave has error expert map.
case3: master has expert map,slave has correct expert map.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
This PR adds the initial prefix cache case for nightly test for
Qwen3-32b-int8 on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
`vanilla_chunked_prefill_mla` and `vanilla_decode_mla` is unused, so
remove it.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This 'test_rejection_sampler' unit test is something wrong.
> def test_sample_recovered_tokens_pytorch_autoregressive(self):
> output_token_ids = torch.empty(2, dtype=torch.int32)
> cu_num_draft_tokens = torch.tensor([1, 1])
> draft_token_ids = torch.tensor([0, 1])
len(draft_token_ids ) = 2, cu_num_draft_tokens should be
torch.tensor([1, 2]) or torch.tensor([2, 2])
I fix it and set cu_num_draft_tokens = torch.tensor([1, 2]). The methods
before and after optimization can pass.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
NA
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: lio <1983142975@qq.com>
This PR moves the communication operation of shared experts out of extra
stream because I found that this might cause rtMemcpy related errors
when running shared experts multistream with aclgraph.
Furthermore, I utilize a global variable as extra stream object to avoid
allocating streams for each layer in full-graph mode.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
On Arm systems, os.sched_yield() does not take effect, causing the GIL
(Global Interpreter Lock) to remain unrelinquished and resulting in CPU
bound issues. This PR applies a patch to sched_yield in vLLM, making the
process execute time.sleep(0) instead to release the GIL.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
The #3624 PR fix the precision of deepseek torchair, but don't consider
the limitation of torch compile which results in the recompile, This PR
fixs this problem
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: hust17yixuan <303660421@qq.com>
Resolves a `TypeError: got an unexpected keyword argument 'layer_type'`.
A recent change (PR #3311) started passing the `layer_type` argument
when calling `get_pergroup_param()`. This specific implementation does
not use this parameter, causing the error.
This patch adds `layer_type=None` to the method signature to maintain
API compatibility and ignore the unused argument.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This PR add a new Qwen3-32b-int8 test case for nightly test. This test
case mainly test the performance and accuracy of Qwen3-32b-int8 with a
new feature.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
### What this PR does / why we need it?
Update the Pangu Pro MoE tutorials.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Fix mooncake connector. In scenarios where TP is not equal, when the
prefill TP size is less than the number of key-value heads,
_get_remote_tp_ranks_for_req will return a list of np.arrays. Performing
an operation like int in list of np.arrays will cause an error.
Converting the list of np.arrays into a single np.array resolves this
issue.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
qwen235B
P tp16, D tp1
P tp8, D tp1
P tp4, D tp1
P tp8, D tp2
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
This PR adds the initial multi modal model for nightly test, including 3
cases for Qwen2.5-vl-7b acc/perf test on A3, we need test them daily. It
also inclues 8 cases for deepseek-r1-0528-w8a8 func, acc and perf tests
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>